April 26, 2023
By ZK_TUTORIALS
Loan decisions with neural networks using Leo

The complete source code for this article can be found on GitHub here.

Introduction

In our previous articles, we delved into fixed-point numbers and neural networks in Leo. Now, we will explore the potential applications of neural networks in zk-SNARKs, with a specific focus on loan decisions. These decisions have become increasingly significant in DeFi applications as they are crucial in providing cheap loans to borrowers while maximizing returns for lenders.

To begin with, we will examine the German credit dataset, a commonly used machine learning credit dataset. Using PyTorch and Python, we will train a neural network on this dataset to predict whether a borrower will pay the loan back or default based on various parameters such as their employment status and purpose of credit.

After successfully training the neural network, we will assess its effectiveness and utility. Once satisfied with the results, we will transfer the neural network to Leo and evaluate its performance in this environment.

By integrating neural networks into zk-SNARKs, we can improve the accuracy of loan decisions, resulting in more efficient and profitable DeFi applications. You can find the code we use in this article in the same GitHub repository as the code for our last article in the two application folders.

The German credit dataset

The German credit dataset is a well-known dataset in the field of machine learning that was first published in 1994. It is publicly available for download and consists of 1000 instances, meaning 1000 cases of data where a credit decision was made. Each instance contains data about the credit circumstances and whether or not the credit defaulted in that particular case. Specifically, each instance contains 20 attributes, such as the duration of the credit, the employment status of the applicant, the purpose of the credit, and the credit history of the applicant. In addition, each instance in the dataset is labeled with one bit of information indicating whether the credit ultimately defaulted or not.

Training the neural network using PyTorch and Python

To create the neural network, we use the popular deep learning library PyTorch and the Python programming language. We use a multilayer perceptron (MLP) feedforward neural network architecture, which we also used and explained in more depth in the last article. For the 20 input features, we create 20 input neurons in the first layer. We also need 2 output neurons in the third layer for the 2 possible output classes, loan payback or default. A rule of thumb for MLP network architecture design is to have one hidden layer with the average number of neurons of the input and output layer, which here is (20 + 2) / 2 = 11. The training code below uses a slightly smaller hidden (second) layer of 10 neurons.
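As a quick sanity check of the layer sizes described above, the following sketch computes the rule-of-thumb hidden size and the number of trainable parameters of the 20-10-2 network. The helper name linear_params is our own and is not part of the article's scripts.

```python
# Layer sizes from the text: 20 input features, 2 output classes.
n_inputs, n_outputs = 20, 2

# Rule of thumb: hidden size = average of input and output layer sizes.
rule_of_thumb_hidden = (n_inputs + n_outputs) // 2

# The training script in this article instantiates MLP(20, 10, 2).
n_hidden = 10


def linear_params(fan_in: int, fan_out: int) -> int:
    """Trainable values in a fully connected layer: weights plus biases."""
    return fan_in * fan_out + fan_out


total_params = linear_params(n_inputs, n_hidden) + linear_params(n_hidden, n_outputs)

print(rule_of_thumb_hidden)  # 11
print(total_params)          # 232
```

Each of these 232 values later has to be written into the Leo input file as a fixed-point integer.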

The following code first loads the dataset and then creates a subset, referred to as training dataset. Furthermore, it creates the MLP neural network architecture. Then, it normalizes the training data and trains the neural network. It then stores the trained neural network and the normalization parameters on the hard drive. Since the computational power of a CPU is sufficient here, we train using the CPU. For larger networks and datasets, a GPU may be necessary.

import pickle

import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# load the german.data-numeric data set (whitespace-separated, no header row);
# sep=r'\s+' replaces the deprecated delim_whitespace=True
data = pd.read_csv('german.data-numeric', sep=r'\s+', header=None, on_bad_lines='skip')


# define the neural network
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# the first 20 columns are the features; the last column is the label (1 or 2),
# shifted here to 0 (paid back) or 1 (default)
X = data.iloc[:, 0:20]
y = data.iloc[:, -1] - 1

# split training and testing data
x_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# normalize the data using statistics of the training set only
x_train_mean = x_train.mean()
x_train_std = x_train.std()
x_train = (x_train - x_train_mean) / x_train_std

# convert pandas dataframes to tensors
x_train = torch.tensor(x_train.values, dtype=torch.float32)
y_train = torch.tensor(y_train.values, dtype=torch.long)

# combine the data into a dataset and dataloader
dataset = TensorDataset(x_train, y_train)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = MLP(20, 10, 2)

# define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# train the model
for epoch in range(100):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

# save model
torch.save(model.state_dict(), 'model.pt')

# save the mean and standard deviation using pickle
with open('mean_std.pkl', 'wb') as f:
    pickle.dump((x_train_mean, x_train_std), f)
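The normalization step in the training code can be sketched in isolation. The example below uses only the standard library and a made-up feature column (hypothetical credit durations in months); the function names are ours. It mirrors the key point of the script: the mean and standard deviation are computed on the training data only and then reused unchanged for any later data.

```python
from statistics import mean, stdev


def fit_normalizer(column):
    """Return (mean, sample standard deviation) of one feature column."""
    return mean(column), stdev(column)


def normalize(column, mu, sigma):
    """Apply z-score normalization with previously fitted statistics."""
    return [(v - mu) / sigma for v in column]


# Hypothetical values, not taken from the German credit dataset.
train_duration = [6.0, 12.0, 24.0, 48.0]
test_duration = [18.0, 36.0]

# Fit on the training column only, then reuse for the test column.
mu, sigma = fit_normalizer(train_duration)
train_norm = normalize(train_duration, mu, sigma)
test_norm = normalize(test_duration, mu, sigma)

print(mu)  # 22.5
```

Like pandas' std(), statistics.stdev computes the sample standard deviation, so the normalized training column has mean 0 and standard deviation 1, while the test column generally does not.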

Evaluating the trained neural network

We create a separate file to evaluate the trained neural network. The core idea is to evaluate it on a subset of the dataset that was not used for training, which we refer to as the testing dataset. We load the stored neural network and normalization parameters and then evaluate the neural network on this subset. For the evaluation, we use the area under the receiver operating characteristic (AUROC) performance metric. This metric is especially suited for imbalanced datasets such as ours, where classification accuracy can look high for a model that simply predicts the majority class. A random classifier achieves an AUROC of 0.5, and a perfect classifier achieves the maximum value of 1. The following code conducts these steps and computes the AUROC metric.

import pickle

import pandas as pd
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

# load the data set (whitespace-separated, no header row)
data = pd.read_csv('german.data-numeric', sep=r'\s+', header=None, on_bad_lines='skip')


# define the neural network (must match the architecture used for training)
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


X = data.iloc[:, 0:20]
y = data.iloc[:, -1] - 1

# split training and testing data (same seed as in training, so the split is identical)
_, x_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# open the pickle file to load the train mean and std
with open('mean_std.pkl', 'rb') as f:
    x_train_mean, x_train_std = pickle.load(f)
x_test = (x_test - x_train_mean) / x_train_std

# load the model
model = MLP(20, 10, 2)
model.load_state_dict(torch.load('model.pt'))

# test the model
x_test = torch.tensor(x_test.values, dtype=torch.float)
y_test = torch.tensor(y_test.values, dtype=torch.float)
test_data = TensorDataset(x_test, y_test)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)

with torch.no_grad():
    running_predicted_tensor = torch.tensor([])
    for inputs, labels in test_loader:
        outputs = model(inputs)
        # hard class prediction: index of the larger output neuron
        _, predicted = torch.max(outputs.data, 1)
        running_predicted_tensor = torch.cat((running_predicted_tensor, predicted), 0)

# note: the AUROC is computed here from the hard 0/1 predictions
auc = roc_auc_score(y_test, running_predicted_tensor)
print('AUROC: {}'.format(auc))

Result: AUROC: 0.6897765905779504

We obtain an AUROC of roughly 0.69, well above the 0.5 that random guessing would achieve. While the value could be improved further, that has already been explored in other work. The result is good enough for our purposes, so we now focus on transferring the neural network to Leo.
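To make the metric more concrete, here is a minimal pure-Python sketch of AUROC as the fraction of (positive, negative) pairs that a classifier ranks correctly, with ties counting one half. The function auroc and the toy labels and scores are our own illustration, not part of the article's scripts. The sketch also shows why it can matter whether continuous scores or hard 0/1 predictions are passed in: the evaluation script above passes the hard predictions.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = default, label 0 = paid back.
labels = [0, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.8]

auc_scores = auroc(labels, scores)

# Thresholding at 0.5 discards the ranking among instances on the same side.
hard_predictions = [1 if s >= 0.5 else 0 for s in scores]
auc_hard = auroc(labels, hard_predictions)

print(auc_scores)  # 1.0
print(auc_hard)    # 0.75
```

In this toy case the scores rank every defaulting instance above every non-defaulting one, but the thresholded predictions lose part of that ordering, so the value computed from them is lower.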

Transferring the neural network to Leo

To transfer the neural network from a PyTorch model to Leo, we can use the software we developed in the last article - a Python program that automatically generates the code for a neural network architecture, given the desired number of neurons per layer as the input. The Python code generates a main.leo file containing the Leo circuit code and an input.in file containing the input parameters. Our parameters are 20 input neurons, 10 hidden neurons, and 2 output neurons. For Leo, we use i16 variables, which can take both positive and negative values; one of the 16 bits is reserved for the sign. Since the input values are normalized and their absolute values mostly lie between 0 and 1, we need high precision in the fractional part when representing numbers and computing with them. We thus use a scaling factor of 2^7, meaning we reserve 7 bits for the fractional part, and the remaining 8 bits are for the integer part. The main.leo code can be found here.
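The fixed-point format described above can be sketched in plain Python. This is an illustration under our own assumptions, not the generated Leo code: the helper names are hypothetical, Python integers are unbounded (so overflow of intermediate products is not modeled), and the rescaling after a multiplication is done with a right shift, which may round differently from the division used in the circuit.

```python
SCALE_BITS = 7
SCALE = 1 << SCALE_BITS                      # 2^7 = 128
I16_MIN, I16_MAX = -(1 << 15), (1 << 15) - 1


def to_fixed(x: float) -> int:
    """Encode a real number as a scaled integer, truncating toward zero."""
    q = int(x * SCALE)
    if not I16_MIN <= q <= I16_MAX:
        raise OverflowError(f"{x} does not fit into an i16 with scale {SCALE}")
    return q


def to_float(q: int) -> float:
    """Decode a scaled integer back into a real number."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The raw product carries a scale of 2^14, so it is shifted back by 7 bits.
    Python's >> rounds toward negative infinity for negative values.
    """
    return (a * b) >> SCALE_BITS


def fixed_neuron(weights, inputs, bias):
    """One neuron: weighted sum plus bias, followed by ReLU, all in fixed point."""
    acc = sum(fixed_mul(w, x) for w, x in zip(weights, inputs)) + bias
    return max(acc, 0)


# Hypothetical weights, inputs, and bias for a two-input neuron.
weights = [to_fixed(0.5), to_fixed(-1.25)]   # [64, -160]
inputs = [to_fixed(1.0), to_fixed(0.5)]      # [128, 64]
bias = to_fixed(0.25)                        # 32

out = fixed_neuron(weights, inputs, bias)
print(out, to_float(out))  # 16 0.125
```

With this format the smallest representable step is 1/128 = 0.0078125, and a single i16 value covers the range from -256 to just under 256.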

We then need to create the input parameters file based on the actual parameters from the Python file - meaning we provide the neural network parameters (weights and biases) and the data instance attributes in fixed-point format. In this specific case, we use the first data instance from the testing dataset. For the extraction, we use the following Python code:

import math
import pickle

import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset

SCALE = 2**7      # fixed-point scaling factor described in the text
LEO_TYPE = "i16"  # assumption: must match the integer type used in main.leo (the text states i16)

# load the data set (whitespace-separated, no header row)
data = pd.read_csv('german.data-numeric', sep=r'\s+', header=None, on_bad_lines='skip')


# define the neural network (must match the architecture used for training)
class MLP(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x


X = data.iloc[:, 0:20]
y = data.iloc[:, -1] - 1

# split training and testing data
_, x_test, _, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# open the pickle file to load the train mean and std
with open('mean_std.pkl', 'rb') as f:
    x_train_mean, x_train_std = pickle.load(f)
x_test = (x_test - x_train_mean) / x_train_std

# load the model
model = MLP(20, 10, 2)
model.load_state_dict(torch.load('model.pt'))
state = model.state_dict()

for key in state.keys():
    print(key)
    print(state[key])

# write the weights and biases as fixed-point integers
str_list_inputs = ["[main]\n"]
for i, key in enumerate(state.keys()):
    if i == 0:
        first_layer = state[key]
    layer_name = ""
    if 'weight' in key:
        layer_name = "w"
    if 'bias' in key:
        layer_name = "b"
    layer_index = str(math.floor(i / 2) + 1)
    for j, val in enumerate(state[key]):
        if len(val.shape) == 1:
            # row j of a weight matrix: one entry per incoming connection k
            for k, val2 in enumerate(val):
                val_fixed_point = int(val2 * SCALE)
                variable_line = layer_name + layer_index + str(k) + str(j) + ": " + LEO_TYPE + " = " + str(val_fixed_point) + ";\n"
                str_list_inputs.append(variable_line)
        else:
            # entry j of a bias vector
            val_fixed_point = int(val * SCALE)
            variable_line = layer_name + layer_index + str(j) + ": " + LEO_TYPE + " = " + str(val_fixed_point) + ";\n"
            str_list_inputs.append(variable_line)

str_list_inputs.append("\n")

# append the attributes of the first test instance as fixed-point integers
x_test = torch.tensor(x_test.values, dtype=torch.float)
y_test = torch.tensor(y_test.values, dtype=torch.float)
test_data = TensorDataset(x_test, y_test)

for i in range(len(first_layer[0])):
    value = test_data[0][0][i]
    val_fixed_point = int(value * SCALE)
    str_list_inputs.append("input" + str(i) + ": " + LEO_TYPE + " = " + str(val_fixed_point) + ";\n")

str_list_inputs.append("\n")
str_list_inputs.append("[registers]")
str_list_inputs.append("\n")
str_list_inputs.append("r0: [" + LEO_TYPE + "; 2] = [0, 0];")

with open("project.in", "w+") as file:
    file.writelines(str_list_inputs)

We now run the code and obtain this input file.

When evaluating the data instance in the PyTorch neural network, we obtain the following output vector from the neural network:

tensor([ 1.8403, -2.2161])

The fact that the first value is much higher than the second one indicates that the data instance belongs to class 0 (loan will be paid back), which it indeed does when looking at the label of the data instance in the dataset.
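As a small illustration of how this output vector can be read, the sketch below picks the larger of the two outputs and also converts them into probabilities with a softmax. The article's scripts do not apply a softmax explicitly (the cross-entropy loss applies it internally during training), so the probabilities are only our own interpretation aid.

```python
import math


def softmax(logits):
    """Convert raw network outputs into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Output vector reported above for the first test instance.
logits = [1.8403, -2.2161]

predicted_class = max(range(len(logits)), key=lambda i: logits[i])
probs = softmax(logits)

print(predicted_class)  # 0, i.e. the loan is predicted to be paid back
print(probs)
```

The gap of about 4 between the two outputs corresponds to a probability of more than 98% for class 0, which is why small fixed-point rounding errors are unlikely to flip the decision for this instance.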

Evaluating the Leo neural network

When running the Leo circuit with the “leo run” command, we obtain the following output vector in the circuit:

222, -272

To interpret the numbers in a decimal system, we need to divide these by the scaling factor of 128. Thus, we obtain the following decimal result:

1.73 -2.125

This is very close to the floating point computation from above in Python: both outputs deviate by roughly 0.1 from the PyTorch values of 1.8403 and -2.2161. The decision based on the result - granting the loan, since the first output neuron is much higher than the second one - therefore still holds. Our choice of a high scaling factor thus proved successful in obtaining an accurately working fixed-point neural network.
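The comparison above can be reproduced with a few lines of Python. The numbers are the ones reported in this article; only the variable names are ours.

```python
SCALE = 2 ** 7  # scaling factor of 128

leo_output = [222, -272]            # raw integers from the Leo circuit
pytorch_output = [1.8403, -2.2161]  # floating point reference

# Divide by the scaling factor to get back to decimal numbers.
decoded = [v / SCALE for v in leo_output]
abs_errors = [abs(d - p) for d, p in zip(decoded, pytorch_output)]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


same_decision = argmax(decoded) == argmax(pytorch_output)

print(decoded)        # [1.734375, -2.125]
print(same_decision)  # True
```

Both absolute errors stay below 0.11, which is small compared to the gap of about 4 between the two output neurons.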

We now further analyze the output from the “leo run” command:

Build Starting...
Build Compiling main program... ("/home/user/Aleo Studio/project/src/main.leo")
Build Number of constraints - 2357872
Build Complete
Done Finished in 11838 milliseconds
Setup Starting...
Setup Saving proving key ("/home/user/Aleo Studio/project/outputs/project.lpk")
Setup Complete
Setup Saving verification key ("/home/user/Aleo Studio/project/outputs/project.lvk")
Setup Complete
Done Finished in 117657 milliseconds
Proving Starting...
Proving Saving proof... ("/home/user/Aleo Studio/project/outputs/project.proof")
Done Finished in 78358 milliseconds
Verifying Starting...
Verifying Proof is valid
Done Finished in 11 milliseconds

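The Leo circuit yields about 2.36M constraints. In this run, the build took about 12 seconds, the setup about 118 seconds, proving about 78 seconds, and verification 11 milliseconds. This suggests that it is feasible today to run neural networks of this size, or even more complex ones, in zk-SNARKs using Leo.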

Conclusion

We were able to run an MLP neural network in fixed-point numbers using the Leo programming language, where the accuracy of the computation is high even for critical applications such as loan decisions. The computational overhead is very reasonable with contemporary hardware, suggesting that the technology is ready for applications in practice. We demonstrated such a use case using a machine learning credit dataset. This can be useful for bringing AI logic on-chain in smart contracts, while the zero-knowledge aspects can hide personal data and proprietary machine learning models. It will be fascinating to see how applications evolve over time. You can find all of the code in this GitHub repository.
