NEURAL NETWORK FROM SCRATCH WITH NUMPY AND PYTHON

Implementing deep learning algorithms is easier now than ever, since high-level libraries like TensorFlow, PyTorch and Keras have matured considerably and become very user friendly. But this convenience comes with the disadvantage of treating the network as a black box, and understanding the basic functioning of a neural network is critical when deciding on things like the type of data, the type of architecture and so on. If we want to know what happens to our data and how we end up with a prediction label, we need to dig deeper into the fundamental building blocks of the network.

Taking this opportunity, we will code a classifier from scratch using only Python and low-level libraries like NumPy and math, and learn the elements of a neural network:

  1. forward propagation
  2. backward propagation
  3. activation function
  4. cost function

We will look into these individual elements and try to implement them without using any high-level libraries. But first, let's go through the basics, starting with what the 'neuron' part in a neural network is and what it does!

A simple neuron

Imagine a neuron as a simple mathematical function that takes an input x, applies f(x) and produces an output y. But what is that f(x)? Simply put, it is the dot product of the input x and the weights w, which is then summed with a bias b. Weights, biases, dot product??? We will look at a sample Python code first.

In [1]:
inputs = [4,5,6]

weights = [1,2,3]
bias = 1

output = inputs[0]*weights[0] + inputs[1]*weights[1] + inputs[2]*weights[2] + bias
print(output)
33

The dot product is simply an element-wise multiplication followed by a sum. What are weights and biases anyway? Imagine you want to create a function that takes in a vector and outputs another vector. You know both of these vectors, but you don't know the function. The weights and bias are the randomly initialized elements that we multiply and add so that we end up with the output vector. We run the function through many iterations until we find a perfect or near-perfect combination of weights and bias for that particular input vector. Weights and biases are the trainable parameters of a neural network. This is an oversimplification; you might ask whether we could solve the problem with just a bias alone or a weight alone, which is true, but as we move further we will see why it is important to have both. A toy sketch of this trial-and-error idea follows below.
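To make that idea concrete, here is a toy sketch (the update rule, learning rate and iteration count are arbitrary choices for illustration, not the procedure we build later) that nudges a single weight and bias until w*x + b matches a known target:

import numpy as np

x, target = 4.0, 21.0          # known input and known desired output
w, b = np.random.randn(), 0.0  # randomly initialized trainable parameters
lr = 0.01                      # how big a nudge we take each iteration

for _ in range(1000):
    y = w * x + b              # our "function": weight times input plus bias
    error = y - target         # how far off we currently are
    w -= lr * error * x        # nudge the weight against the error
    b -= lr * error            # nudge the bias against the error

print(w * x + b)               # after many iterations this is very close to 21.0

Every iteration shrinks the error a little, which is all 'training' really means; the interesting question for a full network is how to compute those nudges for thousands of weights at once.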

Now when it comes to deep learning, inputs are generally in matrix form, which is called a tensor. A tensor is an object that can be represented as an n-dimensional array. We will do the same operation as above but with >1D data, and to make things easier we will use Python's numerical computing library NumPy.

In [2]:
import numpy as np

inputs = [1,2,3]

weights =  np.array([[0.1,0.2,0.3],
                    [0.5,0.6,0.7],
                    [1.0,1.1,1.2]])

biases = [2,3,0.5]

outputs = np.dot(weights,inputs) + biases
print(outputs)
[3.4 6.8 7.3]

So the first output value here comes from $$ (1*0.1 + 2*0.2 + 3*0.3) + 2 = 3.4 $$ and the remaining values follow the same pattern with the other weight rows and biases.

Batches

So far we have only seen a one-dimensional input, also known as a single sample. More often than not, real applications have many samples, each with several features, so the inputs will be represented by a matrix. This opens up the possibility of something called batch learning.

When batch_size = 1, meaning one sample is used for each update, training is slow and the updates are noisy, so the model may struggle to learn anything stable.

When batch_size equals the size of the dataset, meaning all of the data is used for every update, generalization tends to suffer and we can end up with something called overfitting.

The trick is to train the model on the available data in batches. A typical batch_size ranges from 8 to 128, with 32 or 64 being the most commonly used values. The idea is to split the entire dataset into batches of, say, 32 samples and train the model on the batches sequentially or in parallel (a small sketch follows below). Since our problem is trivial, batch learning is unnecessary here.
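Just to illustrate the splitting, here is a minimal sketch (with hypothetical placeholder data; our running example does not use it) of looping over a dataset in batches of 32 samples:

import numpy as np

data = np.random.randn(100, 3)              # hypothetical dataset: 100 samples, 3 features
batch_size = 32

for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]  # the last batch may be smaller than 32
    # one training step would be performed on `batch` here
    print(batch.shape)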

In [3]:
inputs = np.array([[1,2,3],
                  [5,6,7],
                  [6,9,10]])

weights =  np.array([[0.1,0.2,0.3],
                    [0.5,0.6,0.7],
                    [1.0,1.1,1.2]])

biases = [2,3,0.5]

outputs = np.dot(weights,inputs) + biases
print(outputs)
[[ 4.9  7.1  5.2]
 [ 9.7 13.9 13.2]
 [15.7 22.4 23.2]]

Until now we have only seen two layers, the input and the output. A neural network can have any number of hidden layers (in between the input and output layers), with any number of neurons in each layer. This shape is usually referred to as the architecture and is chosen based on the data or the problem we are trying to solve.

For our example, let's try to add one more layer. The first layer will be our input, the second is the hidden layer and the last is the output.

In [4]:
inputs = np.array([[1,2,3],
                  [5,6,7],
                  [6,9,10]])

weights1 =  np.array([[0.1,0.2,0.3],
                    [0.5,0.6,0.7],
                    [1.0,1.1,1.2]])
weights2 =  np.array([[0.2,0.3,0.4],
                    [0.9,0.5,0.6],
                    [-1.0,-1.1,-1.2]])
            

bias1 = [2,3,0.5]
bias2 = [3,4,-0.1]

layer1_outputs = np.dot(weights1,inputs) + bias1
layer2_outputs = np.dot(weights2,layer1_outputs) + bias2
print(layer2_outputs)
[[ 13.17  18.55  14.18]
 [ 21.68  30.78  25.1 ]
 [-31.41 -45.27 -47.66]]

Forward propagation

In [5]:
class FullyConnected:
    def __init__(self,n_inputs,n_neurons):
        # weights: one row per input feature, one column per neuron
        self.weights = np.random.randn(n_inputs, n_neurons)
        # biases: one per neuron, initialized to zero
        self.bias = np.zeros((1,n_neurons))
    def forward(self,inputs):
        # output = inputs . weights + bias, computed for all samples at once
        self.output = np.dot(inputs,self.weights) + self.bias

np.random.seed(0)

X = np.random.randn(4,4)
layer1 = FullyConnected(4,5)
layer2 = FullyConnected(5,2)

layer1.forward(X)
layer2.forward(layer1.output)
print(layer2.output)
[[10.55685647 11.153354  ]
 [ 5.58478598 -4.59996503]
 [ 3.1367388   6.79947482]
 [ 2.41001989  2.72360467]]

Okay. Now we put together what we have seen so far into a single object. We randomly initialize the weights with shape (n_inputs, n_neurons) and the biases with shape (1, n_neurons). Notice how the shapes change between layer 1 and layer 2: since layer 1 outputs 5 values per sample, the input size of layer 2 has to be 5. This is the skeleton of forward propagation. The values are completely arbitrary and there is no 'learning' happening yet. A quick check of the shapes is shown below; after that we will continue on to the other parts of a neural network, starting with activation functions, what they are and why they are useful.
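To make the shape bookkeeping concrete, here is a quick check, reusing X and the layer objects created above:

print(X.shape)               # (4, 4)  four samples with four features each
print(layer1.weights.shape)  # (4, 5)  maps 4 inputs to 5 neurons
print(layer1.output.shape)   # (4, 5)  five values per sample after layer 1
print(layer2.weights.shape)  # (5, 2)  must accept the 5 outputs of layer 1
print(layer2.output.shape)   # (4, 2)  two values per sample after layer 2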

Activation function

Activation functions are generally used to introduce non-linearity into the model. Purely linear layers are severely handicapped: no matter how many of them you stack, the whole network collapses into a single linear transformation (demonstrated in the short check below), which is rarely useful in real-world applications. Activations are applied in between the layers of the feed-forward network, to each neuron individually.
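As a quick sanity check on that claim (a minimal sketch with made-up matrices; biases are left out for brevity, but including them does not change the conclusion), two linear layers applied in sequence are equivalent to one linear layer whose weight matrix is the product of the two:

import numpy as np

np.random.seed(0)
x = np.random.randn(4, 3)    # 4 samples, 3 features
W1 = np.random.randn(3, 5)   # first linear layer
W2 = np.random.randn(5, 2)   # second linear layer

two_layers = np.dot(np.dot(x, W1), W2)  # x -> layer 1 -> layer 2, no activation in between
one_layer = np.dot(x, np.dot(W1, W2))   # a single equivalent linear layer

print(np.allclose(two_layers, one_layer))  # True: stacking linear layers adds no expressive power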

There are many non-linear activation functions such as sigmoid, tanh, ReLU, softmax, etc., and each has its own advantages and disadvantages. First, we will take a look at a couple of them.

Sigmoid

The main reason we use the sigmoid function is that its output lies between 0 and 1, which makes it easy to interpret the output as a prediction probability.

$$ h(x)= \frac{\mathrm{1} }{\mathrm{1} + e^{-x }} $$

$$ \text{where } x \text{ is the input to the sigmoid function} $$

In [6]:
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

ip = np.random.rand(3,3)
op = sigmoid(ip)
In [7]:
print(ip)
[[0.65310833 0.2532916  0.46631077]
 [0.24442559 0.15896958 0.11037514]
 [0.65632959 0.13818295 0.19658236]]
In [8]:
print(op)
[[0.65771057 0.56298651 0.61451019]
 [0.56080398 0.53965891 0.52756581]
 [0.6584354  0.53449087 0.54898793]]

ReLU

Rectified linear unit (ReLU) is one of the most commonly used activation functions in neural networks. It is very simple yet very powerful in the learning process.

$$ \text{ReLU}(x) = \max(0, x) $$

The ReLU function returns positive inputs unchanged, and returns 0 whenever the input is <= 0.

In [18]:
import random

def relu(z):
    return np.maximum(0,z)

ip = [random.uniform(-1,1) for _ in range(3)]
op = relu(ip)
In [20]:
print(ip)
[-0.7707681474150962, 0.9406125215366139, 0.195550410471103]
In [22]:
print(op)
[0.         0.94061252 0.19555041]

Introducing data

Okay, let's put the activations together with the object we wrote earlier. But we need some kind of data to do some meaningful learning, rather than just passing random arrays around. We will use a toy dataset from https://cs231n.github.io/neural-networks-case-study/ that generates a spiral dataset.

In [23]:
def spiral_data(points, classes):
    X = np.zeros((points*classes, 2))
    y = np.zeros(points*classes, dtype='uint8')
    for class_number in range(classes):
        ix = range(points*class_number, points*(class_number+1))
        r = np.linspace(0.0, 1, points)  # radius
        t = np.linspace(class_number*4, (class_number+1)*4, points) + np.random.randn(points)*0.2
        X[ix] = np.c_[r*np.sin(t*2.5), r*np.cos(t*2.5)]
        y[ix] = class_number
    return X, y
In [75]:
import matplotlib.pyplot as plt

X, y = spiral_data(100,2)
plt.scatter(X[:,0],X[:,1],c=y,s=40, cmap=plt.cm.Spectral)
plt.show()
[Figure: scatter plot of the generated two-class spiral dataset]
In [39]:
class FullyConnected:
    def __init__(self,n_inputs,n_neurons):
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.bias = np.zeros((1,n_neurons))
    def forward(self,inputs):
        self.output = np.dot(inputs,self.weights) + self.bias

class ActivationLayer:
    def forward(self, inputs):
        self.output = np.maximum(0,inputs)



layer1 = FullyConnected(2,5)
layer2 = FullyConnected(5,2)
relu_activation = ActivationLayer()

layer1.forward(X)
relu_activation.forward(layer1.output)
layer2.forward(relu_activation.output)

OK, now we have created a small neural network with forward propagation from input --> layer1 --> ReLU --> layer2.

We will print some of the values from the layers before and after the activation to see the difference.

In [41]:
print(layer1.output[5])
[ 0.09840815  0.0408021  -0.02963727 -0.03639039 -0.03389406]
In [42]:
print(relu_activation.output[5])
[0.09840815 0.0408021  0.         0.         0.        ]

Softmax

You might have noticed the mix of zeros and actual values. One might ask: aren't those zeros going to be detrimental to the network? Will it stop learning when everything becomes zero? That can indeed happen, and this is why it is important to look at the numbers and at what is happening inside a network rather than treating it as a black box. In this case one can adjust the biases so that the zeros do not become a problem, as sketched below.
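As a rough illustration (reusing X and the FullyConnected class from above; the 0.01 value is an arbitrary choice for this sketch, not a rule), initializing the biases to a small positive value instead of zero reduces the fraction of activations that ReLU zeroes out:

np.random.seed(1)

layer = FullyConnected(2, 5)        # biases start at zero
layer.forward(X)
dead_before = np.mean(np.maximum(0, layer.output) == 0)

layer.bias = np.full((1, 5), 0.01)  # nudge every bias slightly above zero
layer.forward(X)
dead_after = np.mean(np.maximum(0, layer.output) == 0)

print(dead_before, dead_after)      # the second fraction is never larger, usually a bit smaller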

Okay, but what can we do with these numbers? They do not say much about the problem's solution. A neural network classifier is a predictive model, and we want its outputs to be probabilities. A sigmoid function squeezes values in between 0 and 1, which can be interpreted as probabilities. What about ReLU? This is where the softmax activation comes in. Generally, softmax is applied as the last step, after the final hidden layer, to spew out probabilities rather than raw, unbounded values.

Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.

Mathematically Softmax is given by,

$$ S(y_i) = \frac{\exp(y_i)}{\sum_{j} \exp(y_j)} \qquad \text{where } y \text{ is the input vector and } \exp \text{ is the exponential function (with base } e \text{, Euler's number)} $$

Dealing with exponentials is a bit risky, as even a moderately large value in the input vector can cause numerical overflow. So we should do one more operation before applying the softmax:

$$ v_i \leftarrow v_i - \max(v) $$
In [60]:
layer_output = [[4.3,1.21,2.3],
                [9.9,-1.2,-3.45],
                [0.5,-1.5,2.5]]

soft_ip = layer_output - np.max(layer_output,keepdims=True, axis=1)

exp_values = np.exp(soft_ip)
norm_values = exp_values / np.sum(exp_values, keepdims=True, axis=1)
In [61]:
print(np.sum(norm_values, axis=1))
[1. 1. 1.]

Okay, now let's put the softmax operation into our object and run it on the data we generated earlier.

In [58]:
class FullyConnected:
    def __init__(self,n_inputs,n_neurons):
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.bias = np.zeros((1,n_neurons))
    def forward(self,inputs):
        self.output = np.dot(inputs,self.weights) + self.bias

class ActivationRelu:
    def forward(self, inputs):
        self.output = np.maximum(0,inputs)


class ActivationSoftmax:
    def forward(self,inputs):
        exp_values = np.exp(inputs- np.max(inputs, axis=1,keepdims=True))
        probabilities = exp_values/np.sum(exp_values, axis=1,keepdims=True)

        self.output = probabilities

    
layer1 = FullyConnected(2,5)
layer2 = FullyConnected(5,2)
activation1 = ActivationRelu()
activation2 = ActivationSoftmax()


layer1.forward(X)
activation1.forward(layer1.output)

layer2.forward(activation1.output)
activation2.forward(layer2.output)

print(activation2.output[:5])
[[0.5        0.5       ]
 [0.50994638 0.49005362]
 [0.53078178 0.46921822]
 [0.54887241 0.45112759]
 [0.53155721 0.46844279]]

Alright, now we have our output as two probabilities representing the two classes we passed as input. Keep in mind, this is not learning; it is just the result of applying the activation functions to randomly initialized weights, so it is common to see a near 1/2 probability for each class. To get the class label, we take the argmax() of the probabilities as the final prediction, as shown below.
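For example, a minimal sketch using the outputs above: the predicted labels and a crude accuracy against the true labels y can be computed like this, and with untrained random weights the accuracy should hover around chance level.

predictions = np.argmax(activation2.output, axis=1)  # index of the highest probability per sample
accuracy = np.mean(predictions == y)                 # fraction of correct predictions

print(predictions[:5])
print(accuracy)                                      # roughly 0.5 for an untrained two-class network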

Cost function

Now that we have defined the elements of forward propagation, we can quantify the performance of the network. We do this with a cost function, and there are several ways to define one.

Put simply, a cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y.

Categorical cross entropy loss

The type of cost function we choose will depend on the type of problem we are working on. In a regression, where the model returns continuous numerical values, it makes sense to use mean squared error. For classification, since we are dealing with probabilities, it makes sense to use something like the categorical cross-entropy loss. It is just a fancy name for log loss, which is given by,

$$ L_i = -\sum_{j} y_{i,j}\,\log(\hat{y}_{i,j}) $$

where $\hat{y}$ is the predicted confidence, y is the one-hot encoded target class, and log is the natural logarithm.

One-hot encoding is a way to represent categorical values as binary vectors (illustrated below). Let's then try to compute this loss with a small example.
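For instance, a minimal sketch using np.eye as one common way to build the encoding, with made-up integer class labels:

class_labels = np.array([0, 1, 1, 0])  # integer class labels for four samples
one_hot = np.eye(2)[class_labels]      # row i of the identity matrix encodes class i

print(one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]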

In [65]:
softmax_output = [0.6,0.3,0.1]
labels = [1,0,0]
loss = []
for i,j  in zip(softmax_output,labels):
    L = -(np.log(i)*j)
    loss.append(L)
print(sum(loss))
0.5108256237659907

There is another tricky part here. There might be cases where an output probability is exactly zero, which would produce infinite values from the log.

In [81]:
softmax_output = [0.6,0.4,0.0]
labels = [1,0,0]
loss = []
for i,j  in zip(softmax_output,labels):
    L = -(np.log(i)*j)
    loss.append(L)
print(sum(loss))
nan
<ipython-input-81-56baca09d933>:5: RuntimeWarning: divide by zero encountered in log
  L = -(np.log(i)*j)
<ipython-input-81-56baca09d933>:5: RuntimeWarning: invalid value encountered in double_scalars
  L = -(np.log(i)*j)

To counter this, we should clip the inputs of our loss function to a value that is close to zero but not zero, e.g. $ 10^{-7} $ (and symmetrically just below 1 at the other end).

In [83]:
clipped = np.clip(softmax_output, 1e-7, 1-1e-7)
print(softmax_output)
print(clipped)
[0.6, 0.4, 0.0]
[6.e-01 4.e-01 1.e-07]

Now let's implement this loss function in our object and calculate the loss for those arbitrary data points. The code above is re-done in array form with NumPy, but the operation is the same.

In [79]:
class FullyConnected:
    def __init__(self,n_inputs,n_neurons):
        self.weights = np.random.randn(n_inputs, n_neurons)
        self.bias = np.zeros((1,n_neurons))
    def forward(self,inputs):
        self.output = np.dot(inputs,self.weights) + self.bias

class ActivationRelu:
    def forward(self, inputs):
        self.output = np.maximum(0,inputs)


class ActivationSoftmax:
    def forward(self,inputs):
        exp_values = np.exp(inputs- np.max(inputs, axis=1,keepdims=True))
        probabilities = exp_values/np.sum(exp_values, axis=1,keepdims=True)

        self.output = probabilities

class Loss:
    def calculate(self,output,y):
        loss_samples = self.forward(output,y)
        data_loss = np.mean(loss_samples)
        return data_loss
        
class CategoricalCrossEntropy(Loss):
    def forward(self,y_pred,y_true):
        samples = len(y_pred)
        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)

        confidences = y_pred_clipped[range(samples),y_true]
        negative_log_likelihood = -np.log(confidences)

        return negative_log_likelihood

    
layer1 = FullyConnected(2,5)
layer2 = FullyConnected(5,2)
activation1 = ActivationRelu()
activation2 = ActivationSoftmax()


layer1.forward(X)
activation1.forward(layer1.output)

layer2.forward(activation1.output)
activation2.forward(layer2.output)

loss_function = CategoricalCrossEntropy()
loss = loss_function.calculate(activation2.output,y)

print(f'cross entropy loss: {loss}')
cross entropy loss: 0.8587444069528419

Now the only thing left is to make our network learn.
