Back Propagation from the very basic (Neural Networks)
Back propagation is the workhorse of modern neural networks. In this article I have tried to simplify the math behind it.
To understand back propagation, let's go through the prerequisites first.
Chain Rule of Differential Calculus:
It simply states ∂a/∂c = (∂a/∂b) × (∂b/∂c); intuitively, it is nothing but multiplying and dividing the expression by the same quantity ∂b.
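As a quick check, here is the chain rule on a made-up composition a = sin(b), b = c² (nothing to do with the network yet), compared against a finite-difference estimate of the same derivative:

```python
import math

# Hypothetical composition: a = sin(b), b = c**2, so a(c) = sin(c**2).
# Chain rule: da/dc = (da/db) * (db/dc) = cos(b) * 2c.
c = 1.5
b = c ** 2
da_db = math.cos(b)   # derivative of sin(b) with respect to b
db_dc = 2 * c         # derivative of c**2 with respect to c
chain = da_db * db_dc

# Compare against a central finite difference of a(c) = sin(c**2).
h = 1e-6
numeric = (math.sin((c + h) ** 2) - math.sin((c - h) ** 2)) / (2 * h)
print(abs(chain - numeric) < 1e-6)  # True: the two derivatives agree
```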
Vector addition and subtraction:
A vector is an object that has both magnitude and direction. If a and b are two vectors, then c = a + b is a new vector obtained by placing the tail of b at the head of a: c runs from the tail of a to the head of b.
Similarly, c = a − b is equivalent to a + (−b), where −b has the same magnitude as b but the opposite direction; the rest is the same as stated above.
Gradient:
Let's take a look at the function z = x + y. To represent this function we need a 3-dimensional space, one dimension each for the variables x, y and z. Similarly, if n variables are involved we need an n-dimensional space, one dimension for each variable.
The gradient of a function F is the vector whose components are the partial derivatives of the function with respect to each variable; it is an n-dimensional vector if the number of variables is n. It is denoted by ∇. For z = x + y,
∇z = (∂z/∂x)i + (∂z/∂y)j = (∂(x+y)/∂x)i + (∂(x+y)/∂y)j = i + j, where i and j are the unit vectors along the x and y axes.
The gradient points in the direction of steepest increase in the value of the function.
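For the z = x + y example, the gradient can also be estimated numerically with central differences; each component comes out as 1, matching ∇z = i + j:

```python
import numpy as np

def f(p):
    # f(x, y) = x + y, the example function from the text
    return p[0] + p[1]

def numeric_gradient(f, p, h=1e-6):
    # Approximate each partial derivative with a central difference.
    grad = np.zeros_like(p)
    for k in range(len(p)):
        step = np.zeros_like(p)
        step[k] = h
        grad[k] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

p = np.array([2.0, 3.0])
g = numeric_gradient(f, p)
print(g)  # ≈ [1. 1.], matching ∇z = i + j
```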
Gradient Descent
A point in two dimensions is represented in the form p = (x, y); similarly, a point in n dimensions is represented as p = (a₁, a₂, a₃, …, aₙ), which can also be seen as a position vector from the origin (0, 0, … n times) to (a₁, a₂, a₃, …, aₙ).
For a function f, ∇f at a point p points in the direction of greatest increase, so −∇f must point in the direction of greatest decrease. Therefore if we keep choosing points in the direction of −∇f along the function surface, we will eventually reach a minimum (in two dimensions this is easy to visualize).
The formula for selecting points until we reach a minimum of f becomes p′ = p − ∇f. We want to move in very small steps so as to stay on the curve and not skip a minimum inadvertently; to do that we scale ∇f down by multiplying it by a small value η. The final equation becomes p′ = p − η∇f.
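A minimal sketch of the update rule p′ = p − η∇f, using the made-up function f(x, y) = x² + y², whose gradient (2x, 2y) is known in closed form and whose minimum is at the origin:

```python
import numpy as np

# p' = p - η∇f on f(x, y) = x² + y². η is the small scaling factor
# (the learning rate); the values below are illustrative.
eta = 0.1
p = np.array([4.0, -3.0])
for _ in range(200):
    grad = 2 * p           # ∇f = (2x, 2y) at the current point
    p = p - eta * grad     # step in the direction of steepest decrease
print(p)  # very close to [0, 0], the minimum
```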
If we talk in terms of a neural network, f becomes our cost function and our weights and biases become the coordinates of a point in an n-dimensional space.
I will use C for cost function, w and b for weights and biases respectively.
Why do we need back propagation?
Suppose we have 1 million weights and 1,000 inputs. For each input we would have to calculate ∇C, which itself involves a partial derivative with respect to each component of the weights w; that is 1 million calculations per input, and for 1,000 inputs it becomes 1 billion calculations, with the same again for the biases. Done naively, this is far too time-consuming.
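For scale, a back-of-the-envelope count, assuming one finite-difference evaluation per weight per input (the figures are the ones quoted above):

```python
# Naive alternative to back propagation: estimate ∂C/∂wⱼ for every weight j
# with a separate finite difference, which needs one evaluation per weight
# per input. The numbers below are the ones quoted in the text.
n_weights = 1_000_000
n_inputs = 1_000
forward_passes = n_weights * n_inputs
print(forward_passes)  # 1000000000 — one billion evaluations
```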
Back propagation
Few Notations first.
zᵢ = aᵢ₋₁ . wᵢ + bᵢ, the output of layer i before the activation function is applied; it is a vector with as many components as there are neurons in the iᵗʰ layer.
aᵢ = σ(aᵢ₋₁ . wᵢ + bᵢ) = σ(zᵢ), the output of the neurons at layer i; it is a vector with as many components as there are neurons in the iᵗʰ layer.
wᵢ and bᵢ are the weight and bias vectors of layer i.
δᵢ = ∂C / ∂zᵢ, defined as the error at layer i; this is a vector with as many components as there are neurons in the iᵗʰ layer.
Again, wherever i is used as a subscript it indicates that quantity's vector for layer i of the neural network.
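With these notations, one layer of the forward pass can be sketched in NumPy (the layer sizes and random values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer of the forward pass in the article's notation:
# zᵢ = aᵢ₋₁ · wᵢ + bᵢ and aᵢ = σ(zᵢ). Shapes are illustrative:
# 3 neurons in layer i-1, 2 neurons in layer i.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal(3)   # aᵢ₋₁: activations of layer i-1
W = rng.standard_normal((2, 3))   # wᵢ: one row of weights per neuron
b = rng.standard_normal(2)        # bᵢ: one bias per neuron
z = W @ a_prev + b                # zᵢ: pre-activation, 2 components
a = sigmoid(z)                    # aᵢ: 2 components, one per neuron
print(z.shape, a.shape)  # (2,) (2,)
```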
Back propagation involves four equations, these are:
δₗ = ∇Cₐ . σ’(zₗ) (where ∇Cₐ is the gradient of C with respect to the output activations aₗ of the Lᵗʰ layer; this equation applies only to the output layer) …… (1)
δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ) …… (2)
∂C / ∂bᵢ = δᵢ (where ∂C / ∂bᵢ is the gradient of C with respect to the biases at layer i) …… (3)
∂C / ∂wᵢ = δᵢ × (aᵢ₋₁)ᵀ, the outer product of δᵢ and aᵢ₋₁ (where wᵢ is the weights of the iᵗʰ layer and ∂C / ∂wᵢ is the gradient of C with respect to the weights of layer i) …… (4)
The {.} in the above equations represents the Hadamard (element-wise) product of vectors.
Suppose the neural network has L layers, where Lᵗʰ layer is the output layer.
Step 1: For the Lᵗʰ layer we can easily calculate ∇Cₐ (the partial derivative of C with respect to each output activation of the Lᵗʰ layer). σ’(zₗ) = σ(zₗ) × (1 − σ(zₗ)) if σ is the sigmoid function, and the zₗ values were already calculated in the forward pass, so δₗ can be calculated using equation (1).
Step 2: Once we have δₗ, using equations (3) and (4) we can calculate the gradient of C with respect to the weights and biases of layer L.
Step 3: Using equation (2) with the current weights of layer L, we can calculate δ for layer L − 1.
Now we iterate over steps 2 and 3 to obtain the gradients for all the layers of the neural network, and only then update every layer's weights and biases with the gradient descent rule p′ = p − η∇C.
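The four equations and the update rule can be sketched in NumPy for a small, made-up network. The layer sizes, the quadratic cost C = ½‖aₗ − y‖² (so that ∇Cₐ = aₗ − y), and the learning rate are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)  # σ'(z) = σ(z)(1 − σ(z)) for the sigmoid

rng = np.random.default_rng(1)
sizes = [4, 5, 3]  # made-up widths: input, one hidden layer, output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x = rng.standard_normal(sizes[0])   # one made-up input
y = rng.standard_normal(sizes[-1])  # its made-up target

# Forward pass: store every zᵢ and aᵢ for use in the backward pass.
a, activations, zs = x, [x], []
for W, b in zip(Ws, bs):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Equation (1): δ at the output layer, with ∇Cₐ = aₗ − y.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_b[-1] = delta                             # equation (3)
grads_W[-1] = np.outer(delta, activations[-2])  # equation (4)

# Equation (2): propagate the error backwards through earlier layers.
for l in range(len(Ws) - 2, -1, -1):
    delta = (Ws[l + 1].T @ delta) * sigmoid_prime(zs[l])
    grads_b[l] = delta                            # equation (3)
    grads_W[l] = np.outer(delta, activations[l])  # equation (4)

# Only now apply the gradient descent update p' = p − η∇C everywhere.
eta = 0.1
Ws = [W - eta * gW for W, gW in zip(Ws, grads_W)]
bs = [b - eta * gb for b, gb in zip(bs, grads_b)]
```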
The proof of these four equations is as follows:
For equation (1): δₗ = ∇Cₐ . σ’(zₗ)
δₗ = ∂C / ∂zₗ
using the chain rule of differential calculus
∂C / ∂zₗ = (∂C/∂aₗ) × (∂aₗ/∂zₗ)
since aₗ = σ(zₗ), therefore ∂aₗ/∂zₗ = σ’(zₗ), and ∂C/∂aₗ is exactly ∇Cₐ
so δₗ = ∇Cₐ . σ’(zₗ)
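This can also be checked numerically: a sketch with a made-up quadratic cost C = ½‖σ(z) − y‖² (so that ∇Cₐ = a − y), comparing δ against finite differences of C with respect to z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check δ = ∇Cₐ . σ'(z) for the made-up cost C = ½‖σ(z) − y‖²,
# where ∇Cₐ = a − y, against ∂C/∂z estimated by finite differences.
rng = np.random.default_rng(2)
z = rng.standard_normal(3)
y = rng.standard_normal(3)

def cost(z):
    return 0.5 * np.sum((sigmoid(z) - y) ** 2)

a = sigmoid(z)
delta = (a - y) * (a * (1 - a))  # equation (1), with σ'(z) = σ(z)(1−σ(z))

h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = h
    numeric[k] = (cost(z + step) - cost(z - step)) / (2 * h)
print(np.allclose(delta, numeric, atol=1e-6))  # True
```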
Proof of equation (2): δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ)
δᵢ = ∂C / ∂zᵢ = (∂C/∂zᵢ₊₁) × (∂zᵢ₊₁/∂zᵢ) = δᵢ₊₁ × (∂zᵢ₊₁/∂zᵢ)
since by definition ∂C/∂zᵢ₊₁ = δᵢ₊₁
from definition of zᵢ , zᵢ₊₁ = aᵢ . wᵢ₊₁ + bᵢ₊₁, further substitution aᵢ = σ(zᵢ)
zᵢ₊₁ = σ(zᵢ). wᵢ₊₁ + bᵢ₊₁
∂ zᵢ₊₁/∂ zᵢ = ∂ (σ(zᵢ). wᵢ₊₁) /∂ zᵢ + ∂(bᵢ₊₁)/∂ zᵢ
∂(bᵢ₊₁)/∂zᵢ = 0, as b is constant with respect to z and hence its differential with respect to z is zero.
so ∂zᵢ₊₁/∂zᵢ = wᵢ₊₁ × σ’(zᵢ)
hence δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ), the transpose appearing when the components are collected back into vector form.
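Equation (2) can be sanity-checked the same way on a made-up pair of layers (the shapes and the quadratic cost are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numeric check of δᵢ = ((wᵢ₊₁)ᵀ δᵢ₊₁) . σ'(zᵢ) for one pair of layers,
# with C = ½‖σ(z₂) − y‖² and z₂ = w₂ · σ(z₁) + b₂.
rng = np.random.default_rng(3)
W2 = rng.standard_normal((2, 3))
b2 = rng.standard_normal(2)
z1 = rng.standard_normal(3)
y = rng.standard_normal(2)

def cost(z1):
    a1 = sigmoid(z1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
delta2 = (a2 - y) * (a2 * (1 - a2))         # δ₂ from equation (1)
delta1 = (W2.T @ delta2) * (a1 * (1 - a1))  # δ₁ from equation (2)

# Compare δ₁ against finite differences of C with respect to z₁.
h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = h
    numeric[k] = (cost(z1 + step) - cost(z1 - step)) / (2 * h)
print(np.allclose(delta1, numeric, atol=1e-6))  # True
```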
Now moving to equation (3) ∂C / ∂bᵢ = δᵢ
∂C / ∂b = (∂C/∂z) × (∂z/∂b) (by the chain rule of partial differentiation, i.e. multiplying and dividing by the same term)
z = w.a + b, hence ∂z / ∂b = 1: the w.a term is constant with respect to b and so differentiates to zero, while the partial derivative of b with respect to itself is 1.
Hence ∂C / ∂b = (∂C/∂z) × 1 = ∂C/∂z = δ
The proof of equation (4) is similar, and I leave it to the reader.