Back Propagation from the very basic (Neural Networks)
Back propagation is the workhorse of modern neural networks. In this article I have tried to simplify the math behind it.
To understand back propagation, let's go through the prerequisites first.
Chain Rule of Differential Calculus:
It simply states ∂a/∂c = (∂a/∂b) × (∂b/∂c); intuitively, it is nothing but multiplying and dividing the expression by the same quantity ∂b.
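As a quick check, here is the chain rule on a made-up composition a = sin(b), b = c² (nothing to do with the network yet), compared against a finite-difference estimate of the same derivative:

```python
import math

# Hypothetical composition: a = sin(b), b = c**2, so a(c) = sin(c**2).
# Chain rule: da/dc = (da/db) * (db/dc) = cos(b) * 2c.
c = 1.5
b = c ** 2
da_db = math.cos(b)   # derivative of sin(b) with respect to b
db_dc = 2 * c         # derivative of c**2 with respect to c
chain = da_db * db_dc

# Compare against a central finite difference of a(c) = sin(c**2).
h = 1e-6
numeric = (math.sin((c + h) ** 2) - math.sin((c - h) ** 2)) / (2 * h)
print(abs(chain - numeric) < 1e-6)  # True: the two derivatives agree
```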
Vector addition and subtraction:
A vector is an object that has both magnitude and direction. If a and b are two vectors, then c = a + b is a new vector obtained by placing the tail of b at the head of a: c runs from the tail of a to the head of b.
Similarly, c = a − b is equivalent to a + (−b), where −b has the same magnitude as b but the opposite direction; the rest is the same as stated above.
Gradient:
Let's take a look at the function z = x + y. To represent this function we need a 3-dimensional space, one dimension each for the variables x, y and z. Similarly, if n variables are involved we need an n-dimensional space, one dimension for each variable.
The gradient of a function F is the vector whose components are the partial derivatives of the function with respect to each variable; it is an n-dimensional vector if the number of variables is n. It is denoted by ∇. For z = x + y,
∇z = (∂z/∂x)i + (∂z/∂y)j = (∂(x+y)/∂x)i + (∂(x+y)/∂y)j = i + j, where i and j are the unit vectors along the x and y axes.
The gradient points in the direction of steepest increase in the value of the function.
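For the z = x + y example, the gradient can also be estimated numerically with central differences; each component comes out as 1, matching ∇z = i + j:

```python
import numpy as np

def f(p):
    # f(x, y) = x + y, the example function from the text
    return p[0] + p[1]

def numeric_gradient(f, p, h=1e-6):
    # Approximate each partial derivative with a central difference.
    grad = np.zeros_like(p)
    for k in range(len(p)):
        step = np.zeros_like(p)
        step[k] = h
        grad[k] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

p = np.array([2.0, 3.0])
g = numeric_gradient(f, p)
print(g)  # ≈ [1. 1.], matching ∇z = i + j
```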
Gradient Descent
A point in two dimensions is represented in the form p = (x, y); similarly, a point in n dimensions is represented as p = (a₁, a₂, a₃, …, aₙ), which can also be seen as a position vector from the origin (0, 0, … n times) to (a₁, a₂, a₃, …, aₙ).
For a function f, ∇f at a point p points in the direction of greatest increase, so −∇f must point in the direction of greatest decrease. Therefore if we keep choosing points in the direction of −∇f along the function surface, we will eventually reach a minimum (in two dimensions this is easy to visualize).
The formula for selecting points until we reach a minimum of f becomes p′ = p − ∇f. We want to move in very small steps so as to stay on the curve and not skip a minimum inadvertently; to do that we scale ∇f down by multiplying it by a small value η. The final equation becomes p′ = p − η∇f.
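A minimal sketch of the update rule p′ = p − η∇f, using the made-up function f(x, y) = x² + y², whose gradient (2x, 2y) is known in closed form and whose minimum is at the origin:

```python
import numpy as np

# p' = p - η∇f on f(x, y) = x² + y². η is the small scaling factor
# (the learning rate); the values below are illustrative.
eta = 0.1
p = np.array([4.0, -3.0])
for _ in range(200):
    grad = 2 * p           # ∇f = (2x, 2y) at the current point
    p = p - eta * grad     # step in the direction of steepest decrease
print(p)  # very close to [0, 0], the minimum
```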
If we talk in terms of a neural network, f becomes our cost function and our weights and biases become the coordinates of a point in an n-dimensional space.
I will use C for cost function, w and b for weights and biases respectively.
Why do we need back propagation?
Suppose we have 1 million weights and 1,000 inputs. For each input we would have to calculate ∇C, which itself involves a partial derivative with respect to each component of the weights w; that is 1 million calculations per input, and for 1,000 inputs it becomes 1 billion calculations, with the same again for the biases. Done naively, this is far too time-consuming.
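For scale, a back-of-the-envelope count, assuming one finite-difference evaluation per weight per input (the figures are the ones quoted above):

```python
# Naive alternative to back propagation: estimate ∂C/∂wⱼ for every weight j
# with a separate finite difference, which needs one evaluation per weight
# per input. The numbers below are the ones quoted in the text.
n_weights = 1_000_000
n_inputs = 1_000
forward_passes = n_weights * n_inputs
print(forward_passes)  # 1000000000 — one billion evaluations
```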
Back propagation
Few Notations first.
zᵢ = aᵢ₋₁ . wᵢ + bᵢ, the output of layer i before the activation function is applied; it is a vector with as many components as there are neurons in the iᵗʰ layer.
aᵢ = σ(aᵢ₋₁ . wᵢ + bᵢ) = σ(zᵢ), the output of the neurons at layer i; it is a vector with as many components as there are neurons in the iᵗʰ layer.
wᵢ and bᵢ are the weight and bias vectors of layer i.
δᵢ = ∂C / ∂zᵢ, defined as the error at layer i; this is a vector with as many components as there are neurons in the iᵗʰ layer.
Again, wherever i is used as a subscript it indicates that quantity's vector for layer i of the neural network.
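With these notations, one layer of the forward pass can be sketched in NumPy (the layer sizes and random values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One layer of the forward pass in the article's notation:
# zᵢ = aᵢ₋₁ · wᵢ + bᵢ and aᵢ = σ(zᵢ). Shapes are illustrative:
# 3 neurons in layer i-1, 2 neurons in layer i.
rng = np.random.default_rng(0)
a_prev = rng.standard_normal(3)   # aᵢ₋₁: activations of layer i-1
W = rng.standard_normal((2, 3))   # wᵢ: one row of weights per neuron
b = rng.standard_normal(2)        # bᵢ: one bias per neuron
z = W @ a_prev + b                # zᵢ: pre-activation, 2 components
a = sigmoid(z)                    # aᵢ: 2 components, one per neuron
print(z.shape, a.shape)  # (2,) (2,)
```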
Back propagation involves four equations, these are:
δₗ = ∇Cₐ . σ’(zₗ) (where ∇Cₐ is the gradient of C with respect to the output activations aₗ of the Lᵗʰ layer; this equation applies only to the output layer) …… (1)
δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ) …… (2)
∂C / ∂bᵢ = δᵢ (where ∂C / ∂bᵢ is the gradient of C with respect to the biases at layer i) …… (3)
∂C / ∂wᵢ = δᵢ × (aᵢ₋₁)ᵀ, the outer product of δᵢ and aᵢ₋₁ (where wᵢ is the weights of the iᵗʰ layer and ∂C / ∂wᵢ is the gradient of C with respect to the weights of layer i) …… (4)
The {.} in the above equations represents the Hadamard (element-wise) product of vectors.
Suppose the neural network has L layers, where Lᵗʰ layer is the output layer.
Step 1: For the Lᵗʰ layer we can easily calculate ∇Cₐ (the partial derivative of C with respect to each output activation of the Lᵗʰ layer). σ’(zₗ) = σ(zₗ) × (1 − σ(zₗ)) if σ is the sigmoid function, and the zₗ values were already calculated in the forward pass, so δₗ can be calculated using equation (1).
Step 2: Once we have δₗ, using equations (3) and (4) we can calculate the gradient of C with respect to the weights and biases of layer L.
Step 3: Using equation (2) with the current weights of layer L, we can calculate δ for layer L − 1.
Now we iterate over steps 2 and 3 to obtain the gradients for all the layers of the neural network, and only then update every layer's weights and biases with the gradient descent rule p′ = p − η∇C.
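The four equations and the update rule can be sketched in NumPy for a small, made-up network. The layer sizes, the quadratic cost C = ½‖aₗ − y‖² (so that ∇Cₐ = aₗ − y), and the learning rate are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)  # σ'(z) = σ(z)(1 − σ(z)) for the sigmoid

rng = np.random.default_rng(1)
sizes = [4, 5, 3]  # made-up widths: input, one hidden layer, output
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]
x = rng.standard_normal(sizes[0])   # one made-up input
y = rng.standard_normal(sizes[-1])  # its made-up target

# Forward pass: store every zᵢ and aᵢ for use in the backward pass.
a, activations, zs = x, [x], []
for W, b in zip(Ws, bs):
    z = W @ a + b
    zs.append(z)
    a = sigmoid(z)
    activations.append(a)

# Equation (1): δ at the output layer, with ∇Cₐ = aₗ − y.
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grads_W = [None] * len(Ws)
grads_b = [None] * len(bs)
grads_b[-1] = delta                             # equation (3)
grads_W[-1] = np.outer(delta, activations[-2])  # equation (4)

# Equation (2): propagate the error backwards through earlier layers.
for l in range(len(Ws) - 2, -1, -1):
    delta = (Ws[l + 1].T @ delta) * sigmoid_prime(zs[l])
    grads_b[l] = delta                            # equation (3)
    grads_W[l] = np.outer(delta, activations[l])  # equation (4)

# Only now apply the gradient descent update p' = p − η∇C everywhere.
eta = 0.1
Ws = [W - eta * gW for W, gW in zip(Ws, grads_W)]
bs = [b - eta * gb for b, gb in zip(bs, grads_b)]
```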
The proof of these four equations is as follows:
For equation (1): δₗ = ∇Cₐ . σ’(zₗ)
δₗ = ∂C / ∂zₗ
using the chain rule of differential calculus
∂C / ∂zₗ = (∂C/∂aₗ) × (∂aₗ/∂zₗ)
since aₗ = σ(zₗ), therefore ∂aₗ/∂zₗ = σ’(zₗ), and ∂C/∂aₗ is exactly ∇Cₐ
so δₗ = ∇Cₐ . σ’(zₗ)
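This can also be checked numerically: a sketch with a made-up quadratic cost C = ½‖σ(z) − y‖² (so that ∇Cₐ = a − y), comparing δ against finite differences of C with respect to z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check δ = ∇Cₐ . σ'(z) for the made-up cost C = ½‖σ(z) − y‖²,
# where ∇Cₐ = a − y, against ∂C/∂z estimated by finite differences.
rng = np.random.default_rng(2)
z = rng.standard_normal(3)
y = rng.standard_normal(3)

def cost(z):
    return 0.5 * np.sum((sigmoid(z) - y) ** 2)

a = sigmoid(z)
delta = (a - y) * (a * (1 - a))  # equation (1), with σ'(z) = σ(z)(1−σ(z))

h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = h
    numeric[k] = (cost(z + step) - cost(z - step)) / (2 * h)
print(np.allclose(delta, numeric, atol=1e-6))  # True
```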
Proof of equation (2): δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ)
δᵢ = ∂C / ∂zᵢ = (∂C/∂zᵢ₊₁) × (∂zᵢ₊₁/∂zᵢ) = δᵢ₊₁ × (∂zᵢ₊₁/∂zᵢ)
since by definition ∂C/∂zᵢ₊₁ = δᵢ₊₁
from definition of zᵢ , zᵢ₊₁ = aᵢ . wᵢ₊₁ + bᵢ₊₁, further substitution aᵢ = σ(zᵢ)
zᵢ₊₁ = σ(zᵢ). wᵢ₊₁ + bᵢ₊₁
∂ zᵢ₊₁/∂ zᵢ = ∂ (σ(zᵢ). wᵢ₊₁) /∂ zᵢ + ∂(bᵢ₊₁)/∂ zᵢ
∂(bᵢ₊₁)/∂zᵢ = 0, as b is constant with respect to z and hence its differential with respect to z is zero.
so ∂zᵢ₊₁/∂zᵢ = wᵢ₊₁ × σ’(zᵢ)
hence δᵢ = ((wᵢ₊₁)ᵀ × δᵢ₊₁) . σ’(zᵢ), the transpose appearing when the components are collected back into vector form.
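Equation (2) can be sanity-checked the same way on a made-up pair of layers (the shapes and the quadratic cost are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numeric check of δᵢ = ((wᵢ₊₁)ᵀ δᵢ₊₁) . σ'(zᵢ) for one pair of layers,
# with C = ½‖σ(z₂) − y‖² and z₂ = w₂ · σ(z₁) + b₂.
rng = np.random.default_rng(3)
W2 = rng.standard_normal((2, 3))
b2 = rng.standard_normal(2)
z1 = rng.standard_normal(3)
y = rng.standard_normal(2)

def cost(z1):
    a1 = sigmoid(z1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y) ** 2)

a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
delta2 = (a2 - y) * (a2 * (1 - a2))         # δ₂ from equation (1)
delta1 = (W2.T @ delta2) * (a1 * (1 - a1))  # δ₁ from equation (2)

# Compare δ₁ against finite differences of C with respect to z₁.
h = 1e-6
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = h
    numeric[k] = (cost(z1 + step) - cost(z1 - step)) / (2 * h)
print(np.allclose(delta1, numeric, atol=1e-6))  # True
```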
Now moving to equation (3) ∂C / ∂bᵢ = δᵢ
∂C / ∂b = (∂C/∂z) × (∂z/∂b) (by the chain rule of partial differentiation, i.e. multiplying and dividing by the same term)
z = w.a + b, hence ∂z / ∂b = 1: the w.a term is constant with respect to b and so differentiates to zero, while the partial derivative of b with respect to itself is 1.
Hence ∂C / ∂b = (∂C/∂z) × 1 = ∂C/∂z = δ
The proof of equation (4) is similar, and I leave it to the reader.