What is Stochastic Gradient Descent?

  • Tanesh Balodi
  • Aug 06, 2021
  • Deep Learning
  • Python Programming
What is Stochastic Gradient Descent? title banner

Stochastic gradient descent is a popular and most common machine learning algorithm that is used to train a neural network. Stochastic gradient descent is a clever approach to gradient descent, where it tackles the major limitation of the gradient descent algorithm.


As the name suggests, gradient means inclination and to incline in a descending manner to find the local minima or the minimum surface is the goal of this algorithm. 


So in this article, we are going to understand why we use a stochastic gradient descent algorithm (SGD) rather than gradient descent. Later on, we will implement SGD with the help of SKlearn.



What is Gradient Descent?


Gradient descent is an optimizing algorithm that is used to optimize almost everything in machine learning, from the straight linear lines, non-linear, even clusters, it covers all of them. This is the main reason this algorithm is considered to be a very powerful tool in machine learning.


The gradient descent is almost like you are walking down the mountain to the lowest point of that mountain, now how you are coming down also makes a big impact on the conclusion.


Let’s say you have fallen from the highest point, now you may or may not reach the lowest point of that mountain if you came down rolling, there are high chances that with gravity you might surpass the lowest point but again due to gravity.


You will be at the lowest point, so rolling is a good option here. Same way, in order to find the lowest point of the surface, we will take a few steps from the highest point, these steps are known as ‘Learning Rate’. 


(Also Read: What is Cost Function?)


Steps For Gradient Descent


  1. Initially, we need to analyze the slope of the function wrt their features or we need to reckon the gradient for that function.

image is representing the gradient formed with the help of the function

Function’s Graph

  1. In the second step, we will start with any random point, and then we will find the derivative of the function or the gradient of that function, for example, if the function was (x+5)4, then the derivative of this function would be 4(x+5).

random initial point to compute the gradients for gradient descent algorithm

Random Initial Point

  1. Move downwards toward the local or global minima, or the lowest point of the gradient. We discussed the learning rate above. The learning rate defines how much we will move downward. It is a general strategy to take a larger learning rate initially and then make it smaller gradually in each step.

image is representing the learning rate in gradient descent to find the local minima

Learning Rate

  1. Do the 1st iteration, and for the second iteration, use


I2 = (Initial Random Starting Point) - (Learning Rate) * (Derivative of the Function)


I3  = I2 - (Learning Rate) * (Derivative of the Function Using the I2 Value)

We have to repeat the process until our gradient goes to 0, which means there is no slope left or the bottom of the process is achieved.


(Recommended: 5 ML techniques to prevent overfitting)


Problem With Gradient Descent?


The problem lies in the execution, while executing the above steps, did you notice the sheer amount of steps are there to perform this algorithm?


If we have a million data points and 20 features associated with them, it will take us 20 million computations for the derivative and only for 1 iteration and we don’t usually take less than 500 iterations, therefore around 10 billion computations, which is very high, in short Gradient Descent cannot handle Big Data.


(Must read: Feature scaling in machine learning)


Implementing Gradient Descent Using Python From Scratch


The implementation is the same as the steps we have discussed above, we have to import necessary python libraries like numpy, matplotlib, and dataset from sklearn.


Later we are implementing all the steps like initializing the random data point, computing loss, calculating gradients, derivation. At last, we are making a function to update the weights.

import numpy as np

from matplotlib import pyplot as plt

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=1, bias=4.5, noise=3.3)


class UniVariateLinearRegression:


    def __init__(self, X, y):

        self.X = X

        self.y = y

        self.coef = np.random.uniform(low=-1, high=1)

        self.bias = np.random.random()


    def compute_loss(self):

        losses = []

        for x,y in zip(self.X, self.y):

            yhat = self.predict(x)

            loss = (y - yhat)**2



        losses = np.array(losses)

        return losses.sum() / (2 * self.X.shape[0])


    ### Gradient Descent

    def calculate_gradients(self):

        grad_00, grad_01 = list(), list()


        for x,y in zip(self.X, self.y):

            yhat = self.predict(x)

            grad_00.append(yhat - y)

            grad_01.append((yhat - y)*x)


        grad_00, grad_01 = np.array(grad_00), np.array(grad_01)

        grad_00 = grad_00.sum() / (self.X.shape[0])

        grad_01 = grad_01.sum() / (self.X.shape[0])

        return (grad_00, grad_01) # Bias, Coef


    def update_weights(self, gradients, learning_rate):

        self.bias = self.bias - (learning_rate * gradients[0])

        self.coef = self.coef - (learning_rate * gradients[1])


About Stochastic Gradient Descent


Stochastic Gradient Descent is a clever approach to implement gradient descent, and also it makes the computations way less than the Gradient Descent. 


The word ‘stochastic’ means random, and here stochastic gradient descent means that SGD picks up only on random data point at the time of iteration, not every data point, so for example, if there were 3 data points, the total number of terms is reduced by the factor of 3. So for a million samples, a million computations are required. (From)


Some of the major advantages of SGD are-:


  • They work faster than the Gradient Descent

  • SGD can reduce the data redundancy in a dataset

  • For larger datasets or Big data, it converges faster

  • Regular updates help in getting out of local minima of the loss function


Some major Disadvantages of SGD are-:


  • Computation is very costly if compared with Gradient Descent

  • Frequent updates lead to a noisy learning rate


(Suggest reading: How to use Gradient Boosting Algorithm Using Python?)


Implementing Stochastic Gradient Descent Using Python

from sklearn.linear_model import SGDRegressor

clf = SGDRegressor(max_iter=1000, tol=1e-3)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

In order to implement SGD, we have a special module in SKlearn that helps us to implement Stochastic Gradient Descent, so all we need is to pre-process the dataset and split it into the training and testing dataset, now all we need to do is to fit our dataset into the SGD regressor.





This blog holds valuable information and insights about the difference between Gradient descent and Stochastic Gradient Descent algorithm. We can now safely conclude that gradient descent does not work properly or tends to get very slow when the dataset is huge while SGD can handle big data with ease. 


(Read also:  What is LightGBM Algorithm?)


However, every technique has some or other kind of limitation, and in the case of SGD the problem lies in its expensive computation and noise.