Stochastic gradient descent is a popular and most common machine learning algorithm that is used to train a neural network. Stochastic gradient descent is a clever approach to gradient descent, where it tackles the major limitation of the gradient descent algorithm.
As the name suggests, gradient means inclination and to incline in a descending manner to find the local minima or the minimum surface is the goal of this algorithm.
So in this article, we are going to understand why we use a stochastic gradient descent algorithm (SGD) rather than gradient descent. Later on, we will implement SGD with the help of SKlearn.
Gradient descent is an optimizing algorithm that is used to optimize almost everything in machine learning, from the straight linear lines, non-linear, even clusters, it covers all of them. This is the main reason this algorithm is considered to be a very powerful tool in machine learning.
The gradient descent is almost like you are walking down the mountain to the lowest point of that mountain, now how you are coming down also makes a big impact on the conclusion.
Let’s say you have fallen from the highest point, now you may or may not reach the lowest point of that mountain if you came down rolling, there are high chances that with gravity you might surpass the lowest point but again due to gravity.
You will be at the lowest point, so rolling is a good option here. Same way, in order to find the lowest point of the surface, we will take a few steps from the highest point, these steps are known as ‘Learning Rate’.
(Also Read: What is Cost Function?)
Initially, we need to analyze the slope of the function wrt their features or we need to reckon the gradient for that function.
Function’s Graph
In the second step, we will start with any random point, and then we will find the derivative of the function or the gradient of that function, for example, if the function was (x+5)4, then the derivative of this function would be 4(x+5).
Random Initial Point
Move downwards toward the local or global minima, or the lowest point of the gradient. We discussed the learning rate above. The learning rate defines how much we will move downward. It is a general strategy to take a larger learning rate initially and then make it smaller gradually in each step.
Learning Rate
Do the 1st iteration, and for the second iteration, use
I2 = (Initial Random Starting Point) - (Learning Rate) * (Derivative of the Function)
I3 = I2 - (Learning Rate) * (Derivative of the Function Using the I2 Value)
We have to repeat the process until our gradient goes to 0, which means there is no slope left or the bottom of the process is achieved.
(Recommended: 5 ML techniques to prevent overfitting)
The problem lies in the execution, while executing the above steps, did you notice the sheer amount of steps are there to perform this algorithm?
If we have a million data points and 20 features associated with them, it will take us 20 million computations for the derivative and only for 1 iteration and we don’t usually take less than 500 iterations, therefore around 10 billion computations, which is very high, in short Gradient Descent cannot handle Big Data.
(Must read: Feature scaling in machine learning)
The implementation is the same as the steps we have discussed above, we have to import necessary python libraries like numpy, matplotlib, and dataset from sklearn.
Later we are implementing all the steps like initializing the random data point, computing loss, calculating gradients, derivation. At last, we are making a function to update the weights.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=500, n_features=1, bias=4.5, noise=3.3)
class UniVariateLinearRegression:
def __init__(self, X, y):
self.X = X
self.y = y
self.coef = np.random.uniform(low=-1, high=1)
self.bias = np.random.random()
def compute_loss(self):
losses = []
for x,y in zip(self.X, self.y):
yhat = self.predict(x)
loss = (y - yhat)**2
losses.append(loss)
losses = np.array(losses)
return losses.sum() / (2 * self.X.shape[0])
### Gradient Descent
def calculate_gradients(self):
grad_00, grad_01 = list(), list()
for x,y in zip(self.X, self.y):
yhat = self.predict(x)
grad_00.append(yhat - y)
grad_01.append((yhat - y)*x)
grad_00, grad_01 = np.array(grad_00), np.array(grad_01)
grad_00 = grad_00.sum() / (self.X.shape[0])
grad_01 = grad_01.sum() / (self.X.shape[0])
return (grad_00, grad_01) # Bias, Coef
def update_weights(self, gradients, learning_rate):
self.bias = self.bias - (learning_rate * gradients[0])
self.coef = self.coef - (learning_rate * gradients[1])
Stochastic Gradient Descent is a clever approach to implement gradient descent, and also it makes the computations way less than the Gradient Descent.
The word ‘stochastic’ means random, and here stochastic gradient descent means that SGD picks up only on random data point at the time of iteration, not every data point, so for example, if there were 3 data points, the total number of terms is reduced by the factor of 3. So for a million samples, a million computations are required. (From)
Some of the major advantages of SGD are-:
They work faster than the Gradient Descent
SGD can reduce the data redundancy in a dataset
For larger datasets or Big data, it converges faster
Regular updates help in getting out of local minima of the loss function
Some major Disadvantages of SGD are-:
Computation is very costly if compared with Gradient Descent
Frequent updates lead to a noisy learning rate
(Suggest reading: How to use Gradient Boosting Algorithm Using Python?)
from sklearn.linear_model import SGDRegressor
clf = SGDRegressor(max_iter=1000, tol=1e-3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
In order to implement SGD, we have a special module in SKlearn that helps us to implement Stochastic Gradient Descent, so all we need is to pre-process the dataset and split it into the training and testing dataset, now all we need to do is to fit our dataset into the SGD regressor.
This blog holds valuable information and insights about the difference between Gradient descent and Stochastic Gradient Descent algorithm. We can now safely conclude that gradient descent does not work properly or tends to get very slow when the dataset is huge while SGD can handle big data with ease.
(Read also: What is LightGBM Algorithm?)
However, every technique has some or other kind of limitation, and in the case of SGD the problem lies in its expensive computation and noise.
Elasticity of Demand and its Types
READ MORE5 Factors Influencing Consumer Behavior
READ MOREWhat is PESTLE Analysis? Everything you need to know about it
READ MOREAn Overview of Descriptive Analysis
READ MORE5 Factors Affecting the Price Elasticity of Demand (PED)
READ MORE6 Major Branches of Artificial Intelligence (AI)
READ MOREDijkstra’s Algorithm: The Shortest Path Algorithm
READ MORE8 Most Popular Business Analysis Techniques used by Business Analyst
READ MOREWhat is Managerial Economics? Definition, Types, Nature, Principles, and Scope
READ MORE7 Types of Statistical Analysis: Definition and Explanation
READ MORE
Comments