How to do Feature Scaling In Machine Learning Using Python

Tanesh Balodi
Jul 27, 2021

Scalability is one of the fastest-growing topics in machine learning and big data. When we deploy a machine learning model and integrate it with the web, it may work fine with a limited user base, but as the user base grows, the model might break down, which would mean that the model is not yet scalable. So what exactly is scalability in machine learning, and how do we implement it? That is what we are going to discuss in this blog.

Most of the time, a problem like scalability is not handled before deploying the model, but that does not mean that we cannot address it beforehand. We will discuss a few ways to scale a machine learning model for big data.

What is Scalability?

Suppose you build a piece of software and deploy it. After some time, as the user base steadily grows, do you see a change in the characteristics of your software? The common answer would be a big ‘NO’. But is deploying software the same as deploying a machine learning model? It is not.

As your machine learning model gets more and more users, the data also grows. Machine learning is all about predictions and accuracy, so as the user base increases, there is a high chance that the behaviour of the model will change. This change could be positive for the model, or it could be negative.

This is the main reason we need scalability in machine learning, and it is also why scaling is usually not dealt with before deployment.

Scaling the Machine Learning Dataset

There are a few methods by which we can scale the dataset, which in turn helps in scaling the machine learning model. One such method is called ‘feature scaling’. The question is: which types of machine learning algorithms actually need scaled data?

Algorithms that rely on gradient descent or distance formulas, such as KNN, K-means, logistic regression, and linear regression, need properly scaled data to perform well. Decision trees, by contrast, split on one feature at a time and are generally not sensitive to the scale of the features.

A few algorithms that need scaling

What is Feature Scaling?

Let’s discuss feature scaling in detail. Consider two values in a dataset, ‘300cm’ and ‘3m’. We know that 1m is equal to 100cm, so both values are one and the same. The problem is that our model only sees the numbers, not the units, so for our machine learning model the value 300 is far larger than the value 3.

Feature scaling removes this difference by bringing all the values onto a comparable scale, for example the range of 0 to 1. Two methods are commonly used for feature scaling in machine learning, normalization and standardization. Let's discuss them in detail:
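
The effect of units on a distance-based algorithm can be checked with a few lines of NumPy. The sketch below is only an illustration: the three people, their heights and weights, and the helper name nearest_to_first are all made up for this example. It shows that the nearest neighbour of the first person changes when height is written in centimetres instead of metres.

```python
import numpy as np

def nearest_to_first(points):
    """Return the index of the row closest to row 0 (Euclidean distance)."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

# Hypothetical data. Columns: height in metres, weight in kg.
people_m = np.array([[1.60, 60.0],
                     [1.90, 61.0],
                     [1.61, 70.0]])

# Same people, but height is now expressed in centimetres.
people_cm = people_m * np.array([100.0, 1.0])

nearest_m = nearest_to_first(people_m)    # weight dominates the distance
nearest_cm = nearest_to_first(people_cm)  # height dominates the distance
print(nearest_m, nearest_cm)  # 1 2
```

Nothing about the people changed between the two arrays, only the unit of one column, yet the answer is different. This is the kind of problem feature scaling is meant to remove.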

Normalization

One of the scaling techniques used is known as normalization. Here, scaling is done in order to bring all the features within the range of 0 to 1. This is also known as min-max normalization. The formula for min-max normalization is written below:

Normalization = (x - x_minimum) / (x_maximum - x_minimum)

Here, x_minimum is the minimum value of the feature and x_maximum is the maximum value of the feature.
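
As a small worked example of the formula, the sketch below applies it to four made-up feature values. The function name min_max_normalize and the numbers are assumptions chosen for illustration, not part of any dataset used in this blog.

```python
import numpy as np

def min_max_normalize(x):
    """Apply (x - x_minimum) / (x_maximum - x_minimum) to a 1-D array."""
    x = np.asarray(x, dtype=float)
    x_minimum, x_maximum = x.min(), x.max()
    return (x - x_minimum) / (x_maximum - x_minimum)

feature = [10, 20, 30, 50]  # hypothetical feature values
normalized = min_max_normalize(feature)
print(normalized.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls in between. Note that this sketch divides by zero if all the values are equal, so a constant feature would need separate handling.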

Let’s implement normalization using Python:

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd



We import numpy, pandas, and matplotlib for numerical operations, data reading/manipulation, and graphical representation of data, respectively.



dataset = pd.read_csv('dataset.csv')

dataset = dataset.values

dataset.shape

(42000, 785)

In this step, we read the dataset, convert it to a NumPy array with dataset.values, and check the number of rows and columns with dataset.shape.

X, y = dataset[:, 1:], dataset[:, 0]

X.shape, y.shape

((42000, 784), (42000,))



Next, we divide the dataset into X and y: the first column holds the labels (y) and the remaining 784 columns hold the features (X). We can note the shapes of the two arrays.



y

array([1, 0, 1, ..., 7, 6, 9])



We can see the labels inside ‘y’. All we need to do now is apply the min-max scaler to our dataset.



# Min-Max Scaler



X = (X - X.min()) / (X.max() - X.min())

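We have applied the min-max scaler formula using .max() to get the maximum value and .min() to get the minimum value. Note that, called without an axis argument, these functions return the minimum and maximum of the whole array, so every feature is scaled by the same two numbers. That works when all the features share one range; otherwise, the minimum and maximum should be computed per column.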
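
When the features have different ranges, the same formula can be applied column by column. The following is a minimal sketch under assumptions: the function name min_max_per_column and the small demo array are hypothetical, and mapping constant columns to 0 is one possible choice rather than the only one.

```python
import numpy as np

def min_max_per_column(X):
    """Scale each column of a 2-D array to 0..1 using its own min and max."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has a range of 0; divide by 1 instead so it maps to 0.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Hypothetical data: column 0 spans 0..10, column 1 spans 100..300,
# and column 2 is constant.
X_demo = np.array([[0.0, 100.0, 5.0],
                   [5.0, 200.0, 5.0],
                   [10.0, 300.0, 5.0]])

X_scaled = min_max_per_column(X_demo)
print(X_scaled.tolist())
# [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1.0, 1.0, 0.0]]
```

With a single global minimum and maximum, column 0 of this demo array would be squeezed into roughly 0 to 0.03, because the global maximum of 300 comes from column 1.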

Standardization

Standardization is another scaling technique. It uses the mean and standard deviation to standardize the dataset, and the result is not bound to any fixed range. Let’s look at the formula:

Standardization = (x - mean)/ standard deviation
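
The formula can be tried on a handful of numbers before moving to a library. The values below are made up so that the arithmetic comes out cleanly; the sketch assumes the population standard deviation, which is what NumPy's .std() computes by default.

```python
import numpy as np

# Hypothetical feature values chosen so that mean = 5 and standard deviation = 2.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()  # 5.0
std = values.std()    # 2.0 (population standard deviation, ddof=0)

standardized = (values - mean) / std
print(standardized.tolist())
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Each result says how many standard deviations a value lies above or below the mean, which is why the output can be negative or greater than 1.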

Because standardization does not force the values into a fixed range, it is generally less sensitive to outliers than normalization, where a single extreme value sets the minimum or maximum and compresses all the other values into a small part of the 0-to-1 range. Outliers still influence the mean and standard deviation, however.

Normalization is most commonly used with neural networks, k-means clustering, KNN, and other algorithms that do not assume any particular distribution of the data, while standardization is mainly used with algorithms that do assume a distribution.
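
To see how a single outlier interacts with each technique, both formulas can be run on the same data. This is only a sketch with made-up numbers, where 100 plays the role of the outlier.

```python
import numpy as np

# Hypothetical values; 100 is the outlier.
data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max normalization: the outlier becomes 1 and defines the range.
normalized = (data - data.min()) / (data.max() - data.min())

# Standardization: values are expressed in standard deviations from the mean.
standardized = (data - data.mean()) / data.std()

print(np.round(normalized, 3).tolist())
print(np.round(standardized, 3).tolist())
```

In the normalized output, the four ordinary values all end up below 0.05 because the outlier defines the maximum. The standardized output is not capped at 1, although the outlier still pulls the mean and standard deviation towards itself.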

In order to implement standardization, we can use the sklearn library as shown below:


from sklearn.preprocessing import StandardScaler

from sklearn.datasets import load_boston



We import StandardScaler from sklearn, a Python library, to perform standardization, along with the Boston dataset from sklearn.datasets. Note that load_boston was removed in scikit-learn 1.2, so this import only works on older versions.



X, y = load_boston(return_X_y=True)



Here, we are splitting the dataset into the independent variables (X) and the dependent variable (y).



print(X.shape)

(506, 13)



print(y.shape)

(506,)



Printing the shapes of the independent and dependent variables respectively, we can see that ‘X’ has 506 rows and 13 columns, whereas ‘y’ is a one-dimensional array of 506 values.



scaler = StandardScaler()

standardization = scaler.fit_transform(X)



We assign a StandardScaler object to the variable ‘scaler’ and use its fit_transform method, which computes the mean and standard deviation of each column of X and returns the standardized data.



print(standardization)

In our final step, we print the standardized values so that we can inspect them ourselves.

Normalization and standardization are among the most commonly used preprocessing steps in machine learning and deep learning, so the above Python implementation should help in building a model with proper feature scaling.
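
The same steps can be tried on any numeric array. The sketch below uses randomly generated data in place of a real dataset, so the arrays X_train and X_test and their values are assumptions made for illustration. It also shows one common pattern that is not covered above: fitting the scaler on one set of data and reusing the learned mean and standard deviation on new data with transform.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales.
X_train = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(50, 2))

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_test_std = scaler.transform(X_test)        # reuse the learned mean/std

# Each training column now has mean close to 0 and standard deviation close to 1.
print(X_train_std.mean(axis=0).round(6))
print(X_train_std.std(axis=0).round(6))

# The result matches the formula (x - mean) / standard deviation.
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_std))
```

The attributes mean_ and scale_ hold the per-column mean and standard deviation that the scaler learned during fitting.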

Conclusion

To conclude, scaling the dataset is often key to getting good accuracy from a machine learning model. Feature scaling techniques like normalization and standardization are practical and easy to implement. A few of the benefits of feature scaling are that it can make training faster, helps algorithms that use gradient descent converge to a minimum, and often gives a better-optimized result.

We have also discussed the problem that outliers cause when using normalization, so by keeping these points in mind, we can achieve better results.