Implementing Variance and Normalization under Principal Component Analysis with Python code

  • Tanesh Balodi
  • Sep 06, 2019
  • Machine Learning

This post walks through the PCA algorithm step by step, so that it is clear what PCA actually does and how it can be used in a project or alongside another algorithm. Before proceeding, here is a quick overview of what this post covers.

 

Topics Covered

 

  1. What is PCA?

  2. What is Variance?

  3. Why is Normalization Necessary for PCA?

  4. Practical Examples of PCA

  5. Code in Python

 

What is Principal Component Analysis (PCA)?

 

 

PCA is an unsupervised machine learning algorithm. It is mainly used for dimensionality reduction in datasets whose many variables are correlated with each other, strongly or weakly, while retaining as much of the variation present in the dataset as possible. It is also a useful tool for exploratory data analysis and as a preprocessing step when building predictive models.

 

It performs a linear transformation on the data so that most of the variance or information in your high-dimensional dataset is captured by the first few principal components. The first principal component will capture the most variance, followed by the second principal component and so on.

 

Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, there is no redundant information. The total variance in the data is therefore equal to the sum of the variances of the individual components, and the number of principal components to keep can be decided from the cumulative variance "explained" by them.
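
The relationship between component variances and total variance can be checked numerically. The sketch below is only an illustration: the data is synthetic, and the 95% threshold is an assumed cut-off rather than a rule from this post. It uses an eigendecomposition of the covariance matrix, which is one common way to compute principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 correlated features built from 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Centre the data, then eigendecompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; PCA lists components in descending order.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

total_variance = np.trace(cov)                     # sum of the per-feature variances
explained_ratio = eigenvalues / eigenvalues.sum()  # share of variance per component
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 95% (assumed threshold).
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained_ratio.round(3), n_components)
```

The eigenvalues are the variances of the individual components, so their sum matches the trace of the covariance matrix, and the columns of `eigenvectors` are mutually orthogonal.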

 

 

What is Variance?

 

 

In mathematically rigorous treatments of probability, we find a formal definition that is very enlightening. The variance of a random variable X is defined as:

Var(X) = E(X^2) - [E(X)]^2

Here E stands for expectation. The definition is so simple and elegant that at first it might not even be clear what is happening: variance is the difference between the expectation of the squared values and the square of the expectation itself.
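
A small worked example may make the formula concrete. The numbers below are made up for illustration, and each value is treated as equally likely, so the expectation is just the arithmetic mean.

```python
from statistics import pvariance

# Hypothetical sample, treated as the full population of equally likely values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(values)

mean = sum(values) / n                            # E(X)
mean_of_squares = sum(v * v for v in values) / n  # E(X^2)

variance = mean_of_squares - mean ** 2            # E(X^2) - [E(X)]^2
print(variance)  # 4.0
```

Here E(X) = 5 and E(X^2) = 29, so the variance is 29 - 25 = 4, which agrees with `statistics.pvariance(values)` from the standard library.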

 


 

Why is Normalization Necessary for PCA?

 

Normalization is necessary to put every variable on a comparable scale. PCA looks for the directions of maximum variance, so a variable measured in large units would dominate the principal components simply because of its scale. Models whose features are not scaled consistently tend to perform poorly compared with those whose features are scaled well, and PCA cannot extract meaningful components if two variables differ greatly in scale.
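
The effect of scale can be seen on a toy example. The sketch below uses two made-up, independent features whose standard deviations differ by a factor of 1000; the helper `first_component` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical, independent features on very different scales.
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])


def first_component(data):
    """Return the unit-length direction of maximum variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return eigenvectors[:, np.argmax(eigenvalues)]


# Without scaling, the first component points almost entirely along the large feature.
raw_pc1 = first_component(X)

# After standardizing (zero mean, unit variance), both features carry equal weight.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_pc1 = first_component(X_scaled)

print(np.abs(raw_pc1).round(3), np.abs(scaled_pc1).round(3))
```

On the raw data the first component is essentially the large-scale feature on its own, while after standardization both features receive the same weight in absolute value.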

 

Practical Examples of PCA

 

Example: Engine Health Monitoring

You have a dataset that includes measurements for different sensors on an engine (temperatures, pressures, emissions, and so on). While much of the data comes from a healthy engine, the sensors have also captured data from the engine when it needs maintenance. You cannot see any obvious abnormalities by looking at any individual sensor. However, by applying PCA, you can transform this data so that most variations in the sensor measurements are captured by a small number of principal components. It is easier to distinguish between a healthy and unhealthy engine by inspecting these principal components than by looking at the raw sensor data.

 

Example: Wine Detection

 

You have a dataset that includes measurements of different variables for wine (alcohol, ash, magnesium, and so on). You cannot see any obvious patterns by looking at any individual variable. However, by applying PCA, you can transform this data so that most of the variation in the measurements is captured by a small number of principal components. It is easier to distinguish between red and white wine by inspecting these principal components than by looking at the raw variable data.

 

Before implementing the PCA algorithm in Python, you first have to download the wine dataset. The source attached below contains the wine dataset file, so download it before proceeding.

 

Code In Python

 

Source: Wine.csv

First of all, before applying any algorithm, we have to import some libraries and read the file with the help of pandas.
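
The original code for this step is not reproduced here, so the following is only a sketch of how the file might be read. It assumes that every column except the last is a feature and that the last column is the class label, which should be checked against the actual Wine.csv. The helper name `load_features_and_target`, the `Label` column name, and the sample rows are hypothetical.

```python
import io

import pandas as pd


def load_features_and_target(csv_source):
    """Read a CSV and split it into a feature matrix X and a label vector y.

    Assumption: the last column is the class label and all other
    columns are features. Verify this against the real Wine.csv.
    """
    dataset = pd.read_csv(csv_source)
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values
    return X, y


# For the real file this would be: X, y = load_features_and_target("Wine.csv")
# A tiny made-up sample stands in here so the snippet runs on its own.
sample_csv = io.StringIO(
    "Alcohol,Ash,Magnesium,Label\n"
    "14.2,2.4,127,1\n"
    "12.4,2.1,88,2\n"
    "13.1,2.5,101,3\n"
)
X, y = load_features_and_target(sample_csv)
print(X.shape, y)
```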

 

 

 

Having loaded the dataset into a pandas DataFrame, we now split it into a training set and a test set. The test size is 0.2, so 20% of the dataset is held out for testing and the remaining 80% is our training data.
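
A minimal sketch of the split is shown below. So that it runs without the downloaded file, it uses scikit-learn's bundled wine dataset as a stand-in for Wine.csv; that substitution and the `random_state=0` seed are assumptions, not details taken from the original code.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled wine dataset (178 rows, 13 features).
X, y = load_wine(return_X_y=True)

# Hold out 20% of the rows for testing; the seed only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (142, 13) (36, 13)
```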

 

 

The next step is feature scaling of the training and test sets with the help of StandardScaler.
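
The scaling step might look like the sketch below, again using scikit-learn's bundled wine dataset as an assumed stand-in for Wine.csv. The scaler is fitted on the training set only and then reused on the test set, so that no information from the test set leaks into the preprocessing.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in for Wine.csv
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std from the training set
X_test = scaler.transform(X_test)        # apply the same mean and std to the test set

print(X_train.mean(axis=0).round(6))
print(X_train.std(axis=0).round(6))
```

After this step every training feature has mean 0 and standard deviation 1. The test features are close to these values but not exactly equal to them, because they are scaled with the training statistics.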

 

 

We apply the PCA algorithm with two components, fit a logistic regression model to the training set, and predict the test set results.
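
Put together with the earlier steps, this stage could be sketched as follows. The bundled scikit-learn wine dataset is used as an assumed stand-in for Wine.csv, and the `random_state` values are arbitrary choices for repeatability.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in for Wine.csv
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Project the 13 scaled features onto the first two principal components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Fit logistic regression on the two components and predict the test set.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_pca, y_train)
y_pred = classifier.predict(X_test_pca)

print(X_train_pca.shape, y_pred[:10])
```

As with the scaler, PCA is fitted on the training set only and the same projection is then applied to the test set.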

 

 

The predictions reach an accuracy score of approximately 97%, which is a good result on the test set. After predicting, we visualize the training set results using the two components.
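
One way to compute the accuracy and draw the training set in the two-component space is sketched below. It uses the bundled scikit-learn wine dataset as an assumed stand-in for Wine.csv and a plain scatter plot coloured by class, which may differ from the plot in the original post.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in for Wine.csv
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

classifier = LogisticRegression(random_state=0).fit(X_train_pca, y_train)
y_pred = classifier.predict(X_test_pca)

# Accuracy and confusion matrix on the held-out test set.
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(cm)

# Scatter plot of the training set in the two-component space, one colour per class.
fig, ax = plt.subplots()
for label in np.unique(y_train):
    mask = y_train == label
    ax.scatter(X_train_pca[mask, 0], X_train_pca[mask, 1], label=f"class {label}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Training set in principal component space")
ax.legend()
# plt.show()  # uncomment to display the figure interactively
```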

 

 

 

Having visualized the result on the training set, we now repeat the process for the test set and check the accuracy obtained with two components. Of the two components, the first captures the largest share of the variation across the features, and the second captures the largest share of the variation that remains.
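
How much variation each of the two components captures can be read from the fitted PCA object. The sketch below again relies on the bundled scikit-learn wine dataset as an assumed stand-in for Wine.csv, so the exact figures may differ from those for the downloaded file.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)  # stand-in for Wine.csv
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Share of the total variance captured by each of the two components.
ratios = pca.explained_variance_ratio_
print(ratios.round(3), ratios.sum().round(3))

# Accuracy of a classifier trained on the two components, measured on the test set.
classifier = LogisticRegression(random_state=0).fit(X_train_pca, y_train)
test_accuracy = classifier.score(X_test_pca, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

The ratios are sorted in descending order, and their sum shows how much of the original variation the two-dimensional projection keeps.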

 

 

Conclusion

 

In the principal component space, you should be able to see your objects cluster in a meaningful way. This blog covered a basic introduction to the PCA algorithm. More blogs are on the way where you will learn PCA in depth. Keep reading and exploring Analytics Steps. Till then, Happy Reading!


