The purpose of this post is to explain the PCA algorithm step by step, in a way that makes it easy to understand what PCA can actually do and how to use it in a project or algorithm. Before proceeding, here is a quick overview of what we cover in this post.
What is PCA?
What is Variance?
Why is Normalization Necessary for PCA?
Practical Examples of PCA
Code in Python
PCA (principal component analysis) is an unsupervised machine learning algorithm. It is mainly used for dimensionality reduction in datasets consisting of many variables that are strongly or weakly correlated with each other, while retaining as much of the variation present in the dataset as possible. It is also a great tool for exploratory data analysis and for building predictive models.
It performs a linear transformation on the data so that most of the variance or information in your high-dimensional dataset is captured by the first few principal components. The first principal component will capture the most variance, followed by the second principal component and so on.
Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, there is no redundant information. So, the total variance in the data is the sum of the variances of the individual components. Choose the number of principal components to keep according to the cumulative variance "explained" by them.
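The ideas above can be sketched with NumPy alone. This is a minimal illustration, not the implementation used later in the post: the synthetic data, the variable names, and the 95% cumulative-variance threshold are all assumptions chosen for the example.

```python
import numpy as np

# Synthetic data (illustrative only): 200 samples, 5 correlated features
# generated from 2 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions;
# eigenvalues are the variances captured along them.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of the total variance explained by each component, and the running total.
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative share reaches 95%
# (the threshold is an assumption for this sketch).
k = int(np.searchsorted(cumulative, 0.95) + 1)

# Project the data onto the first k principal components.
scores = X_centered @ eigvecs[:, :k]

print("explained variance ratio:", np.round(explained_ratio, 3))
print("components kept:", k)
```

The eigenvalues sum to the total variance of the original features, which is why the cumulative ratio can be read as "how much of the dataset's variation is kept".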
In mathematically rigorous treatments of probability, we find a formal definition of variance that is very enlightening. The variance of a random variable X is defined as:
Var(X) = E(X^2) - E(X)^2
Here E stands for expectation, that is, the mean. The definition is so simple and elegant that at first it might not even be clear what is happening. Variance is the difference between the expectation of the squared values and the square of the expectation itself.
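A small numeric check can make the formula concrete. The sample values below are made up for illustration, and the sample mean stands in for the expectation E.

```python
import numpy as np

# Illustrative sample (values chosen only for this example).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_of_squares = np.mean(x ** 2)   # E(X^2): square first, then average
square_of_mean = np.mean(x) ** 2    # E(X)^2: average first, then square

variance = mean_of_squares - square_of_mean
print(variance)        # 4.0
print(np.var(x))       # 4.0, NumPy's population variance agrees
```

Here the mean is 5, the mean of the squares is 29, and 29 - 25 gives a variance of 4.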
Normalization is necessary to put every variable on a comparable scale. Models built on features that are not scaled consistently with each other tend to perform poorly compared to those built on well-scaled features. For PCA in particular, components are chosen to capture variance, so a variable with a much larger scale can dominate the components, and we cannot extract meaningful features if two variables have a large scaling difference.
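The effect of scale can be seen in a small experiment. This is a sketch on synthetic data: the two features, their scales, and the helper name `first_component` are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two correlated features on very different scales (illustrative only).
small = rng.normal(size=n)                                  # values around 1
large = 1000.0 * (0.8 * small + 0.6 * rng.normal(size=n))   # values around 1000
X = np.column_stack([small, large])


def first_component(data):
    """Return the first principal direction and its share of the total variance."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()


# Without scaling, the large-scale feature dominates the first component.
raw_direction, raw_share = first_component(X)

# After standardizing (zero mean, unit variance), both features contribute.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_direction, scaled_share = first_component(X_scaled)

print("raw direction:   ", np.round(raw_direction, 3), "share:", round(raw_share, 3))
print("scaled direction:", np.round(scaled_direction, 3), "share:", round(scaled_share, 3))
```

On the raw data the first component points almost entirely along the large-scale feature, while on the standardized data the two features get equal weight.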
You have a dataset that includes measurements for different sensors on an engine (temperatures, pressures, emissions, and so on). While much of the data comes from a healthy engine, the sensors have also captured data from the engine when it needs maintenance. You cannot see any obvious abnormalities by looking at any individual sensor. However, by applying PCA, you can transform this data so that most variations in the sensor measurements are captured by a small number of principal components. It is easier to distinguish between a healthy and unhealthy engine by inspecting these principal components than by looking at the raw sensor data.
You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). You cannot see any obvious differences by looking at any individual variable. However, by applying PCA, you can transform this data so that most of the variation in the measurements is captured by a small number of principal components. It is easier to distinguish between red and white wine by inspecting these principal components than by looking at the raw variable data.
Before implementing the PCA algorithm in Python, you first have to download the wine dataset. The source attached below contains the wine dataset file, so download it before proceeding.
First of all, before running any algorithm, we import the required libraries and read the file with the help of pandas.
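A sketch of this loading step follows. Because the post's wine file is not reproduced here, the example uses the Wine dataset bundled with scikit-learn as a stand-in; that substitution is an assumption, and the bundled data may differ from the downloaded file (for instance, it labels three cultivars rather than red and white wine). The file name in the comment is hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_wine

# Stand-in for the downloaded file: scikit-learn's bundled Wine dataset.
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df["target"] = wine.target

# With the downloaded file, the equivalent step would look something like:
# df = pd.read_csv("Wine.csv")   # hypothetical file name

# Separate the feature columns from the class label.
X = df.drop(columns="target").to_numpy()
y = df["target"].to_numpy()

print(df.shape)
print(df.head())
```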
With the dataset loaded into a pandas DataFrame, we now split it into training and test sets. The test set is 20% of the dataset (test size 0.2) and the remaining data is the training set.
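The split might look like the sketch below. It again uses scikit-learn's bundled Wine dataset as a stand-in for the post's file, and the `random_state=0` seed is an assumption added only to make the split reproducible.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): scikit-learn's bundled Wine dataset.
X, y = load_wine(return_X_y=True)

# Hold out 20% of the rows for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # seed is an assumption
)

print("training rows:", X_train.shape[0])
print("test rows:    ", X_test.shape[0])
```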
The next step is feature scaling of the training and test sets with the help of StandardScaler.
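A minimal sketch of the scaling step, under the same stand-in data and seed assumptions as before. The scaler is fitted on the training set only and then reused on the test set, so that no information from the test set leaks into the preprocessing.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data and seed are assumptions, not the post's exact setup.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same mean and std to test data

# Each training feature now has mean close to 0 and standard deviation close to 1.
print(np.round(X_train_scaled.mean(axis=0), 3))
print(np.round(X_train_scaled.std(axis=0), 3))
```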
We apply PCA with two components, fit a logistic regression to the training set, and predict the results.
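Put together, the steps so far can be sketched as follows. The dataset is scikit-learn's bundled Wine data used as a stand-in, and the seeds are assumptions, so the accuracy printed here is not guaranteed to match the figure reported in the post.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data and seeds are assumptions, not the post's exact setup.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale using statistics from the training set only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Reduce the 13 features to 2 principal components.
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Fit logistic regression on the two components and predict the test set.
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train_pca, y_train)
y_pred = classifier.predict(X_test_pca)

accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```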
The accuracy score of the prediction is approximately 97%, which is good for predicting test set results. After predicting, we visualize the training set results using the two components.
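One simple way to visualize the training set in the space of the two components is a scatter plot colored by class. This is a sketch rather than the post's original plot: it uses the stand-in dataset, and the output file name and the choice to save the figure instead of showing it are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to display with plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data and seed are assumptions, not the post's exact setup.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale, then project the training set onto two principal components.
X_train_scaled = StandardScaler().fit_transform(X_train)
X_train_pca = PCA(n_components=2).fit_transform(X_train_scaled)

# One scatter series per class so each gets its own color and legend entry.
fig, ax = plt.subplots()
for label in np.unique(y_train):
    mask = y_train == label
    ax.scatter(X_train_pca[mask, 0], X_train_pca[mask, 1], label=f"class {label}")

ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Training set in principal component space")
ax.legend()
fig.savefig("pca_training_set.png")  # hypothetical output file name
```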
Having visualized the results on the training set, we now repeat the process for the test set and check the accuracy obtained using two components. Because we keep two components, the first component captures the largest share of the variation across the features, and the second component captures the largest share of the variation that remains.
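How much variation each of the two components captures can be read from the fitted PCA object. The sketch below uses the stand-in dataset and an assumed seed, so the printed numbers apply to that setup only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data and seed are assumptions, not the post's exact setup.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=2).fit(X_train_scaled)

# Fraction of the training set's total variance captured by each component.
ratios = pca.explained_variance_ratio_
print("PC1 share:", round(float(ratios[0]), 3))
print("PC2 share:", round(float(ratios[1]), 3))
print("combined: ", round(float(ratios.sum()), 3))

# The test set is projected with the components learned from the training set.
X_test_pca = pca.transform(X_test_scaled)
print(X_test_pca.shape)
```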
In the principal component space, you should be able to see your objects cluster in a meaningful way. This blog covered a basic introduction to the PCA algorithm. More blogs are on the way where you will learn PCA in depth. Keep reading and exploring Analytics Steps. Till then, Happy Reading!