This post walks through the PCA algorithm step by step, so that anyone can understand what PCA actually does and how to use it in a project or algorithm. Before proceeding, here is a quick overview of what this post covers.
What is PCA?
What is a Variance?
Why is Normalization Necessary for PCA?
Practical Examples of PCA
Code in Python
PCA (principal component analysis) is an unsupervised machine learning algorithm. It is mainly used for dimensionality reduction in datasets with many variables that are strongly or weakly correlated with each other, while retaining as much of the variation present in the dataset as possible. It is also a useful tool for exploratory data analysis and for building predictive models.
It is often said that the more data we have, the more accurate our results will be. That is true, but quantity alone is not enough: we need high-quality data to get better results. Dimensionality reduction is a technique for reducing the number of columns, or features, in a dataset based on their relevance to the problem: the less a feature is needed, the more likely it is to be removed.
PCA performs a linear transformation on the data so that most of the variance or information in your high-dimensional dataset is captured by the first few principal components. The first principal component will capture the most variance, followed by the second principal component, and so on.
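The ordering of the components by captured variance can be checked with a small sketch. The data here is synthetic and purely an assumption for illustration (five features driven by two hidden factors plus a little noise); it is not the dataset used later in this post.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 5 observed features
# generated from 2 hidden factors, plus a small amount of noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# Keep all components so the ratios cover the whole variance.
pca = PCA()
pca.fit(X)

# Fraction of the total variance captured by each component,
# listed from the first principal component onwards.
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))
```

Because the features were built from only two hidden factors, the first two ratios should account for nearly all of the variance, and each ratio is no larger than the one before it.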
Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, they carry no redundant information. The total variance in the data is therefore the sum of the variances of the individual components, so you can decide how many principal components to keep according to the cumulative variance "explained" by them.
Other techniques for dimensionality reduction include LDA, which stands for Linear Discriminant Analysis (Related blog: Introduction to Linear Discriminant Analysis in Supervised Learning).
In machine learning, variance is one of the most important factors that directly affect the accuracy of the output. When a model becomes too sensitive to the independent variables, it tries to fit a relationship to every feature, which gives rise to 'overfitting', or high variance. A high-variance model fits the noise in the dataset, and the results suffer. Using principal component analysis for dimensionality reduction can help mitigate overfitting at the same time, because the model sees fewer features.
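A minimal sketch of these properties, again on synthetic data assumed only for illustration. The 0.95 threshold is a hypothetical choice, not a recommendation from this post.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic correlated features (assumed for illustration).
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))

pca = PCA().fit(X)

# Rows of components_ are the principal axes; orthonormal axes
# give the identity matrix when multiplied by their transpose.
components = pca.components_
is_orthonormal = np.allclose(components @ components.T, np.eye(len(components)))

# Total variance of the original features versus the sum of the
# variances along the principal components (both use n - 1).
total_variance = X.var(axis=0, ddof=1).sum()
sum_component_variance = pca.explained_variance_.sum()

# Smallest number of components whose cumulative explained
# variance reaches the (hypothetical) threshold.
threshold = 0.95
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(is_orthonormal, total_variance, sum_component_variance, n_keep)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it selects the number of components needed to reach that fraction of explained variance.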
Understanding PCA variance
Normalization is necessary to bring every variable onto a comparable scale. Models trained on features that are not scaled consistently tend to perform poorly compared with those trained on well-scaled features. PCA cannot extract meaningful components if two variables differ greatly in scale, because the variable with the larger numeric range dominates the variance.
Consider two columns, one showing a distance in meters and the other showing the same distance in kilometers: 1000 in column one equals 1 in column two. The model is unaware of this, so it cannot find an accurate relationship between them. Scaling, or normalization, is therefore a necessary condition for the model to perform well. (Also read: Introduction to Statistical Data Analysis)
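The meters and kilometers case can be sketched as follows. The distance values are made up for illustration; the point is only that the two columns carry the same information yet have very different raw variances until they are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up distances (assumed for illustration), recorded in two units.
distance_km = np.array([1.0, 2.5, 4.0, 7.5, 10.0])
distance_m = distance_km * 1000.0
X = np.column_stack([distance_m, distance_km])

# Before scaling, the meters column has a variance a million
# times larger than the kilometers column.
raw_variances = X.var(axis=0)

# StandardScaler subtracts each column's mean and divides by its
# standard deviation, so both columns end up on the same scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_variances = X_scaled.var(axis=0)

print(raw_variances)
print(scaled_variances)
```

After standardization the two columns are numerically identical, which makes it visible that they describe the same quantity.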
You have a dataset that includes measurements for different sensors on an engine (temperatures, pressures, emissions, and so on). While much of the data comes from a healthy engine, the sensors have also captured data from the engine when it needs maintenance. You cannot see any obvious abnormalities by looking at any individual sensor. However, by applying PCA, you can transform this data so that most variations in the sensor measurements are captured by a small number of principal components. It is easier to distinguish between a healthy and unhealthy engine by inspecting these principal components than by looking at the raw sensor data.
You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). You cannot see any obvious abnormalities by looking at any individual variables. However, by applying PCA, you can transform this data so that most variations in the measurements of the variables are captured by a small number of principal components. It is easier to distinguish between red and white wine by inspecting these principal components than by looking at the raw variable data.
Before implementing the PCA algorithm in Python, you first have to download the wine dataset. The source attached below contains the wine dataset file, so download it before proceeding.
First, before applying any algorithm, we import the required libraries and read the file with the help of pandas.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap

# read the wine dataset file into a pandas DataFrame
dataset = pd.read_csv(r'dataset')
dataset.head()
Now that the dataset is loaded into a pandas DataFrame, we split it into a training set and a test set. The test set is 0.2 of the dataset and the remaining data is the training set.
# split into independent (x) and dependent (y) variables
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

# split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
The next step is feature scaling of the training and test sets with the help of StandardScaler.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# fit the scaler on the training set only, then apply it to both sets
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
We apply PCA with two components, fit a logistic regression to the training set, and predict the test set results.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
# fit PCA on the scaled training set, then project both sets onto the two components
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_

# fitting logistic regression to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# predicting results
y_pred = classifier.predict(x_test)
print("accuracy score:", accuracy_score(y_test, y_pred))
The accuracy score comes out to 0.97222222, approximately 97%, which is good for predicting the test set results. After predicting, we visualize our training set results using the two components.
# visualising the Training set results
X_set, y_set = x_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
# colour each region of the plane by the class the classifier predicts there
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green', 'blue'))(i), label=j)
plt.title('PCA using Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()
Logistic regression to implement PCA
Having visualized the results on the training set, we can follow a similar process for the test set and check the accuracy obtained with two components. Of the two components, the first captures the largest share of the variation among the features, and the second captures the largest share of the variation that remains.
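For readers who want to check the test set accuracy without the CSV file, the whole workflow can be sketched in one self-contained block. This is an assumption-laden stand-in, not the code used above: it loads scikit-learn's bundled wine data through `load_wine` instead of the post's file, and it chains the steps with `make_pipeline` so that scaling and PCA are fitted on the training set only.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data stands in for the
# CSV file used in this post (13 features, 3 classes).
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale, project onto two principal components, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(random_state=0),
)
model.fit(X_train, y_train)

# Accuracy on data the model has not seen during fitting.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Share of the variance captured by PC1 and PC2.
explained = model.named_steps["pca"].explained_variance_ratio_

print("test accuracy:", test_accuracy)
print("variance explained by PC1, PC2:", explained)
```

Wrapping the steps in a pipeline is a design choice rather than a requirement; it simply makes it harder to refit the scaler or PCA on the test set by accident.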
In the principal component space, you should be able to see your objects cluster in a meaningful way. This blog covered a basic introduction to the PCA algorithm. More blogs are on the way where you will learn PCA and other machine learning algorithms in depth. Keep reading and exploring Analytics Steps. Till then, Happy Reading!
motwanikushal14, Nov 09, 2020