
Principal Component Analysis with Python Code Example

  • Tanesh Balodi
  • Oct 28, 2020
  • Updated on: Oct 28, 2020

The fundamental purpose of this post is to explain the PCA algorithm step by step, in a way that makes it easy to understand what PCA actually does and how we can use it in a project or algorithm. Before proceeding, here is a quick overview of what we cover in this post.

 

Topics Covered

  1. What is PCA?

  2. What is Variance?

  3. Why is Normalization Necessary for PCA?

  4. Practical Examples of PCA

  5. Code in Python 

 

What is Principal Component Analysis (PCA)?

 

PCA is an unsupervised machine learning algorithm. It is mainly used for dimensionality reduction in datasets consisting of many variables, whether highly or lightly correlated with each other, while retaining as much of the variation present in the dataset as possible. It is also a great tool for exploratory data analysis and for building predictive models.

 

It is often said that the more data we have, the more accurate the results will be, and rightly so; but it is not just quantity of data that we require, we need high-quality data to get better results. Dimensionality reduction is a technique where we reduce the number of columns or features in the dataset based on their relevance to the problem: the less a feature is needed, the more likely it is to be removed from the dataset.

 

PCA performs a linear transformation on the data so that most of the variance or information in your high-dimensional dataset is captured by the first few principal components. The first principal component will capture the most variance, followed by the second principal component, and so on. 

 

Each principal component is a linear combination of the original variables. Because all the principal components are orthogonal to each other, there is no redundant information, and the total variance in the data is the sum of the variances of the individual components. The number of principal components to keep can therefore be decided from the cumulative variance ‘‘explained’’ by them.
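For instance, here is a minimal sketch, not from the original tutorial, of using scikit-learn's explained_variance_ratio_ to guide that choice; the random placeholder data and the 95% threshold are illustrative assumptions only.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # placeholder data, assumed for illustration

pca = PCA().fit(X)  # fit with all components kept
cumulative = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components whose cumulative explained variance reaches 95%
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_components, cumulative)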

 

Some techniques other than PCA for dimensionality reduction include LDA which stands for Linear Discriminant Analysis ( Related Blog: Introduction to Linear Discriminant Analysis in Supervised Learning ).

 

What is Variance?

 

In machine learning, variance is one of the most important factors that directly affect the accuracy of the output. When a machine learning model becomes too sensitive to the independent variables, it tries to fit a relationship to every feature, which gives rise to problems like ‘overfitting’, or high variance. A high-variance model picks up too much noise from the dataset, and the results suffer. When we use principal component analysis for dimensionality reduction, the problem of overfitting is often mitigated at the same time.

 


Image: a graph illustrating high variance in a dataset (understanding PCA variance).


Why is Normalization Necessary for PCA?

 

Normalization is necessary to bring every variable into proportion with the others. Models whose features are not scaled consistently tend to perform poorly compared to those that are scaled well, and PCA cannot extract meaningful components when two variables differ greatly in scale.

 

Consider two columns, one showing distance in metres and the other showing distance in kilometres: 1000 in the first column equals 1 in the second, but our model is unaware of this, so it cannot find the accurate relationship between them. Scaling, or normalization, is therefore a necessary condition for our model to perform. ( Also read: Introduction to Statistical Data Analysis )
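As a quick illustrative sketch, not part of the original post, here is that metres/kilometres example scaled with scikit-learn's StandardScaler; after scaling, the two columns become identical, so the model can treat them on an equal footing.

import numpy as np
from sklearn.preprocessing import StandardScaler

# column 0 in metres, column 1 in kilometres: the same distances recorded twice
distances = np.array([[1000.0, 1.0],
                      [2000.0, 2.0],
                      [3000.0, 3.0]])

scaled = StandardScaler().fit_transform(distances)
print(scaled)  # both columns now have zero mean and unit variance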

 

 

Practical Examples of PCA

 

Example: Engine Health Monitoring

 

You have a dataset that includes measurements for different sensors on an engine (temperatures, pressures, emissions, and so on). While much of the data comes from a healthy engine, the sensors have also captured data from the engine when it needs maintenance. You cannot see any obvious abnormalities by looking at any individual sensor. However, by applying PCA, you can transform this data so that most variations in the sensor measurements are captured by a small number of principal components. It is easier to distinguish between a healthy and unhealthy engine by inspecting these principal components than by looking at the raw sensor data.

 

Example: Wine Detection

 

You have a dataset that includes measurements for different variables on wine (alcohol, ash, magnesium, and so on). You cannot see any obvious abnormalities by looking at any individual variables. However, by applying PCA, you can transform this data so that most variations in the measurements of the variables are captured by a small number of principal components. It is easier to distinguish between red and white wine by inspecting these principal components than by looking at the raw variable data.

Before implementing the PCA algorithm in Python, you first have to download the wine dataset. The source link below contains the wine dataset file, so download it before proceeding.

 

Code in Python

Source: Wine.csv

First of all, before running the algorithm, we have to import some libraries and read the file with the help of pandas.


#import the libraries needed throughout this tutorial
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from matplotlib.colors import ListedColormap

#read the downloaded file into a pandas DataFrame
dataset = pd.read_csv('Wine.csv')
dataset.head()

Image: the head of the wine dataset, showing its various features.


Having loaded the dataset into a pandas DataFrame, we now split it into training and test sets, with the test set taking 20% of the data and the remaining 80% used for training.


#split into independent variables (features) and the dependent variable (label)
x = dataset.iloc[:, 0:13].values
y = dataset.iloc[:, 13].values

#splitting the dataset into a training set and a test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)


The next step is to apply feature scaling to the training and test sets with the help of StandardScaler.


from sklearn.preprocessing import StandardScaler

#standardize features to zero mean and unit variance
sc = StandardScaler()
x_train = sc.fit_transform(x_train)  # fit the scaler on the training set only
x_test = sc.transform(x_test)        # reuse the training-set statistics

Now we apply PCA with two components, fit logistic regression to the training set, and predict the results.


from sklearn.decomposition import PCA

#project the scaled features onto the first two principal components
pca = PCA(n_components=2)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
explained_variance = pca.explained_variance_ratio_



#fitting logistic regression to the training set

from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state = 0)

classifier.fit(x_train, y_train)



#predicting results



y_pred = classifier.predict(x_test)

print("accuracy score:", accuracy_score(y_test,y_pred))


As we can see, our accuracy score came out to 0.9722, approximately 97%, which is good for predicting the test set results. After predicting, we visualize our training set results using the two principal components.


#visualising the Training set results
X_set, y_set = x_train, y_train

#build a fine grid covering the range of the two principal components
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1,
                               step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1,
                               step = 0.01))

#colour each grid point by the class the classifier predicts there
plt.contourf(X1, X2, classifier.predict(
    np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

#overlay the training points, coloured by their true class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)

plt.title('PCA using Logistic Regression (Training set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()


Image: the training-set results of logistic regression on the two principal components.


Having visualized the results on the training set, we now do the same for the test set and check how well the two components separate the classes. With two components, the first principal component captures the largest share of the variation between the features, and the second captures the next largest share.
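The original post does not show the test-set code, but a sketch mirroring the training-set plot above would look like this; only X_set, y_set, and the title change.

#visualising the Test set results (same pattern as the training plot)
X_set, y_set = x_test, y_test

X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1,
                               stop = X_set[:, 0].max() + 1,
                               step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1,
                               stop = X_set[:, 1].max() + 1,
                               step = 0.01))

plt.contourf(X1, X2, classifier.predict(
    np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
    alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())

for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j)

plt.title('PCA using Logistic Regression (Test set)')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.show()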

 

Conclusion

 

In the principal component space, you should be able to see your objects cluster in a meaningful way. In this blog, we covered a basic introduction to the PCA algorithm. More blogs are on the way where you will learn PCA and other machine learning algorithms in depth. Keep reading and exploring Analytics Steps. Till then, Happy Reading!
