Introduction To Principal Component Analysis In Machine Learning

  • Rohit Dwivedi
  • May 07, 2020
  • Machine Learning
  • Updated on: Jan 19, 2021

Have you come across a situation where you have so many variables that you are unable to understand the relationship between them? In that situation you are at risk of overfitting your model to the data. 


In these kinds of situations you need to reduce your feature space to understand the relationships between the variables, which also lowers the chance of overfitting. Reducing the dimension of the feature space is called “dimensionality reduction”. It can be achieved either by “feature exclusion” or by “feature extraction”.


Feature exclusion means dropping variables and keeping only those features that can be used to predict the target, whereas feature extraction means deriving new features from the existing ones. Suppose we have 5 independent features and we create 5 new features from combinations of the old 5; this is how feature extraction works.  
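To make the distinction concrete, here is a minimal sketch (not from the original post) that contrasts the two approaches on the Iris data, using scikit-learn's SelectKBest for feature exclusion and PCA for feature extraction:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 4 original features

# Feature exclusion: keep the 2 original features most related to the target
X_excl = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as combinations of all 4
X_extr = PCA(n_components=2).fit_transform(X)

print(X_excl.shape, X_extr.shape)  # both end up with 2 columns
```

Exclusion keeps a subset of the original columns; extraction replaces them with new composite columns.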



What is Principal Component Analysis?


Principal Component Analysis, also known as PCA, is a feature extraction method in which we create new independent features from combinations of the old features and keep only those that are most important for predicting the target. New features are extracted from old features, and any feature considered to contribute little to the target variable can be dropped.




PCA is a technique that combines the variables in such a way that the least important features can be dropped. All the new features it creates are independent of each other. 


  • The concept behind PCA is to find an accurate representation of the data in a lower-dimensional space.



Image: Source

In both pictures above, the data points (black dots) are projected onto a line; the second line is closer to the actual points (smaller projection errors) than the first. 


  • The good line for projection lies in the direction of the largest variance. 

  • The coordinate system needs to be modified to retrieve a 1-D representation for vector y after the data is projected onto the best line.

  • In the direction of the green line, the new data y and the old data x have the same variance.

  • PCA maintains the maximum variance in the data.

  • Doing PCA on n dimensions generates a new set of n dimensions. Principal component 1 captures the maximum variance in the underlying data, and principal component 2 is orthogonal to it.
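The idea of projecting onto the line of largest variance can be illustrated with a small NumPy sketch (the data and directions here are made up for illustration): projecting onto the high-variance direction preserves far more spread than projecting onto the orthogonal one.

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 2-D points with most of their spread along the 45-degree direction
t = rng.normal(0, 3, 500)            # large spread along [1, 1]
n = rng.normal(0, 0.5, 500)          # small spread along [1, -1]
x = np.outer(t, [1, 1]) / np.sqrt(2) + np.outer(n, [1, -1]) / np.sqrt(2)

# Variance of the projections onto two candidate lines (unit vectors)
v_good = np.array([1.0, 1.0]) / np.sqrt(2)   # direction of largest variance
v_bad = np.array([1.0, -1.0]) / np.sqrt(2)   # orthogonal direction

print(np.var(x @ v_good))   # roughly 3**2 = 9
print(np.var(x @ v_bad))    # roughly 0.5**2 = 0.25
```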



When to Use PCA?


Case 1: When you want to reduce the number of variables but cannot identify which variables to drop from consideration.


Case 2: When you want to check whether the variables are independent of each other.


Case 3: When you are prepared to make the independent features less interpretable.


In all three of the above cases, you can use PCA.



Mechanism Of Principal Component Analysis


Principal Component Analysis steps


  • Start by standardizing the data. 

  • Create a correlation matrix or covariance matrix for all the desired dimensions.

  • Calculate the eigenvectors, which are the principal components, and the corresponding eigenvalues, which capture the magnitude of variance. 

  • Arrange the eigenpairs in decreasing order of eigenvalue and pick the pair with the largest value; this is the first principal component, which preserves the maximum information from the original data.
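The four steps above can be sketched in NumPy on toy data (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))          # toy data: 200 samples, 3 features

# 1. Standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std.T)

# 3. Eigenvectors (principal components) and eigenvalues (variance magnitudes)
vals, vecs = np.linalg.eigh(cov)

# 4. Sort the eigenpairs by eigenvalue, largest first
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

print(vals)  # vals[0] is the variance captured by the first principal component
```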


Principal Component Analysis (Performance issues)


  • The effectiveness of PCA depends directly on the scale of the attributes: if attributes are on different scales, PCA will favor those with the largest scale, regardless of correlation.
  • The results of PCA change if a variable's scale is changed. 
  • PCA can be challenging to interpret in the presence of discrete data. 
  • The effectiveness of PCA can be affected by skewed data with long, thick tails.
  • PCA is ineffective when the relationships between attributes are non-linear.
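The first point, sensitivity to scale, is easy to demonstrate. In this illustrative sketch, two independent features with very different scales give a first principal component dominated by the larger-scale feature until the data is standardized:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two independent features; the second has a much larger scale
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 100, 300)])

# Without scaling, PC1 is dominated by the large-scale second feature
pc1 = PCA(n_components=1).fit(X).components_[0]
print(np.abs(pc1))      # weight on feature 2 is close to 1

# After standardizing, both features get comparable weights
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = PCA(n_components=1).fit(X_std).components_[0]
print(np.abs(pc1_std))
```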


PCA for dimensionality reduction


  • PCA is also used for reducing the number of dimensions.

  • Arrange the eigenvectors in descending order according to their eigenvalues.

  • Plot the graph of the cumulative eigenvalues.

  • Eigenvectors that contribute little to the total of the eigenvalues can be removed from the analysis.



Plot of PCA and Variance Ratio
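In practice, scikit-learn's PCA exposes explained_variance_ratio_, which reduces this cumulative-variance procedure to a few lines; the 95% threshold below is an illustrative choice, not a rule:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X_std)

# Cumulative proportion of variance explained, largest components first
cum = np.cumsum(pca.explained_variance_ratio_)
print(cum)

# Keep the smallest number of components covering 95% of the variance
k = int(np.searchsorted(cum, 0.95) + 1)
print(k)
```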


Hands on Principal Component Analysis


The dataset to which we will apply PCA is the Iris dataset, which can be downloaded from the UCI Machine Learning Repository. 

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# importing plotting libraries
import matplotlib.pyplot as plt 
from scipy.stats import zscore
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
X_std = StandardScaler().fit_transform(X)
cov_matrix = np.cov(X_std.T)
print('Covariance Matrix \n%s' % cov_matrix)



Covariance matrix



  • Importing the necessary libraries and the dataset.
  • Scaling the data using a standard scaler.
  • Computing covariance matrix.


X_std_df = pd.DataFrame(X_std)
axes = pd.plotting.scatter_matrix(X_std_df)



Scatter matrix of scaled data

  • Plotted the scatter matrix of the scaled data.
  • Calculated the eigenvectors and eigenvalues.

eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
# eigenvectors are the columns of eig_vecs
eigen_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)


Cumulative variance explained

plt.figure(figsize=(6, 4))
plt.bar(range(4), var_exp, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(4), cum_var_exp, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc='best')
plt.show()



Principal components VS variance ratio

The first three principal components explain 99% of the variance in the data.

The three principal components will need to be named, because each represents a composite of the original dimensions.
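A sketch of this final step using scikit-learn (the names PC1–PC3 are arbitrary placeholders; the component loadings shown are what would guide more meaningful names):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)

# Keep the first three components and give them placeholder names
pca = PCA(n_components=3)
scores = pca.fit_transform(X_std)
df = pd.DataFrame(scores, columns=['PC1', 'PC2', 'PC3'])

# Loadings show how each original feature contributes to each component,
# which is what guides the naming
loadings = pd.DataFrame(pca.components_, columns=iris.feature_names,
                        index=['PC1', 'PC2', 'PC3'])
print(loadings.round(2))
```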


The Jupyter notebook that contains the code for applying PCA to the Iris dataset can be found here.


Advantages of PCA


  • No redundancy in the data, since the components are orthogonal.
  • Noise reduction, since the basis with maximum variation is chosen, so small background variations are automatically ignored.


Disadvantages of PCA


  • It is difficult to evaluate the covariance matrix properly.
  • Even simple invariances cannot be captured by PCA unless the training data explicitly provides that information.





I will conclude the blog by stating the importance of PCA: it plays a very unique and important role. In this blog, I have introduced PCA and the scenarios in which to use it. I have also laid out the steps for performing PCA, discussed its performance issues and how it can be used for dimensionality reduction, and covered the advantages and disadvantages of using PCA.