How to do Exploratory Data Analysis before building Machine Learning models?

  • Rohit Dwivedi
  • Apr 19, 2020
  • Machine Learning
How to do Exploratory Data Analysis before building Machine Learning models? title banner

When I was thinking about what could be my next blog all about considering data in my mind so the very first thing that popped into my mind was the first major step that plays a very crucial role in terms of data analysis that is Exploratory data analysis.


What is exploratory data analysis all about


Imagine you are going on a trip to some place. You always make a list of places you would visit. You check about places where you can stay. You make a complete itinerary. In short before going on a trip whatever investigations and planning you do is nothing but similar to exploratory data analysis which is done by data scientists.  


Exploratory data analysis is a task performed by data scientists to get familiar with the data. All the initial tasks you do to understand your data well are known as EDA.


Let's take an example to know more about EDA. I have taken two datasets, one from the Kaggle website which is called the Pima Indian diabetes database and another from UCI Machine Learning Repository that is the Iris dataset. Let us do EDA on both the datasets.


1. How to import the datasets?


Using function present in pandas we have imported both the data files.


After downloading the dataset you can import your dataset using a function in pandas called pd.read_csv. You can read the full documentation of pandas here.


2. Printing the first 5 rows of the dataset to see the first view of the dataset


View of first 5 rows of pima data-set.



View of first 5 row present in iris dataset.


3. What is Shape of a dataset?


The shape of the dataset is basically a representation of total rows and columns present in the dataset. You can explore .shape() function present in the pandas' package here. In the Pima diabetic dataset, we have 768 rows and 9 columns, similarly, in the iris dataset, we have around 150 rows and 5 columns.


Shape of both the datasets.


4. Descriptive statistics of the data-sets


In pandas, describe() function is used to view central tendency, mean, median, standard deviation, percentile & many other things to give you the idea about the data.


Generate descriptive statistics of both the data-sets.


5. Checking about correlation between features in a dataset


There is a function in the panda's package which allows you to check about the correlation between features which is pd.DataFrame.corr(). It calculates the correlation between features pairwise excluding null values. I have used this function to compute the correlation between features in the Pima dataset which is shown in the below image. 


Correlation between features in pima diabetic data-set.


6. Checking about data types and more information about the data


There is a function present in the pandas' package known as which returns the data type of each column present in the dataset. Also, it tells you about null and not null values present. So, in our dataset, we have even int64 data types values and also float64 data type values. 


Information about pima-dataset.


7. Checking about missing values in the data


Missing values in the data can be checked by using isnull() function present in pandas documentation. It returns the boolean values that are true and false. If you want to calculate how many missing values are present in each column in the data set you can make use of the function isnull().sum(). This function returns the total number of missing values in each column. 


In our case, in both the data sets we did not get any of the missing values in any of the columns.


Output of the isnull() function present in pandas to check about missing values.

Total number of missing values in the data-sets.


7. If missing values are present then how to impute them?


For various scenarios, while dealing with data you will come across real-world data which will have missing values like nan values, -, blanks. The basic approach to deal with such a situation is to drop/ remove the entire row or column which contains missing values. But dropping is not advisable because there will be a loss of data as well which can result in important parts of the data being removed. So, to deal with such things there are different methods used to impute the missing values. 


There are two ways by which missing values can be imputed: first is called univariate imputation and the other one is multivariate imputation. 


Univariate imputation is a type of imputation which imputes missing values considering only the non-missing values in that feature dimension. (e.g. impute.SimpleImputer). 


On the other hand, a multivariate imputer imputes the missing values considering all available features dimension.(e.g. impute.IterativeImputer). You can read more about imputing missing values here.



8. Encoding categorical features 


Often it is seen that we do not have continuous values in our features. There are sometimes categorical values. And the system cannot understand such values so there is a need to convert them to continuous numerical values. 


As seen in the below iris data frame we have classes as categorical features which are - ‘Iris-setosa’, ‘Iris-versicolor’, ‘Iris-virginica’.


How to encode them?



Unique values in class column of the iris data-set which contains categorical values.


There are several different techniques which are used to encode categorical values which are stated below:


a) LabelEncoder() - It is a function present in the scikit- learn library of python which is used to convert categorical values in numerical values.


Hands-on implementation of encoding using LabelEncoder.


Here we have imported LabelEncoder from sklearn.preprocessing followed by initialising of the object through which we will use the label encoder. We have made an object called “LE”. Then we have transformed our class column by using the LE.fit_transform function & printed the transformed class which is now [0,1,2]. It has given the values to Iris-setosa - 0, Iris-versicolor  - 1 , Iris-virginica - 2.


b) get_dummies() - Converts categorical features into dummy variables. 


c) OneHotEncoder() - Array-like of integers or strings is the required input for this encoder. The features are encoded using a one-hot encoding scheme. The result is a binary column for each category and reverts a sparse matrix. 



9. Standardization of data 


Standardization of data is a major important step that is required for machine learning algorithms to give good results. There are different scaling functions present in the preprocessing module of sci-kit learn. If data is not scaled and is passed to the algorithm the result might be wrong due to wrongly distributed data.


Why is it important to scale the data?


It is usually seen that we ignore checking the shape of the distribution and change the data to be centered. That is done by removing the mean values of each column and then scaling it by dividing non-constant columns by their standard deviation.


Different functions that are used by algorithms to learn to assume that all the desired features are centered as zero and also their variance is in the same structure. If any of the features have a higher proportion than all other features it may dominate the function for learning algorithm and does not allow learning from other features as required. 


a) Scale :  present in the pre-processing module gives a fast and effective way to do this operation on a single array-like data:


Hand-on implementation of scaling using scale.


X_scaled has now unit variance and zero mean as you can see in the below image.


Scaled data has zero mean and unit variance.


b) The pre-processing module also has different other classes like StandardScaler that are used in scaling the data that is converting the mean to be zero and standard deviation to be unit on training data which can be further used in test data as well. Such class can also be used in building pipelines also.


The code implementation of standard scaler is shown below.


Code implementation of Standard scaler.


c) Scaling features to a range:  There are other methods also to scale data within a respective range that is a min values and max value. It mainly ranges between 0 and 1. You can use MinMaxScaler or MaxAbsScaler for scaling the data respectively.


d) Scaling sparse data: Centering the scatter data would result in knock-down of sparsity structure of data thus it is not advisable to do. MinMaxScaler and MaxAbs scaler were introduced to scale the sparse data. Scalar often accepts both CSR (Compressed Sparse Rows) & also CSC (Compressed Sparse Columns). If there is any other different sparse input then it is converted to Compressed Sparse Rows. To take care of the memory it is advisable to convert it in CSR and CSC representation.


e) Scaling data with presence of outliers : If the data has outliers in it then scaling that sort of data using mean and variance is not a good approach. You can use robust_scale & Robust_Scaler as drop in substitution.


10. Normalization of data 


It is the process of scaling each sample to have a unit standard. These types of techniques are much more effective if you are computing the similarity between different pairs of samples or to use a quadratic form like a dot product.This is the base of models used in text classifications. 


There is a function in the pre-processing module that is normalized which provides a good way to execute such operations on single array-like data by using l1 or l2 standards. Implementation of normalizing data using normalize is shown in the below image.


Code to normalize the data using Normalize function.


Pre-processing module also has another class that is called a normalizer that executes the similar operations using the transformer API. This class can also be used in the initial stage of pipeline. Implementation of normalizer is shown below in the image.


Code to normalize the data using normalizer function.


If you want to look for code implementation of EDA discussed above you can refer to the GitHub link here. It contains a jupyter file and both the datasets which are used. 




In this blog, I have tried to explain some operations which are done in exploratory data analysis to get a better understanding of the data. Techniques like missing values, standardization, normalization, shape, correlation between independent features also descriptive statistics of the data are discussed. There can be various other things that can be done in EDA to get a better understanding that is dependent on what type of data we have. EDA in textual data or image data is entirely different which will be covered in different blogs dedicated to the image or textual data. 



  • 4mytempuse

    Jul 11, 2020

    I’m a Professional and experienced Data Analyst. I will do EDA(Exploratory Data Analysis) for businesses to take important decision. contact me here :-