When I was thinking about what could be my next blog all about considering data in my mind so the very first thing that popped into my mind was the first major step that plays a very crucial role in terms of data analysis that is Exploratory data analysis.
Imagine you are going on a trip to some place. You always make a list of places you would visit. You check about places where you can stay. You make a complete itinerary. In short before going on a trip whatever investigations and planning you do is nothing but similar to exploratory data analysis which is done by data scientists.
Exploratory data analysis is a task performed by data scientists to get familiar with the data. All the initial tasks you do to understand your data well are known as EDA.
Let's take an example to know more about EDA. I have taken two datasets, one from the Kaggle website which is called the Pima Indian diabetes database and another from UCI Machine Learning Repository that is the Iris dataset. Let us do EDA on both the datasets.
The shape of the dataset is basically a representation of total rows and columns present in the dataset. You can explore .shape() function present in the pandas' package here. In the Pima diabetic dataset, we have 768 rows and 9 columns, similarly, in the iris dataset, we have around 150 rows and 5 columns.
In pandas, describe() function is used to view central tendency, mean, median, standard deviation, percentile & many other things to give you the idea about the data.
There is a function in the panda's package which allows you to check about the correlation between features which is pd.DataFrame.corr(). It calculates the correlation between features pairwise excluding null values. I have used this function to compute the correlation between features in the Pima dataset which is shown in the below image.
There is a function present in the pandas' package known as pd.dataFrame.info() which returns the data type of each column present in the dataset. Also, it tells you about null and not null values present. So, in our dataset, we have even int64 data types values and also float64 data type values.
Missing values in the data can be checked by using isnull() function present in pandas documentation. It returns the boolean values that are true and false. If you want to calculate how many missing values are present in each column in the data set you can make use of the function isnull().sum(). This function returns the total number of missing values in each column.
In our case, in both the data sets we did not get any of the missing values in any of the columns.
For various scenarios, while dealing with data you will come across real-world data which will have missing values like nan values, -, blanks. The basic approach to deal with such a situation is to drop/ remove the entire row or column which contains missing values. But dropping is not advisable because there will be a loss of data as well which can result in important parts of the data being removed. So, to deal with such things there are different methods used to impute the missing values.
There are two ways by which missing values can be imputed: first is called univariate imputation and the other one is multivariate imputation.
Univariate imputation is a type of imputation which imputes missing values considering only the non-missing values in that feature dimension. (e.g. impute.SimpleImputer).
On the other hand, a multivariate imputer imputes the missing values considering all available features dimension.(e.g. impute.IterativeImputer). You can read more about imputing missing values here.
Often it is seen that we do not have continuous values in our features. There are sometimes categorical values. And the system cannot understand such values so there is a need to convert them to continuous numerical values.
As seen in the below iris data frame we have classes as categorical features which are - ‘Iris-setosa’, ‘Iris-versicolor’, ‘Iris-virginica’.
There are several different techniques which are used to encode categorical values which are stated below:
a) LabelEncoder() - It is a function present in the scikit- learn library of python which is used to convert categorical values in numerical values.
Here we have imported LabelEncoder from sklearn.preprocessing followed by initialising of the object through which we will use the label encoder. We have made an object called “LE”. Then we have transformed our class column by using the LE.fit_transform function & printed the transformed class which is now [0,1,2]. It has given the values to Iris-setosa - 0, Iris-versicolor - 1 , Iris-virginica - 2.
b) get_dummies() - Converts categorical features into dummy variables.
c) OneHotEncoder() - Array-like of integers or strings is the required input for this encoder. The features are encoded using a one-hot encoding scheme. The result is a binary column for each category and reverts a sparse matrix.
Standardization of data is a major important step that is required for machine learning algorithms to give good results. There are different scaling functions present in the preprocessing module of sci-kit learn. If data is not scaled and is passed to the algorithm the result might be wrong due to wrongly distributed data.
It is usually seen that we ignore checking the shape of the distribution and change the data to be centered. That is done by removing the mean values of each column and then scaling it by dividing non-constant columns by their standard deviation.
Different functions that are used by algorithms to learn to assume that all the desired features are centered as zero and also their variance is in the same structure. If any of the features have a higher proportion than all other features it may dominate the function for learning algorithm and does not allow learning from other features as required.
a) Scale : present in the pre-processing module gives a fast and effective way to do this operation on a single array-like data:
X_scaled has now unit variance and zero mean as you can see in the below image.
b) The pre-processing module also has different other classes like StandardScaler that are used in scaling the data that is converting the mean to be zero and standard deviation to be unit on training data which can be further used in test data as well. Such class can also be used in building pipelines also.
The code implementation of standard scaler is shown below.
c) Scaling features to a range: There are other methods also to scale data within a respective range that is a min values and max value. It mainly ranges between 0 and 1. You can use MinMaxScaler or MaxAbsScaler for scaling the data respectively.
d) Scaling sparse data: Centering the scatter data would result in knock-down of sparsity structure of data thus it is not advisable to do. MinMaxScaler and MaxAbs scaler were introduced to scale the sparse data. Scalar often accepts both CSR (Compressed Sparse Rows) & also CSC (Compressed Sparse Columns). If there is any other different sparse input then it is converted to Compressed Sparse Rows. To take care of the memory it is advisable to convert it in CSR and CSC representation.
e) Scaling data with presence of outliers : If the data has outliers in it then scaling that sort of data using mean and variance is not a good approach. You can use robust_scale & Robust_Scaler as drop in substitution.
It is the process of scaling each sample to have a unit standard. These types of techniques are much more effective if you are computing the similarity between different pairs of samples or to use a quadratic form like a dot product.This is the base of models used in text classifications.
There is a function in the pre-processing module that is normalized which provides a good way to execute such operations on single array-like data by using l1 or l2 standards. Implementation of normalizing data using normalize is shown in the below image.
Pre-processing module also has another class that is called a normalizer that executes the similar operations using the transformer API. This class can also be used in the initial stage of pipeline. Implementation of normalizer is shown below in the image.
If you want to look for code implementation of EDA discussed above you can refer to the GitHub link here. It contains a jupyter file and both the datasets which are used.
In this blog, I have tried to explain some operations which are done in exploratory data analysis to get a better understanding of the data. Techniques like missing values, standardization, normalization, shape, correlation between independent features also descriptive statistics of the data are discussed. There can be various other things that can be done in EDA to get a better understanding that is dependent on what type of data we have. EDA in textual data or image data is entirely different which will be covered in different blogs dedicated to the image or textual data.
Reliance Jio and JioMart: Marketing Strategy, SWOT Analysis, and Working EcosystemREAD MORE
What is the OpenAI GPT-3?READ MORE
Introduction to Time Series Analysis: Time-Series Forecasting Machine learning Methods & ModelsREAD MORE
6 Major Branches of Artificial Intelligence (AI)READ MORE
Top 10 Big Data Technologies in 2020READ MORE
7 types of regression techniques you should know in Machine LearningREAD MORE
How is Artificial Intelligence (AI) Making TikTok Tick?READ MORE
7 Types of Activation Functions in Neural NetworkREAD MORE
8 Most Popular Business Analysis Techniques used by Business AnalystREAD MORE
Introduction to Logistic Regression - Sigmoid Function, Code ExplanationREAD MORE