How to do Exploratory Data Analysis Using Pandas Profiling?

  • Rohit Dwivedi
  • Jul 05, 2020
  • Machine Learning

Introduction 

 

If you come from a data science background, you already know the value of Exploratory Data Analysis (EDA). It is one of the most important steps in understanding a dataset.

 

I rarely see anyone build a predictive model without first doing EDA on the dataset. EDA helps you check many things about the data, such as missing values, outliers, mean, median, the distribution of the data, and correlations.

 

pandas has functions like describe(), info(), and isnull() that help in understanding the data. But calling them one by one and piecing the results together takes many lines of code. So the question arises: is there a short and efficient way to do EDA in fewer lines of code? The answer is yes, there is.
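For comparison, here is a minimal sketch of the manual approach. The small DataFrame and its column names are made up for illustration and are not taken from the Boston data.

```python
import io

import pandas as pd

# Hypothetical toy data, used only to illustrate the calls.
df = pd.DataFrame(
    {
        "rooms": [6.5, 6.4, None, 7.0],
        "age": [65.2, 78.9, 61.1, 45.8],
        "near_river": ["no", "yes", "no", None],
    }
)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()

# Column dtypes and non-null counts; info() prints, so capture it in a buffer.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# Number of missing values per column.
missing_per_column = df.isnull().sum()

print(summary)
print(info_text)
print(missing_per_column)
```

Each call answers one question, and outliers, distributions, and correlations would still need separate code.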

 

Pandas Profiling is a Python package that lets you do EDA with a single line of code. It produces a report in HTML format that helps you analyze the data quickly.


 

Statistics that are given by Pandas Profiling 

 

The pandas profiling package generates an HTML report containing different statistics about the data. For a dataset, it gives the following statistics:

 

  • Essentials: type, missing values, unique values,

  • Quantile statistics like min value, Q1, median, Q3, max value, interquartile range,

  • Descriptive statistics like mean, mode, standard deviation, etc.,

  • Most frequent values,

  • Histogram, and

  • Correlations that highlight highly correlated variables.


 

How to Install Pandas Profiling Package 

 

You can install the package with pip (pip install pandas-profiling) or with conda (conda install -c conda-forge pandas-profiling).



Let us quickly do EDA using pandas profiling on the Boston housing data. The code below imports the dataset.


Code illustration: 1 


After installing the pandas profiling package, I imported it along with the Boston dataset from the sklearn.datasets package.
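Since the original code is only available as an image, the following is a rough sketch of what the import step might look like, not the author's exact code. The helper names bunch_to_frame and load_boston_frame are hypothetical. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the second helper only works on older scikit-learn versions.

```python
import pandas as pd


def bunch_to_frame(bunch):
    """Turn a scikit-learn style dataset object into a DataFrame.

    Expects `bunch.data` (2-D array-like) and `bunch.feature_names`.
    """
    return pd.DataFrame(bunch.data, columns=list(bunch.feature_names))


def load_boston_frame():
    """Load the Boston features as a DataFrame (needs scikit-learn < 1.2)."""
    # Imported lazily because load_boston no longer exists in scikit-learn 1.2+.
    from sklearn.datasets import load_boston

    return bunch_to_frame(load_boston())
```

On a compatible scikit-learn version, load_boston_frame() should give the 506-row, 13-column feature table described in this post.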



Sample of Dataset


The image above shows a sample of the dataset, which has 506 rows and 13 columns.


Code illustration: 2


After this, I created an object named prof that holds the report and saved it to a file named “Boston_Prof_Report” in HTML format. The image above shows the code that generates the report and saves it as HTML.
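As the code itself is only shown as an image, here is a hedged sketch of the report step using the package's ProfileReport class and its to_file method. The helper names report_path and save_profile_report and the report title are assumptions made for this example. Newer releases of the package are published as ydata-profiling, in which case the import would be from ydata_profiling instead.

```python
from pathlib import Path


def report_path(name):
    """Return `name` as a path that ends in .html."""
    return Path(name).with_suffix(".html")


def save_profile_report(df, name="Boston_Prof_Report"):
    """Profile `df`, write the HTML report, and return the output path."""
    # Imported lazily so report_path() works even without the package installed.
    from pandas_profiling import ProfileReport

    # The title is an illustrative choice, not taken from the original post.
    prof = ProfileReport(df, title="Boston Profiling Report")
    output = report_path(name)
    prof.to_file(output_file=str(output))
    return output
```

Calling save_profile_report(df) on the Boston data frame would then write Boston_Prof_Report.html to the current directory.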


 

Let us see the sections that are present in the reports

 

  1. Overview 

 

  • It gives information about the dataset as a whole. The overview section consists of two parts: “Dataset info” and “Variable Types”.

  • Dataset info reports the number of variables (columns) and observations (rows), missing cells, duplicate rows, etc., whereas Variable types shows how many attributes are numerical, categorical, boolean, etc. The overview also lists warnings, for example flagging attributes that are highly correlated with others.

 

Check the below image that shows the overview of the Boston dataset.



The overview section of the generated report



 

  2. Variables

 

  • Unlike the overview section, which describes the dataset as a whole, this section gives information about every feature one by one. For each feature it shows the number of unique values and missing values, each with its percentage, along with the min and max values and the percentage of zeroes in that column.

 

Check the below image that shows the variables section for the Boston dataset.


Variable section of the report


  • If you click the Toggle details option in the far right corner, a new subsection opens, as shown in the image below.


 



Toggle Sub-section of the Variable section of the Report 


  • The Toggle details subsection shows Quantile Statistics (min, Q1, median, Q3, max, IQR, etc.) and Descriptive Statistics (skewness, variance, mean, sum, etc.).

  • The histogram visualizes the frequency distribution of the feature's values.

  • Common values gives the counts and frequency percentages of the most frequent values of the feature.

  • Extreme values lists the five smallest and the five largest values.
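To make these subsections concrete, the sketch below computes comparable figures by hand with pandas for a single column. The numbers are made up, and the report's own calculations may differ in detail (for instance in how quantiles are interpolated).

```python
import pandas as pd

# Hypothetical values standing in for one numeric feature.
values = pd.Series([1, 2, 2, 3, 4, 5, 5, 5, 8, 10], name="feature")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)

quantile_stats = {
    "min": values.min(),
    "Q1": q1,
    "median": values.median(),
    "Q3": q3,
    "max": values.max(),
    "IQR": q3 - q1,
}

descriptive_stats = {
    "mean": values.mean(),
    "variance": values.var(),
    "skewness": values.skew(),
    "sum": values.sum(),
}

# Common values: counts and frequency percentages, most frequent first.
common_counts = values.value_counts()
common_percent = values.value_counts(normalize=True) * 100

# Extreme values: the five smallest and the five largest entries.
smallest_five = values.nsmallest(5).tolist()
largest_five = values.nlargest(5).tolist()

print(quantile_stats)
print(descriptive_stats)
print(common_counts.head())
print(smallest_five, largest_five)
```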

 

  3. Correlation

 

This section of the report visualizes how the attributes are correlated with each other, using heatmaps drawn with the seaborn package. It helps you understand the relationships between the features.

 

You can also toggle between different correlation measures: Pearson, Spearman, Kendall, and Phik.
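The difference between these measures is easiest to see on a tiny example. The sketch below uses plain pandas rather than the report itself, with made-up data in which one column is a monotonic but non-linear function of the other. DataFrame.corr also accepts method="kendall", which relies on SciPy; Phik is not available in pandas.

```python
import pandas as pd

# Hypothetical data: the second column grows monotonically but not linearly.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["x_cubed"] = df["x"] ** 3

# Pearson measures linear association.
pearson = df.corr(method="pearson")

# Spearman measures monotonic association, based on ranks.
spearman = df.corr(method="spearman")

print(pearson)
print(spearman)
```

Here the Spearman coefficient is 1 because the ranks agree perfectly, while the Pearson coefficient is below 1 because the relationship is not a straight line.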

 

Check the below image that shows the correlation between the features of the Boston dataset.


Correlation Section of the Report


 

  4. Missing Values

 

The missing values section has two subsections: matrix and count.

The matrix graph shows where the missing values are, whereas the count graph shows the number of data points present in each attribute.
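The data behind these two views can be approximated in plain pandas, as in the sketch below. The toy DataFrame is made up, and the report draws these as graphs rather than tables.

```python
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, 4.0],
        "b": ["x", "y", None, None],
    }
)

# "Matrix" view: one boolean per cell, True where the value is missing.
missing_matrix = df.isnull()

# "Count" view: number of non-missing data points in each column.
present_count = df.count()

print(missing_matrix)
print(present_count)
```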

 

Check the below images that show the missing value section of the Boston dataset.



Missing value count section of the report 


 


Missing values matrix


  5. Sample Section

 

This section of the report shows the first 10 rows (the head) and the last 10 rows (the tail) of the dataset.

Check the below images that show the sample section of the Boston dataset.



Sample Section showing sub-section first rows of the data


 



Sample Section showing sub-section last rows of the data



 

Pandas Profiling Demerit

 

The main demerit of pandas profiling shows up with bigger datasets: as the data size increases, the time needed to generate the report increases as well.

 

You can work around this by generating the report from a subset of the rows. Where the data frame is passed in to build the report, pass df.sample(n=1000) instead of df; the report is then generated from only 1000 randomly sampled rows.
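A small sketch of this workaround follows. The helper name sample_for_profiling and the fixed seed are assumptions added for illustration. The length check is there because df.sample(n=1000) raises an error when the frame has fewer than 1000 rows.

```python
import pandas as pd


def sample_for_profiling(df, n=1000, seed=0):
    """Return at most `n` randomly chosen rows of `df`."""
    if len(df) <= n:
        # Small enough already: profile the full data frame.
        return df
    # random_state makes the sample, and so the report, reproducible.
    return df.sample(n=n, random_state=seed)


# Hypothetical large frame standing in for a big dataset.
big = pd.DataFrame({"value": range(5000)})
small = sample_for_profiling(big)
print(len(small))
```

The sampled frame can then be passed to the report in place of the full one. Keep in mind that statistics in the report then describe the sample, not the whole dataset.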

 

You can check the GitHub repository of Pandas Profiling here for more information. You can also check the report (HTML file) for the Boston dataset discussed above here.

 

 

Conclusion

 

Exploratory data analysis is one of the first steps in any data analysis work. It is important to understand the data first rather than building models on it directly.

 

Python packages like Pandas Profiling and SweetViz are used today to do EDA with fewer lines of code. Both packages generate reports that cover the key characteristics of the data.

 

In this blog, I have discussed how you can use the Pandas Profiling Python package to do exploratory data analysis on different datasets by generating reports that present an overview of the data, its variables, correlations, missing values, and a sample of the data. In the last section, I have discussed where doing EDA with such packages can cause trouble.