If you come from a data science background, you probably know the strengths of Exploratory Data Analysis (EDA). It is one of the most important steps in understanding the data.
Hardly anyone builds predictive models without first doing EDA on the dataset. EDA helps check many things about the data, such as missing values, outliers, the mean and median, the distribution of the data, and correlations.
Pandas functions such as describe(), info(), and isnull() help in understanding the data well. But using them for a full analysis takes many lines of code. So the question arises: is there a short and efficient way to do EDA in fewer lines of code? The answer is yes, there is.
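For comparison, here is roughly what the manual route looks like. This is only a sketch on a tiny made-up DataFrame (the column names and values are mine, not from the Boston data), using the three pandas functions mentioned above.

```python
import pandas as pd

# Tiny made-up dataset, for illustration only
nan = float("nan")
df = pd.DataFrame({
    "price": [24.0, 21.6, nan, 33.4, 36.2],
    "rooms": [6, 6, 7, 7, nan],
    "near_river": ["no", "no", "yes", "no", "no"],
})

# describe(): count, mean, std, min, quartiles and max of the numeric columns
summary = df.describe()
print(summary)

# info(): column dtypes and non-null counts, printed to stdout
df.info()

# isnull(): boolean mask of missing cells; summing it counts them per column
missing = df.isnull().sum()
print(missing)
```

Each call answers one question, so a full analysis quickly turns into a long notebook.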
Pandas Profiling is a Python package with which EDA can be done in a single line of code. It returns a report in HTML format that helps you analyze the data quickly.
Statistics that are given by Pandas Profiling
The Pandas Profiling package generates an HTML report containing different statistics about the data. For a dataset, it gives the following statistics:
Essentials: type, missing values, unique values
Quantile statistics: min value, Q1, median, Q3, max value, interquartile range
Descriptive statistics: mean, median, mode, standard deviation, etc.
Most frequent values
Correlations, highlighting highly correlated variables
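To make the list concrete, here is a small sketch of how a few of these statistics could be computed by hand with pandas. The data is made up and the 0.9 threshold for "highly correlated" is my own assumption for illustration, not necessarily the one the package uses.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "rooms": [4, 5, 6, 6, 7, 8],
    "area": [40, 52, 61, 63, 70, 85],
    "age": [30, 12, 12, 5, 12, 1],
})

col = df["rooms"]

# Quantile statistics: Q1, median, Q3 and the interquartile range
q1, median, q3 = col.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Descriptive statistics: mean, mode, standard deviation
mean, mode, std = col.mean(), col.mode().iloc[0], col.std()

# Most frequent value of another column
most_frequent_age = df["age"].value_counts().idxmax()

# Highly correlated pairs (assumed threshold: |r| > 0.9)
corr = df.corr()
high = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.9
]
print(iqr, mode, most_frequent_age, high)
```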
How to Install Pandas Profiling Package
You can install the package using pip or conda. If pip is already set up, the command below installs pandas-profiling for you.
pip install pandas-profiling
Let us quickly do EDA using pandas profiling on the Boston housing data. You can use the code below to import the dataset.
Code illustration: 1
After installing the pandas profiling package, I imported the Boston dataset from the sklearn.datasets package. Note that load_boston was removed in scikit-learn 1.2, so this import needs an older version of scikit-learn.
from sklearn.datasets import load_boston
boston = load_boston()
The shape of the Sample of Dataset
The image above shows a sample of the dataset, which has 506 rows and 13 columns.
from pandas_profiling import ProfileReport
Code illustration: 2
After this, I created an object named profile that holds the report and saved it to a file named “Boston_Prof_Report” in HTML format. The image above shows the code that generates the report and saves it in HTML format.
Let us see the sections present in the report
The code below generates a report on our dataset
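import pandas as pd
# Build a DataFrame from the feature matrix and label its columns
bos = pd.DataFrame(boston.data, columns=boston.feature_names)
# Create the report object; explorative=True switches on the explorative mode
profile = ProfileReport(bos, title='Pandas Profiling Report', explorative=True)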
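The step that saves the report to “Boston_Prof_Report” is described in the text but not shown in the snippet. A minimal sketch of it could look like the following. The helper name save_profile is my own, and the fallback import is an assumption for newer installations, where the package is published as ydata-profiling.

```python
from pathlib import Path


def save_profile(df, out_path="Boston_Prof_Report.html",
                 title="Pandas Profiling Report"):
    """Build a profiling report for df and write it to out_path as HTML.

    Hypothetical helper for illustration; not part of the package itself.
    """
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        # Assumption: a newer install, where the package was renamed
        from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title=title, explorative=True)
    profile.to_file(out_path)  # writes a standalone HTML file
    return Path(out_path)
```

Calling save_profile(bos) would then write Boston_Prof_Report.html to the working directory, ready to open in a browser.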
The overview section gives the overall dataset information. It consists of two parts: “Dataset info” and “Variable Types”.
Dataset info displays figures such as the number of columns and rows, missing cells, and duplicate rows, whereas Variable Types shows how many attributes are numerical, categorical, boolean, and so on. The overview also presents warnings that flag attributes highly correlated with others.
The image below shows the overview of the Boston dataset.
The overview section of the generated report
The variables section gives information about each feature one by one, unlike the overview section, which presents information about the whole dataset. For each feature it shows the number of unique values and missing values with their percentages, followed by the min and max values and the percentage of zeroes in that column.
The image below shows the variables section for the Boston dataset.
Variable section of the report
Toggle Sub-section of the Variable section of the Report
The Toggle details subsection shows Quantile Statistics, with details like min, Q1, median, Q3, max, and IQR, and Descriptive Statistics, with details like skewness, variance, mean, and sum.
The histogram visualizes the frequency distribution of the feature’s values.
Common values gives the counts and frequency percentages of the most frequent values.
Extreme values lists the 5 smallest and 5 largest values with their counts.
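If you want to reproduce some of these per-variable figures yourself, the sketch below shows one way to do it with pandas on a made-up column. It approximates what the report displays; I am not claiming this is how the package computes it internally.

```python
import pandas as pd

# Made-up feature column, for illustration only
s = pd.Series([0, 0, 3, 3, 3, 7, 9, 12, 12, 20], name="feature")

# Unique values and percentage of zeroes
distinct = s.nunique()
distinct_pct = 100 * distinct / len(s)
zeros_pct = 100 * (s == 0).sum() / len(s)

# Common values: counts and frequency percentages, most frequent first
common = s.value_counts()
common_pct = 100 * common / len(s)

# Extreme values: the 5 smallest and 5 largest distinct values with counts
by_value = s.value_counts().sort_index()
lowest, highest = by_value.head(5), by_value.tail(5)

# A few descriptive statistics from the toggle view
skewness, variance, total = s.skew(), s.var(), s.sum()
print(distinct, zeros_pct, common.idxmax())
```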
The correlations section of the report visualizes how attributes are correlated with each other, using heatmaps drawn with the seaborn package. It helps you understand the relationships between the features.
You can also toggle between different correlation measures: Pearson, Spearman, Kendall, and Phik.
The image below shows the correlation between the features of the Boston dataset.
Correlation Section of the Report
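To see why it is worth toggling between measures, here is a small sketch on made-up data that compares Pearson and Spearman using pandas. As far as I know, method="kendall" is also accepted by pandas but relies on SciPy being installed, so it is left out here.

```python
import pandas as pd

# Made-up data: y grows with x, but not linearly
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
    "z": [5, 3, 4, 1, 2],
})

# Pearson measures linear association
pearson = df.corr(method="pearson")

# Spearman works on ranks, so any monotonic relationship scores 1
spearman = df.corr(method="spearman")

print(pearson.loc["x", "y"], spearman.loc["x", "y"])
```

Pearson stays a little below 1 for x and y because the relationship is curved, while Spearman reports a perfect monotonic relationship.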
The missing values section has two subsections: matrix and count.
The matrix graph shows where the missing values fall, whereas the count graph presents the number of data points present in each attribute.
The images below show the missing values section for the Boston dataset.
Missing value count section of the report
Missing values matrix
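The numbers behind these two graphs are easy to get from pandas directly. A minimal sketch on made-up data follows; it only computes the values and does not draw the plots.

```python
import pandas as pd

# Made-up data with a few gaps, for illustration only
df = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Count view: number of non-missing data points in each column
present = df.notna().sum()

# Matrix view: True marks a missing cell, row by row
matrix = df.isna()

# Share of missing cells per column, in percent
missing_pct = 100 * matrix.mean()
print(present, missing_pct, sep="\n")
```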
The sample section of the report presents the first 10 rows (the head) and the last 10 rows (the tail) of the dataset.
The images below show the sample section for the Boston dataset.
Sample Section showing sub-section first rows of the data
Sample Section showing sub-section last rows of the data
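The same two views are available in pandas itself. A quick sketch on a made-up 50-row frame:

```python
import pandas as pd

# Made-up frame with 50 numbered rows, for illustration only
df = pd.DataFrame({"row": range(1, 51)})

first_rows = df.head(10)  # first 10 rows, as in the "first rows" view
last_rows = df.tail(10)   # last 10 rows, as in the "last rows" view
print(first_rows, last_rows, sep="\n")
```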
Pandas Profiling Demerit
The main demerit of pandas profiling is that it is slow on bigger datasets: as the data size increases, so does the time to generate the report.
One workaround is to generate the report on a subset of rows. Where you pass the DataFrame to be profiled, pass df.sample(n=1000) instead of df. The report is then built from only 1000 randomly chosen rows.
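Here is a minimal sketch of the sampling step on a made-up large frame. The random_state argument is my addition so that the same rows are drawn on every run; the resulting subset is what you would pass on to ProfileReport.

```python
import pandas as pd

# Made-up large frame, for illustration only
n_rows = 100_000
big = pd.DataFrame({
    "id": range(n_rows),
    "value": [i % 7 for i in range(n_rows)],
})

# Draw 1000 random rows without replacement; fix the seed for repeatability
subset = big.sample(n=1000, random_state=42)

print(big.shape, subset.shape)
```

Keep in mind that the statistics in the report then describe the sample, not the full dataset.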
See the GitHub page of Pandas Profiling here for more information. You can also check the report (HTML file) for the Boston dataset discussed above here.
Exploratory Data Analysis is one of the first steps performed by anyone doing data analysis. It is important to understand the data first rather than building models on it directly.
Python packages like Pandas Profiling and SweetViz are used today to do EDA with fewer lines of code. Both packages generate reports that cover the data in detail.
In this blog, I have discussed how you can use the Pandas Profiling Python package to do exploratory data analysis on different datasets by generating reports that present an overview of the data, its variables, correlations, missing values, and a sample of the data. In the last section, I have discussed where doing EDA with these packages can cause trouble.