If you come from a data science background, you probably know the strengths of Exploratory Data Analysis (EDA). It is one of the most important steps in understanding a dataset.
Hardly anyone builds a predictive model without first doing EDA on the dataset. EDA helps you check for missing values and outliers and examine the mean, median, distribution, and correlations of the data.
pandas functions such as describe(), info(), and isnull() help in understanding the data, but using them takes many lines of code. So the question arises: is there a short and efficient way to do EDA in fewer lines of code? The answer is yes.
Pandas Profiling is a Python package that lets you do EDA with a single line of code. It returns a report in HTML format that helps you analyze the data quickly.
For a dataset, the report gives the following statistics:
Essentials: type, missing values, unique values
Quantile statistics: min value, Q1, median, Q3, interquartile range
Descriptive statistics: mean, median, mode, standard deviation, etc.
Most frequent values
Correlations: highlights highly correlated variables
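For comparison, here is a minimal sketch of the manual route mentioned earlier, using describe(), info(), and isnull(). The data frame is a made-up toy (the column names are borrowed from the Boston data, the values are not), so treat it only as an illustration of how many separate calls are needed.

```python
import pandas as pd

# Toy data for illustration only; these are not real Boston values.
df = pd.DataFrame({
    "CRIM": [0.1, 0.2, None, 0.4],
    "RM": [6.5, 6.4, 7.1, 6.9],
})

# Count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()

# Number of missing values per column.
missing = df.isnull().sum()

# Column dtypes and non-null counts, printed to stdout.
df.info()

print(summary)
print(missing)
```

Each call answers one question, and most-frequent values or correlations would still need further calls.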
You can install the package with pip (pip install pandas-profiling) or conda (conda install -c conda-forge pandas-profiling), as shown in the image below.
Let us quickly do EDA using Pandas Profiling on the Boston housing data. You can use the code below to import the dataset.
Code illustration: 1
After installing the Pandas Profiling package, I imported it along with the Boston dataset from the sklearn.datasets package.
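Since the code itself is only shown as an image, here is a rough sketch of what loading the data could look like. It assumes a scikit-learn release older than 1.2, because load_boston was removed in 1.2. The helper names bunch_to_frame and load_boston_frame are my own and not part of any library.

```python
import pandas as pd


def bunch_to_frame(data, feature_names):
    """Wrap a 2-D feature array in a DataFrame with named columns."""
    return pd.DataFrame(data, columns=list(feature_names))


def load_boston_frame():
    """Load the Boston data as a DataFrame (needs scikit-learn < 1.2)."""
    # Imported lazily: load_boston was removed in scikit-learn 1.2.
    from sklearn.datasets import load_boston

    boston = load_boston()
    return bunch_to_frame(boston.data, boston.feature_names)
```

On a compatible scikit-learn version, df = load_boston_frame() followed by df.head() should give a sample like the one pictured.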
Sample of Dataset
The above image shows a sample of the dataset, which has 506 rows and 13 columns.
Code illustration: 2
After this, I created an object named prof that holds the report and saved it to a file named “Boston_Prof_Report” in HTML format. The above image shows the code to generate the report and save it as HTML.
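In text form, the report step might look like the sketch below. ProfileReport and its to_file method are the package's documented entry points. The wrapper function build_report and the report title are my own choices for illustration. Newer releases of the package are published as ydata_profiling, so the sketch falls back to that import name.

```python
def build_report(df, title="Boston Profiling Report",
                 output_file="Boston_Prof_Report.html"):
    """Profile a DataFrame and write the report to an HTML file."""
    # Imported lazily so the function can be defined without the package.
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        # Newer releases ship under this name with the same class.
        from ydata_profiling import ProfileReport

    prof = ProfileReport(df, title=title)
    prof.to_file(output_file)
    return prof
```

Calling build_report(df) on the Boston data frame would then write Boston_Prof_Report.html to the working directory.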
The overview section gives the overall dataset information. It consists of two parts: “Dataset info” and “Variable Types”.
Dataset info displays figures such as the number of columns and rows, missing cells, and duplicate rows, whereas Variable Types shows how many attributes are numerical, categorical, boolean, and so on. The overview also presents warnings that flag attributes that are highly correlated with others.
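To make the overview figures concrete, the following sketch computes similar counts by hand with plain pandas. It is not how the package computes them internally. The function name dataset_overview and the toy data are assumptions for illustration.

```python
import pandas as pd


def dataset_overview(df):
    """Return dataset-level counts similar to those in the overview section."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of columns per dtype name.
        "variable_types": df.dtypes.astype(str).value_counts().to_dict(),
    }


# Toy data for illustration only.
df = pd.DataFrame({
    "CRIM": [0.1, 0.2, 0.2, None],
    "CHAS": [0, 1, 1, 0],
})
overview = dataset_overview(df)
print(overview)
```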
The image below shows the overview of the Boston dataset.
The overview section of the generated report
Unlike the overview section, which describes the whole dataset, the variables section gives information about each feature one by one. It shows the number of unique values and missing values with their percentages, along with the min and max values and the percentage of zeros in that column.
The image below shows the variables section for the Boston dataset.
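A hand-rolled approximation of these per-variable figures is sketched below for numeric columns. The function name variable_summary and the toy values are made up for the example; the report itself is produced by the package, not by this code.

```python
import pandas as pd


def variable_summary(df):
    """Per-column counts and percentages, one row per variable."""
    n = len(df)
    summary = pd.DataFrame({
        "distinct": df.nunique(),      # unique non-missing values
        "missing": df.isna().sum(),
        "zeros": (df == 0).sum(),
        "min": df.min(),
        "max": df.max(),
    })
    # Express each count as a percentage of the number of rows.
    for col in ["distinct", "missing", "zeros"]:
        summary[col + "_pct"] = 100 * summary[col] / n
    return summary


# Toy data for illustration only.
df = pd.DataFrame({
    "ZN": [0.0, 0.0, 12.5, None],
    "RM": [6.5, 6.4, 7.1, 6.9],
})
summary = variable_summary(df)
print(summary)
```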
Variable section of the report
If you click the Toggle details option at the far right, a new section opens, as shown in the image below.
Toggle Sub-section of the Variable section of the Report
The Toggle details subsection shows Quantile Statistics, with details such as min, Q1, median, Q3, max, and IQR, and Descriptive Statistics, with details such as skewness, variance, mean, and sum.
The histogram visualizes the frequency distribution of the feature.
Common values gives the counts and frequency percentages of the most frequent values of the attribute.
Extreme values lists the five smallest and five largest values along with their counts.
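The four views described above can be roughly reproduced for a single numeric column with pandas alone. This is a sketch under my own assumptions: the function name variable_details, the bin count, and the toy series are all hypothetical, and the package's exact definitions may differ.

```python
import pandas as pd


def variable_details(series, top=5, bins=4):
    """Quantile, descriptive, histogram, common and extreme views of a column."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    counts = series.value_counts()
    return {
        "quantiles": {"min": series.min(), "Q1": q1, "median": median,
                      "Q3": q3, "max": series.max(), "IQR": q3 - q1},
        "descriptive": {"mean": series.mean(), "variance": series.var(),
                        "skewness": series.skew(), "sum": series.sum()},
        # Number of values falling into each of `bins` equal-width intervals.
        "histogram": series.value_counts(bins=bins, sort=False),
        # Most frequent values with their counts.
        "common_values": counts.head(top),
        # Smallest and largest distinct values with their counts.
        "extreme_min": counts.sort_index().head(top),
        "extreme_max": counts.sort_index(ascending=False).head(top),
    }


# Toy data for illustration only.
details = variable_details(pd.Series([1, 2, 2, 3, 4, 4, 4, 10], name="RAD"))
print(details["quantiles"])
```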
The correlations section of the report visualizes how attributes are correlated with each other, using heatmaps drawn with the seaborn package. It helps you understand the relationships between the features.
You can also toggle between different correlation measures: Pearson, Spearman, Kendall, and phik.
The image below shows the correlations between the features of the Boston dataset.
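If you want the numbers behind such a heatmap, pandas can compute the Pearson and Spearman matrices directly. The sketch below uses made-up values, so the coefficients say nothing about the real Boston data.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "LSTAT": [20.0, 15.0, 9.0, 4.0],
    "MEDV": [15.0, 20.0, 30.0, 50.0],
})

# Linear correlation between every pair of columns.
pearson = df.corr(method="pearson")

# Rank-based correlation; picks up any monotonic relationship.
spearman = df.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Passing method="kendall" works the same way but needs SciPy installed. Either matrix can be handed to seaborn.heatmap to draw a plot in the style of the report. As far as I know, phik is not available through DataFrame.corr.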
Correlation Section of the Report
The missing values section has two subsections: matrix and count.
The matrix plot shows where values are missing, whereas the count plot shows the number of data points present in each attribute.
The images below show the missing values section for the Boston dataset.
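The data behind both plots is easy to get from pandas, as the sketch below shows on a made-up data frame. It only produces the underlying tables, not the plots.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "CRIM": [0.1, None, 0.3, None],
    "RM": [6.5, 6.4, 7.1, 6.9],
})

# Count view: number of non-missing data points per column.
present_counts = df.notna().sum()

# Matrix view: True marks a missing cell, one row per record.
missing_matrix = df.isna()

print(present_counts)
print(missing_matrix)
```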
Missing value count section of the report
Missing values matrix
The sample section of the report presents the first 10 rows (the head) and the last 10 rows (the tail) of the dataset.
The images below show the sample section for the Boston dataset.
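In plain pandas the same two views come from head and tail. The data frame here is a made-up stand-in with a hypothetical row_id column.

```python
import pandas as pd

# Toy data for illustration only: 25 numbered rows.
df = pd.DataFrame({"row_id": range(25)})

first_rows = df.head(10)  # rows 0 to 9
last_rows = df.tail(10)   # rows 15 to 24

print(first_rows)
print(last_rows)
```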
Sample Section showing sub-section first rows of the data
Sample Section showing sub-section last rows of the data
The main drawback of Pandas Profiling shows up when generating reports for bigger datasets: as the data size increases, so does the time needed to generate the report.
One way around this is to generate the report from a subset of rows. Where you pass the data frame to be converted into a report, pass df.sample(n=1000) instead of df. This generates the report from only 1000 randomly chosen rows.
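A small sketch of that idea follows. The helper name sample_for_report and the fixed seed are my own additions; the seed just makes the sample repeatable. The guard avoids the error that df.sample raises when n is larger than the number of rows.

```python
import pandas as pd


def sample_for_report(df, n=1000, seed=0):
    """Return at most n randomly chosen rows, to keep report generation fast."""
    if len(df) <= n:
        return df
    return df.sample(n=n, random_state=seed)


# Toy data for illustration only: 5000 numbered rows.
big = pd.DataFrame({"x": range(5000)})
small = sample_for_report(big)
print(len(small))
```

The sampled frame is then what you would pass to ProfileReport in place of the full one.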
Exploratory Data Analysis is one of the first steps performed by anyone doing data analysis. It is important to understand the data first rather than building models on it directly.
Python packages like Pandas Profiling and SweetViz are used today to do EDA with fewer lines of code. Both packages generate reports that summarize the data in detail.
In this blog, I have discussed how you can use the Pandas Profiling Python package for exploratory data analysis by generating reports that present an overview of the data, variables, correlations, missing values, and a sample of the data. In the last section, I discussed where doing EDA with these packages can cause trouble.