Importance of Statistics and Probability in Data Science

  • Rohit Dwivedi
  • Jul 13, 2020

Data Scientist has become one of the hottest job titles of today's era; it has even been called the sexiest job of the 21st century. If you are planning to pursue a career in data science, then probability and statistics are among the first things you should be aware of. 

 

They are essential for getting into data science. It is often said that you cannot learn data science without a working knowledge of statistics and probability, yet many people take little interest in these topics. 

 

In this article, I will try to change that by introducing the basics of statistics and probability as they apply to data science.

 

Introduction

 

Two questions are often raised:

  • Why do we need to learn statistics and probability? 

  • What roles do probability and statistics play in the data science field?

Let us make the significance of these subjects logical and understandable. 

 

Making predictions and searching for structure in data are the most important parts of data science. Statistics and probability matter because they provide the tools for these analytical tasks. Read more about the importance of statistics in the Springer article here. 

 

“Data Scientist is a person who is better at statistics than any programmer and better at programming than any statistician.” -  Josh Wills 

 

In short, statistics is a set of principles used to extract information from data in order to make decisions. It unveils the secrets hidden in the data. 

 

Probability and statistics underpin many of the predictive algorithms used in machine learning. Among other things, they help us decide how reliable our data is.

 

 

What is the Central Limit Theorem? 

 

It is a theorem that plays a very important role in statistics. It states that if you draw sufficiently large random samples from a population with replacement, the distribution of the sample means will be approximately normal, with mean equal to the population mean (μ) and standard deviation σ/√n, regardless of the shape of the population's own distribution. 
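To see the theorem in action, here is a small Python sketch (the uniform population and the sample sizes are illustrative choices, not from the article): draw many samples with replacement from a decidedly non-normal population and watch their means cluster around the population mean.

```python
import random
import statistics

random.seed(0)

# A decidedly non-normal population: uniform on [0, 1), so the mean is about 0.5.
population = [random.random() for _ in range(100_000)]

# Draw many random samples *with replacement* and record each sample's mean.
sample_means = []
for _ in range(2_000):
    sample = random.choices(population, k=50)  # sampling with replacement
    sample_means.append(statistics.mean(sample))

# The sample means are approximately normally distributed around the
# population mean, with spread close to sigma / sqrt(n).
print(round(statistics.mean(sample_means), 2))  # close to 0.5
```

Plotting a histogram of `sample_means` would show the familiar bell shape even though the population itself is flat.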

 

Read more about the theorem here.

 

 

Different terms used in Statistics 

 

Anyone working in data science should know the terminology commonly used in statistics. Let us go through the key terms -

 

  • Population - The entire group or source from which the data is to be collected. 

  • Sample - A subset of the population. 

  • Variable - A data item, either a number or an attribute, that can be measured or counted.

  • Statistical Parameter - A quantity that characterizes a probability distribution, such as the mean, median, or mode.
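The first two terms can be illustrated with a short Python sketch (the height data here is simulated, purely for illustration):

```python
import random
import statistics

random.seed(42)

# Population: every value we could possibly observe (hypothetical heights in cm).
population = [random.gauss(170, 8) for _ in range(10_000)]

# Sample: a randomly chosen subset of the population.
sample = random.sample(population, k=200)

# The population mean is a statistical *parameter*; the sample mean estimates it.
print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample), 1))
```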


 

What is Statistical Analysis?

 

Statistical analysis is the science of exploring large collections of data to find hidden patterns and trends. It is applied to every sort of data, in research and across many industries, so as to reach decisions that can be modeled. There are mainly two types of statistical analysis - 

 

  1. Quantitative Analysis: The science of collecting and interpreting data with numbers and graphs to search for underlying hidden trends. 

 

  2. Qualitative Analysis: The type of statistical analysis that yields general information by making use of text and other forms of media rather than numbers. 

 

You can read more about the different categories of statistical analysis here.

 

Measures of Central Tendency

 

A measure of central tendency is a single value that describes a set of data by identifying the central position within it. Such measures are also called measures of central location and are classed as summary statistics. 

 

  • Mean - It is calculated by taking the sum of all the values in the dataset and dividing by the number of values. 

  • Median - It is the middle value of the dataset when the values are arranged in order of magnitude. It is often preferred over the mean because it is less influenced by outliers and by skewness in the data.

  • Mode - It is the most frequently occurring value in the dataset. 
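These three measures can be computed directly with Python's standard library (the dataset below is made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 10, 10, 12]

mean = sum(data) / len(data)       # sum of values / number of values
median = statistics.median(data)   # middle value of the sorted data
mode = statistics.mode(data)       # most frequently occurring value

print(mean, median, mode)  # mean ~6.89, median 7, mode 10
```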

 

What is Skewness?

 

Skewness is the asymmetry of a statistical distribution: the curve is distorted, or skewed, to the left or to the right. It specifies whether the data is concentrated on one side and tells us about the shape of the distribution. 


The image highlights Pearson mode skewness in the form of left-skewed and right-skewed distributions. (Pearson Mode Skewness, Source)


Skewness is divided into two parts - 

 

  • Positive Skewness: It occurs when mean > median > mode. The tail is skewed to the right in this case, i.e. the outliers lie to the right.

  • Negative Skewness: It occurs when mean < median < mode. The tail is skewed to the left, i.e. the outliers lie to the left.
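One simple way to quantify this is Pearson's second skewness coefficient, 3 * (mean - median) / standard deviation, which is positive for right-skewed data and negative for left-skewed data. A small sketch with illustrative data:

```python
import statistics

def pearson_skew(data):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.stdev(data)

right_skewed = [1, 2, 2, 3, 3, 4, 10, 15]   # long tail to the right
left_skewed = [-15, -10, 1, 2, 2, 3, 3, 4]  # long tail to the left

print(pearson_skew(right_skewed) > 0)  # positive skew
print(pearson_skew(left_skewed) < 0)   # negative skew
```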


 

What is Probability?

 

Probability is the foundation and language of most of statistics. It quantifies how likely a particular outcome is, something we weigh constantly in daily life. One cannot solve data science problems without a knowledge of probability, and it is considered an important factor in predictive analytics.  

 

There are mainly two types of hypotheses - 

 

  1. Null Hypothesis: The hypothesis that there is no notable difference between the described populations. 

  2. Alternative Hypothesis: The hypothesis that there is a notable difference.  

 

In statistical hypothesis testing, the probability value, also known as the p-value, is the probability of obtaining results at least as extreme as those actually observed, under the assumption that the null hypothesis is correct.

 

If the p-value <= 0.05, the null hypothesis is rejected. 

If the p-value > 0.05, we fail to reject the null hypothesis. 

 

But why do we need to accept or reject a hypothesis? 

If we fail to reject the null hypothesis, the independent feature is taken to have no influence on the prediction of the target variable. If the null hypothesis is rejected, it means that the feature will help in the prediction of the target variable.  

(You can read more about the p-value in statistics here.)
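As a concrete sketch, here is a two-sided z-test in Python. The numbers are invented for illustration, and a known population standard deviation is assumed so that `statistics.NormalDist` can supply the p-value:

```python
from statistics import NormalDist

# H0: the population mean is 100. We observe a sample mean of 103
# from n = 50 values, assuming a known population sigma of 10.
mu0, sigma, n, sample_mean = 100, 10, 50, 103

z = (sample_mean - mu0) / (sigma / n ** 0.5)  # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

if p_value <= 0.05:
    print("reject the null hypothesis")
else:
    print("fail to reject the null hypothesis")
```

Here the observed mean is far enough from 100 that the p-value falls below 0.05, so the null hypothesis is rejected.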

 

How to calculate p-value? 

 

The p-value is computed from the summary of the linear relationship between the target and the features, i.e. between the dependent and independent variables. 

 

Simple linear regression builds the relationship between these variables as a straight line, using the formula y = mx + b. 

 

A feature whose linear relationship with the target is strong yields p <= 0.05, so it is taken into consideration to predict y, whereas a feature whose relationship is weak has a p-value > 0.05 and is not taken into consideration to predict the target y. 
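This idea can be sketched end to end with a simulated dataset: fit y = mx + b by least squares, then test H0: slope = 0. The data, the noise level, and the normal approximation to the t distribution below are my own illustrative choices, not from the article.

```python
import math
import random
from statistics import NormalDist

random.seed(1)

# Simulated data where y really does depend linearly on x.
n = 100
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 * xi + 5.0 + random.gauss(0, 3) for xi in x]

# Least-squares fit of y = m*x + b.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b = y_bar - m * x_bar

# Standard error of the slope, from the residual variance.
residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
s2 = sum(r ** 2 for r in residuals) / (n - 2)
se_m = math.sqrt(s2 / sxx)

# Test H0: slope = 0. With n = 100 the t distribution is close to
# normal, so NormalDist is used here as an approximation.
t = m / se_m
p_value = 2 * (1 - NormalDist().cdf(abs(t)))

print(round(m, 2), p_value <= 0.05)  # slope near 2, H0 rejected
```

Because y was generated with a true slope of 2, the fitted slope is far from zero relative to its standard error, and the feature x would be kept as a predictor.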

 

Conclusion 

 

Statistics and probability are the foundation of data science. One should know their fundamental concepts in order to solve data science problems. They give you information about the data: how it is distributed, how the independent and dependent variables relate, and so on. 

 

In this blog, I have tried to give you a basic idea of statistics and probability. Of course, there is much more to explore when we talk about statistics and probability in data science. 

 

We have discussed their importance, the central limit theorem, statistical analysis, measures of central tendency, basic statistical terminology, and skewness. I have also introduced the hypotheses used in probability and how we reject, or fail to reject, the null hypothesis on the basis of a p-value.
