Machine learning has been one of the most popular topics of study in recent years, thanks to the availability of massive amounts of data and powerful computing hardware. It is natural to ask what it is about machine learning that has made it so popular.
Machine learning is an area of research that enables a machine to learn new things automatically based on its experiences and data without the need for human involvement.
Thanks to machine learning, computers no longer need every relationship in the data to be specified by hand in advance, for example as an explicit statistical model; they improve as they acquire experience.
Machine learning is a versatile field. It applies statistics and probability in different algorithms to study the available data and create innovative applications. Statistics is one of the core components of machine learning and data analytics. Two types of statistics are used in machine learning:
Descriptive statistics: These summarise data. For continuous data types like age, measures such as the mean and standard deviation are used, whereas frequency and percentage are helpful for categorical data types like gender.
Inferential statistics: Instead of gathering data on the full population, we collect a selection of data points, referred to as a sample, and draw conclusions about the population from it. Hypothesis testing, estimation of numerical characteristics, correlation analysis, and other techniques are used to draw these inferences.
In this article, we will learn about the different elements of statistics used in machine learning and the major differences between statistics and machine learning.
Statistical Terminologies used in ML
Let's begin with the fundamental concept of statistics. It is a field of mathematics concerned with the gathering, analysis, interpretation, and visualization of empirical data.
Statistics is used in machine learning to study the data and ask questions about it, to preprocess and clean the data, to select the right features, and to evaluate models and their predictions.
Statistical models are a class of mathematical models, usually described by equations that relate one or more variables, and they give an approximate representation of reality. The assumptions behind a statistical model usually specify a set of probability distributions, which is what distinguishes it from other mathematical (non-statistical) models and from machine learning models.
There are a few statistical concepts that are required to study descriptive statistics. Let us look at an example. The following table lists the characteristics of ten individuals who have applied for a home loan:
Characteristics of 10 loan applicants (source)
In a data set, elements are entities or subjects for which all the information is collected. The elements in the table above are the 10 applicants.
A characteristic of an element is called a variable. A variable takes different values for different elements. Variables are also known as attributes.
The different variables in the table are the marital status, mortgage, income, rank, year, and risk.
In the above example, variables like marital status, mortgage, rank and risk describe qualities of the elements. Hence they are called qualitative (categorical) variables.
The numerical (quantitative) variables in the above table are income and year.
Here, year is a discrete variable, since it takes only separate, countable values.
Income is a continuous variable, since it can take any value within a range.
(Must read: Types of data in statistics)
A population is the collection of all elements of interest in a given problem. A parameter is a characteristic of a population.
A subset of a population is called a sample. A characteristic of a sample is called a statistic.
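The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: it assumes, purely for the sake of the example, that the ten applicant incomes used in the mean calculation later in this article form the whole population, and it draws a random sample of four of them. The names `population`, `sample` and the sample size of 4 are choices made for this sketch.

```python
import random
import statistics

# Incomes of the 10 loan applicants used later in this article.
# Assumption for illustration only: these 10 values are the whole population.
population = [38000, 32000, 25000, 36000, 33000,
              24000, 25000, 48000, 32100, 32200]

# A parameter is a characteristic of the population, e.g. its mean.
population_mean = statistics.mean(population)

# A sample is a subset of the population; here, 4 incomes drawn at random.
random.seed(0)  # fixed seed so the sketch is repeatable
sample = random.sample(population, k=4)

# A statistic is a characteristic of the sample, e.g. the sample mean.
sample_mean = statistics.mean(sample)

print("Population mean (parameter):", population_mean)
print("Sample:", sample)
print("Sample mean (statistic):", sample_mean)
```

The sample mean will generally differ from the population mean; inferential statistics is concerned with how much can be concluded about the parameter from the statistic.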
The arithmetic average of a data set is called the mean of the set. To calculate the mean, we add all the values in the set and divide the sum by the total number of values.
The mean of the incomes of the 10 applicants is:
(38,000 + 32,000 + 25,000 + 36,000 + 33,000 + 24,000 + 25,000 + 48,000 + 32,100 + 32,200) / 10 = $32,530
When the data values are sorted in ascending order and their number is odd, the middle value is called the median. For an even number of values, the median is the mean of the two middle values.
Here, the number of elements is even, so after arranging the incomes in ascending order, the two middle values are $32,100 and $32,200. Hence, the median income is $32,150.
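The mean and median calculations above can be reproduced with a short Python sketch. The income values are the ones listed in the mean calculation; the variable names are chosen for this example, and the median is computed by hand to mirror the odd/even rule described above.

```python
# Incomes of the 10 applicants, as listed in the mean calculation above.
incomes = [38000, 32000, 25000, 36000, 33000,
           24000, 25000, 48000, 32100, 32200]

# Mean: sum of all values divided by the number of values.
mean_income = sum(incomes) / len(incomes)

# Median: middle value of the sorted data (odd count),
# or the mean of the two middle values (even count).
sorted_incomes = sorted(incomes)
n = len(sorted_incomes)
if n % 2 == 1:
    median_income = sorted_incomes[n // 2]
else:
    median_income = (sorted_incomes[n // 2 - 1] + sorted_incomes[n // 2]) / 2

print(mean_income)    # 32530.0
print(median_income)  # 32150.0
```

Python's standard `statistics` module offers `statistics.mean` and `statistics.median`, which give the same results for this data.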
The mode is the data value having the highest frequency of occurrence. Modes can exist for both quantitative and categorical variables, although only quantitative variables can have means or medians.
Among the income values listed in the mean calculation above, $25,000 occurs twice while every other value occurs only once, so the mode of the income is $25,000.
(Related reading: Overview of Mean, Median & Mode)
The difference between the maximum and minimum value of a variable is known as the Range of that variable.
Range of the income = Max income - Min income = $48,000 - $24,000 = $24,000
The mid-range is the average of the highest and the lowest numerical values in a data set.
Mid-range of the income = (Max income + Min income)/2
= ($48,000 + $24,000)/2
= $36,000
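As a quick check of the range and mid-range arithmetic, here is a minimal Python sketch using the same ten income values; the variable names are illustrative.

```python
# Incomes of the 10 applicants, as listed in the mean calculation above.
incomes = [38000, 32000, 25000, 36000, 33000,
           24000, 25000, 48000, 32100, 32200]

# Range: difference between the maximum and minimum values.
income_range = max(incomes) - min(incomes)

# Mid-range: average of the maximum and minimum values.
mid_range = (max(incomes) + min(incomes)) / 2

print(income_range)  # 24000
print(mid_range)     # 36000.0
```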
The variance of a population is defined as the average of the squared deviations from the mean, written as 𝜎².
The standard deviation of a set of numbers is the square root of the variance. It indicates how far the individual numbers typically deviate from the mean.
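The two definitions above can be sketched in Python as follows. The sketch treats the ten incomes as a full population and therefore divides by the number of values; if the data were a sample, the usual convention is to divide by one less than the number of values instead. Variable names are chosen for this example.

```python
import math

# Incomes of the 10 applicants, as listed in the mean calculation above.
incomes = [38000, 32000, 25000, 36000, 33000,
           24000, 25000, 48000, 32100, 32200]

n = len(incomes)
mean_income = sum(incomes) / n

# Population variance: average of the squared deviations from the mean.
variance = sum((x - mean_income) ** 2 for x in incomes) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(variance)           # 46824100.0
print(round(std_dev, 1))  # about 6842.8
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities, while `statistics.variance` and `statistics.stdev` use the sample convention.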
The pth percentile of a data set is the data value at or below which p percent of the values in the data set fall.
The 50th percentile is the median of the income. We have already calculated the median income, that is $32,150. 50% of the data lie at this value or below it.
In the given diagram, the first quartile (Q1) of a data set is the 25th percentile; the second quartile (Q2) is the median; and the third quartile (Q3) is the 75th percentile. The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1.
Graph of Percentile range and Interquartile range (source)
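A minimal sketch of the quartiles and the IQR for the income data, using `statistics.quantiles` from Python's standard library (available from Python 3.8). Note that several conventions exist for interpolating percentiles in small data sets, so the values of Q1 and Q3 printed here may differ slightly from those produced by other tools or textbooks; the median (Q2) does not depend on that choice for this data.

```python
import statistics

# Incomes of the 10 applicants, as listed in the mean calculation above.
incomes = [38000, 32000, 25000, 36000, 33000,
           24000, 25000, 48000, 32100, 32200]

# n=4 splits the data into four equal-probability groups and returns
# the three cut points: Q1 (25th), Q2 (50th) and Q3 (75th percentile).
q1, q2, q3 = statistics.quantiles(incomes, n=4)

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("IQR:", iqr)
```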
The Z-score for a specific data value indicates how many standard deviations the value lies above or below the mean: Z = (value - mean) / standard deviation.
A positive value of Z implies that the value is above the average, and a negative value implies that it is below the average.
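The Z-score can be sketched in Python as below. The helper `z_score` is a name made up for this example, and the sketch assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would change the scores slightly.

```python
import statistics

# Incomes of the 10 applicants, as listed in the mean calculation above.
incomes = [38000, 32000, 25000, 36000, 33000,
           24000, 25000, 48000, 32100, 32200]

mean_income = statistics.mean(incomes)
std_income = statistics.pstdev(incomes)  # assumption: population standard deviation


def z_score(value, mean, std):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std


for income in sorted(incomes):
    z = z_score(income, mean_income, std_income)
    position = "above" if z > 0 else "below"
    print(f"{income}: Z = {z:+.2f} ({position} the mean)")
```

In this data the highest income, $48,000, has a Z-score of roughly 2.26, meaning it lies a little over two standard deviations above the mean income.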
(Similar read: Z-test vs T-test)
Univariate Descriptive statistics & Bi-variate Descriptive Statistics
Patterns seen in univariate data can be described in a variety of ways, including measures of central tendency (mean, mode, and median) and measures of dispersion (range, variance, maximum, minimum, quartiles, and standard deviation).
On the other hand, Bi-variate analysis is the examination of two variables in order to determine the empirical relationship between them. The most common plots used to show bivariate data are scatter plots and box plots.
A scatter-plot is a popular graph for two continuous variables. Scatter plots are also known as correlation plots since they demonstrate how two variables are linked.
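The strength and direction of a linear relationship between two quantitative variables are expressed by the correlation coefficient r.
The correlation coefficient is: r = Σ(x - x̄)(y - ȳ) / ((n - 1) · s_x · s_y), where x̄ and ȳ are the means and s_x and s_y are the sample standard deviations of the two variables, and n is the number of data pairs. Its value always lies between -1 and 1.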
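A minimal Python sketch of the Pearson correlation coefficient r follows. The function name `pearson_r` and the two small data lists are hypothetical and exist only for illustration; they are not taken from the loan applicant table. The sketch divides the sum of the products of deviations by the square root of the product of the summed squared deviations, which is algebraically the same as dividing by (n - 1) times the two sample standard deviations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(round(r, 3))
```

A value of r close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little or no linear relationship.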
A box plot, often known as a box and whisker plot, is used to depict the distribution of data. A box-plot is often used when one variable is categorical and the other is continuous.
(Suggested blog: Descriptive Analysis overview)
Statistics vs Machine Learning
Even though machine learning and statistics are very similar, there are some differences between them. The major difference between them lies in the purpose.
Statistics is a branch of mathematics that works on data. A statistical model is a model of the data that may be used to infer something about the relationships in the data or to predict future values.
Machine learning is all about outcomes: it is like working in a firm where your worth is determined entirely by your performance.
Statistical modelling, on the other hand, is more concerned with discovering connections between variables and the importance of those associations, while also allowing for prediction.
For the statistical model, we fit a line that minimizes the mean squared error across all of the data, assuming that the data follow a linear relationship with some added random noise, which is generally taken to be Gaussian.
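The line-fitting idea can be sketched in plain Python as below. Everything about the data here is an assumption made for illustration: the points are generated synthetically from a made-up line (intercept 2.0, slope 0.5) with Gaussian noise, and the helper names `fit_line` and `mean_squared_error` are invented for this example. The fit uses the standard closed-form least squares solution for a single predictor.

```python
import random


def fit_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the mean squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between observed and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (assumption): a linear relationship plus Gaussian noise.
random.seed(0)
true_intercept, true_slope, noise_sd = 2.0, 0.5, 1.0
xs = list(range(50))
ys = [true_intercept + true_slope * x + random.gauss(0, noise_sd) for x in xs]

intercept, slope = fit_line(xs, ys)
print("Estimated intercept:", round(intercept, 3))
print("Estimated slope:", round(slope, 3))
print("MSE of fitted line:", round(mean_squared_error(xs, ys, intercept, slope), 3))
```

The estimated slope and intercept should land close to the values used to generate the data. A statistician would go on to examine the significance of these estimates, whereas a machine learning workflow would typically judge the same line by its prediction error on data it has not seen.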
(Also read: Importance of statistics for data science)
Probability and statistics are integral parts of machine learning. In this article we have discussed the statistical concepts that are required for statistical modelling, along with the basic differences between statistics and machine learning.
(Must catch: Importance of Statistics and Probability in Data Science)