As Josh Wills put it, “A data scientist is a person who is better at statistics than any programmer and better at programming than any statistician.”
The previous blog gave a brief overview of Statistical Data Analysis. Understanding statistics properly requires understanding one of its most important aspects: statistical modelling. Taking the discussion a step further, this blog covers the concept of statistical modelling, some general statistical terms, common statistical techniques, and a brief note on statistical modelling vs machine learning.
“Primarily, the main purpose of statistics is to explain and anticipate information.”
In simpler words, Statistical Modelling is a mathematically specified way to approximate the process that generates the data and to make forecasts from that approximation.
For example, describing a quantity through an average and a standard deviation is the simplest form of statistical modelling. Here, the statistical model is the mathematical expression being used.
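As a minimal sketch of this simplest kind of model, the snippet below summarizes a quantity by its mean and standard deviation. The temperature values are made up for illustration, and the "two standard deviations" range assumes the data are roughly normally distributed.

```python
import statistics

# Hypothetical sample: daily temperatures in degrees Celsius (made-up values).
temperatures = [21.0, 23.5, 22.0, 24.5, 20.0, 23.0]

# The two parameters of this very simple model.
mean = statistics.mean(temperatures)
std_dev = statistics.stdev(temperatures)  # sample standard deviation (divides by n - 1)

# Assuming approximate normality, most values are expected to fall
# within about two standard deviations of the mean.
low, high = mean - 2 * std_dev, mean + 2 * std_dev
print(f"mean={mean:.2f}, std={std_dev:.2f}, typical range=({low:.2f}, {high:.2f})")
```

The pair (mean, standard deviation) is the whole model here: it compresses six observations into two numbers that can be used to anticipate where a new observation is likely to fall.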
“Statistical Modelling is simply the method of implementing statistical analysis to a dataset where a Statistical Model is a mathematical representation of observed data.”
(Also read: Types of Statistical Analysis)
A statistical model combines inferences drawn from collected data with an understanding of the population, and is used to predict information in a generalized form. A statistical model can therefore be an equation or a visual representation of the information, grounded in thorough research conducted over the years.
"Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue." -R. A. Fisher
In other words, statistical models exist to identify relationships between two or more variables. Since there are different types of variables, there are correspondingly different statistical models. Some common types are the correlation test, regression model, analysis of variance (ANOVA), analysis of covariance (ANCOVA), chi-square test, etc.
When a data expert applies different statistical models to data in order to examine, understand and interpret the information more effectively, the expert can identify connections among variables, make predictions, and visualize data in a form that even non-analysts can use.
5 Statistical Techniques for Data Analysis
Linear Regression is a technique used to predict a target variable by finding the best linear relationship between the dependent and independent variables, where best fit means that the sum of the squared distances between the fitted line and the actual observations at each data point is as small as possible. There are two main types of linear regression:
Simple Linear Regression: It uses a single independent variable to predict a dependent variable by fitting the most suitable linear relationship.
Multiple Linear Regression: It uses more than one independent variable to predict the dependent variable by fitting the most suitable linear relationship.
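To make the idea of a best-fit line concrete, here is a minimal sketch of simple linear regression using the ordinary least squares formulas. The function name and the hours/scores data are hypothetical and chosen only for illustration.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: hours studied vs. exam score (made-up values).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

intercept, slope = fit_simple_linear_regression(hours, scores)
predicted = intercept + slope * 6.0  # prediction for 6 hours of study
print(f"score = {intercept:.2f} + {slope:.2f} * hours; prediction at 6h: {predicted:.2f}")
```

Multiple linear regression follows the same principle with several predictors; in practice it is usually solved with matrix methods from a numerical library rather than by hand.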
As a data mining technique, Classification assigns specific categories to a collection of data in order to make more accurate predictions and analysis. Types of classification technique include:
Discriminant Analysis: In this analysis, two or more clusters (populations) are known a priori, and new observations are assigned to one of the known clusters based on measured features. It models the distribution of the predictors “X” separately in each of the response classes and uses Bayes' theorem to turn these into estimates of the probability of each response class, given the value of “X”.
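The following is a rough sketch of discriminant analysis for a single predictor. It assumes each class is normally distributed with its own mean and a shared (pooled) variance, which is the linear discriminant analysis setting; the function names and the data are hypothetical.

```python
import math


def fit_lda_1d(x, labels):
    """Estimate class priors, class means and a pooled variance."""
    classes = sorted(set(labels))
    n = len(x)
    priors, means = {}, {}
    pooled_ss = 0.0
    for c in classes:
        xs = [xi for xi, li in zip(x, labels) if li == c]
        priors[c] = len(xs) / n
        means[c] = sum(xs) / len(xs)
        pooled_ss += sum((xi - means[c]) ** 2 for xi in xs)
    variance = pooled_ss / (n - len(classes))
    return priors, means, variance


def posterior(x_new, priors, means, variance):
    """Bayes' theorem: P(class | x) is proportional to prior * density of x in class."""
    # The Gaussian normalizing constant is the same for every class
    # (shared variance), so it cancels and is omitted here.
    scores = {
        c: priors[c] * math.exp(-((x_new - means[c]) ** 2) / (2 * variance))
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical measurements from two known groups, "a" and "b".
x = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = ["a", "a", "a", "b", "b", "b"]

priors, means, variance = fit_lda_1d(x, labels)
probs = posterior(1.8, priors, means, variance)
print(max(probs, key=probs.get), probs)
```

A new observation is assigned to the class with the highest posterior probability.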
(Click here to understand more about Linear Discriminant Analysis (LDA))
(Related reading: Binary and multiclass classification in machine learning)
Resampling is the approach of repeatedly drawing samples from the original data sample; it is a non-parametric method of statistical inference.
Based on the original data, it produces a new sampling distribution, using empirical methods rather than analytical ones to generate it. To understand resampling, the following techniques also need to be understood:
Bootstrapping: This technique is used for validating a predictive model and its performance, in ensemble methods, and for estimating the bias and variance of a model. It works by sampling with replacement from the original data and treats the “not selected” data points as test samples.
Cross-Validation: This technique is used to validate model performance and is carried out by dividing the training data into K parts. In each round, K-1 parts serve as the training set and the remaining part acts as the test set. The process is repeated K times, and the average of the K scores is taken as the performance estimate.
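Both resampling schemes can be sketched in a few lines of plain Python. The function names below are hypothetical, and the "model" being validated is deliberately trivial (it predicts the mean of the training values) so that the resampling logic stays in focus.

```python
import random


def bootstrap_sample(n, rng):
    """Sample n indices with replacement; unselected indices form the test set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def k_fold_indices(n, k):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate_mean_model(y, k):
    """Average held-out mean squared error of a predict-the-mean model."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)
        mse = sum((y[i] - prediction) ** 2 for i in test) / len(test)
        scores.append(mse)
    return sum(scores) / k


# Made-up target values.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

in_bag, out_of_bag = bootstrap_sample(len(y), random.Random(0))
print("bootstrap train indices:", in_bag, "test indices:", out_of_bag)
print("3-fold CV estimate of MSE:", cross_validate_mean_model(y, 3))
```

In real use the data are usually shuffled before being split into folds; that step is left out here to keep the sketch deterministic.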
Tree-based methods are among the most commonly used techniques for both regression and classification problems. They involve stratifying or segmenting the predictor space into a number of simple regions, and are also known as decision-tree methods because the splitting rules used to segment the predictor space can be summarized in a tree.
Moreover, the methods below grow multiple trees, which are then combined to produce more accurate predictions.
Bagging: It reduces the variance of prediction by generating additional training data from the original dataset, using “combinations with repetitions” (sampling with replacement) to produce multiple sets of the same size as the original data. Increasing the size of the training set in this way cannot improve the model's predictive power, but it can reduce the variance, tuning the prediction more closely to the expected outcome.
Boosting: This approach computes the outcome using several different models and then combines their results using a weighted average. By balancing the strengths and weaknesses of the individual models through different weighting formulas, good predictive performance can be achieved across a wide range of input data.
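As an illustration of bagging only (boosting is not shown), the sketch below fits one-split regression trees, often called stumps, on bootstrap samples and averages their predictions. All function names and data values are hypothetical, and a stump stands in for a full decision tree to keep the example short.

```python
import random


def fit_stump(x, y):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((v - left_mean) ** 2 for v in left)
        sse += sum((v - right_mean) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: no split possible, predict the mean.
        mean_y = sum(y) / len(y)
        return (float("inf"), mean_y, mean_y)
    return best[1:]


def predict_stump(stump, x_new):
    threshold, left_mean, right_mean = stump
    return left_mean if x_new <= threshold else right_mean


def fit_bagged_stumps(x, y, n_models, seed=0):
    """Fit each stump on a bootstrap sample of the same size as the data."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x_new):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x_new) for s in stumps) / len(stumps)


# Made-up data with a jump between x = 4 and x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

stumps = fit_bagged_stumps(x, y, n_models=25)
print(predict_bagged(stumps, 2.0), predict_bagged(stumps, 7.0))
```

Averaging over many resampled fits smooths out the quirks of any single stump, which is where the variance reduction comes from.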
(Must catch: Clustering methods and applications)
Unsupervised Learning techniques come into the picture when the groups or categories in the data are not known in advance. Clustering and association rules are common examples of unsupervised learning, in which sets of data are assembled into groups (categories) of closely related items.
Some unsupervised learning algorithms are discussed below;
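As one illustration of clustering, the sketch below runs k-means on one-dimensional data. K-means is chosen here only as a familiar example; the function name, the starting centers and the data values are all assumptions made for the illustration.

```python
def k_means_1d(values, centers, n_iter=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
    return centers, clusters


# Made-up measurements with two visible groups and no labels.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]

centers, clusters = k_means_1d(values, centers=[0.0, 6.0])
print("centers:", centers)
print("clusters:", clusters)
```

No group labels are supplied; the grouping emerges purely from how close the values are to one another.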
Common Statistical Terminologies
Below are basic definitions of some statistical terms that are often used during statistical data analysis.
Types of Variables
Dependent Variable, also known as Response Variable: The dependent variable is the one that an individual wants to explain, interpret or predict.
Explanatory Variable, also known as Independent Variable: The explanatory variable is the one that is used to explain, interpret or predict the dependent variable.
However, both dependent and explanatory variables can be single or multiple, and qualitative or quantitative.
(Must read: Types of data in statistics)
Model Parameters and Model Residuals
Since the dependent variable(s) is related to the explanatory variable(s) through a mathematical equation (the model), the equation involves some quantities that are termed Model Parameters.
For example, in a simple linear equation, besides the dependent and independent variables, the parameters are the intercept and the slope. Through the computation behind statistical modelling, the model parameters can be estimated and then used for further predictions of the dependent variable.
In addition, quantities that describe the Model Residuals are also parameters, for example, the variance of the residuals in the simple linear regression model. Model Residuals (or errors) are the distances between the data points and the model. They represent the portion of variability in the data that was not captured by the model. Correspondingly, the R² statistic is the portion of variability that the model does explain: the lower the residuals, the higher the R² statistic.
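The relationship between residuals and R² can be checked with a short calculation. The sketch below assumes a hypothetical fitted line y = 1 + 2x and made-up observations, and uses the standard definition R² = 1 - SS_res / SS_tot.

```python
def residuals_and_r_squared(y, y_pred):
    """Return the residuals and the R² statistic for a set of predictions."""
    mean_y = sum(y) / len(y)
    residuals = [yi - pi for yi, pi in zip(y, y_pred)]
    ss_res = sum(r ** 2 for r in residuals)        # variability not captured
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)   # total variability
    return residuals, 1 - ss_res / ss_tot


# Hypothetical model parameters: intercept 1.0 and slope 2.0.
intercept, slope = 1.0, 2.0

# Made-up observations.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]

y_pred = [intercept + slope * xi for xi in x]
residuals, r2 = residuals_and_r_squared(y, y_pred)
print("residuals:", [round(r, 2) for r in residuals])
print("R squared:", round(r2, 4))
```

Because the residuals here are small compared with the overall spread of y, the resulting R² is close to 1.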
(Recommended blog: What are Model Parameters and Evaluation Metrics?)
Statistical Modelling vs Machine Learning
Using statistics, a Statistical Model forms a representation of the data and performs an analysis to infer associations among different variables or to draw inferences.
Machine Learning, in turn, is the use of mathematical and/or statistical models to learn from data in order to make predictions.
Statistical Modelling Perspective
Statistical models incorporate distinct variables that are used for interpreting relationships among various sorts of variables. Specific techniques such as hypothesis testing, confidence intervals, etc. are used to draw insights that validate a defined condition/hypothesis.
For example, in regression analysis, a number of variables are used to identify the impact of the explanatory variables on the dependent variable.
Everything from sampling, probability spaces and assumptions to diagnostics is used for making inferences.
Statistical models are typically applied to a relatively small, specific set of data to draw inferences and understand the nature of the underlying data.
However, no statistical model is perfect: every model is an approximation of reality. For example, the fundamental assumptions of a model are often too strict to reflect real-world data.
(Suggested blog: Types of statistical data distributions models)
Machine Learning Perspective
According to the definition provided by Andrew Ng, machine learning is the science of enabling computers to learn and act without being explicitly programmed.
Since it lets computers learn in a way similar to humans, this learning can operate over enormous quantities of data, far beyond the human capacity to identify and understand trends and patterns. In terms of computational power and storage capacity, computers have outperformed humans.
(Most related: Top 10 Machine Learning Algorithms)
Broadly speaking, machine learning is used to make predictions, and a model's performance can be evaluated by how well it generalizes to new data that it has not seen before.
Cross-validation is carried out to check that the ML model neither overfits (memorizes the provided data) nor underfits (fails to capture the underlying pattern in it).
The data is cleaned and prepared so that the machine can process it easily, and little or no statistics goes into this process.
There are various types of techniques in machine learning to make predictions, such as “Classification”, “Regression”, “Clustering”, etc. Moreover, for ML models, error measures like RMSE, MSE, etc. can be used for regression problems, and counts of true positives and false positives can be used for classification problems. (Source)
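The evaluation measures mentioned above can be computed directly. The sketch below uses hypothetical helper names and made-up predictions; it shows MSE and RMSE for a regression task and the true/false positive and negative counts for a binary classification task.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


# Made-up regression targets and predictions.
reg_true = [3.0, 5.0, 7.0]
reg_pred = [2.0, 5.0, 9.0]
print("MSE:", mse(reg_true, reg_pred), "RMSE:", rmse(reg_true, reg_pred))

# Made-up binary labels and predictions (1 = positive class).
cls_true = [1, 0, 1, 1, 0, 0]
cls_pred = [1, 1, 0, 1, 0, 0]
tp, fp, tn, fn = confusion_counts(cls_true, cls_pred)
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```

Computed on data the model has not seen during training, these numbers indicate how well it generalizes.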
"The sciences do not try to explain, they hardly even try to interpret, they mainly make models. By a model is meant a mathematical construct which, with the addition of certain verbal interpretations, describes observed phenomena. The justification of such a mathematical construct is solely and precisely that it is expected to work" -John Von Neumann
There are ample instances where statistical modelling can be applied to solve complex problems. In this blog, you were introduced to statistical models and statistical modelling, along with the top five statistical techniques: linear regression, classification, resampling methods, tree-based methods, and unsupervised learning.
In addition, you have seen some general differences between statistical modelling and machine learning.