Regression is a statistical method for determining the strength and significance of the relationships between a dependent variable and a series of independent variables.
When multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple.
The two basic types of regression are:
Linear regression attempts to identify the relationship between two variables along a straight line. Put simply, this model is used to predict or show the relationship between a dependent variable and one independent variable.
When two or more independent variables are used in regression analysis, the model is no longer a simple linear regression; instead, it is a multiple regression model. This blog discusses multiple linear regression throughout.
What is Multiple Linear Regression (MLR)?
Multiple linear regression is an extension of simple linear regression that predicts the value of a dependent variable (sometimes called the outcome, target or criterion variable) on the basis of two or more independent variables (sometimes called the predictor, explanatory or regressor variables).
It is a method of statistical analysis that assesses the statistical significance of explanatory variables, that is, which potential explanatory variables are important predictors of a given response (target) variable.
For example, MLR can be used to predict exam performance on the basis of revision time, lecture attendance, gender and time anxiety. Similarly, daily cigarette consumption can be predicted from parameters such as duration of smoking, age at which smoking started, type of smoker, gender and income.
It can also be used to determine the impact of changes, i.e., to understand how the dependent variable changes when the independent variables change. For example, when reviewing a person’s health, it can show how much blood pressure rises or falls with a unit change in that person’s body mass index, keeping other factors constant.
“Multiple linear regression is a mathematical technique that deploys the relationship among multiple independent predictor variables and a single dependent outcome variable.”
The methodology also involves various means of determining which variables are important and can be used to build a regression model for prediction.
Examples of MLR
Multiple linear regression is applicable in examples such as the following:
It can be used to predict the expected crop yield from factors such as rainfall, temperature and fertilizer level, since the dependent variable (yield) is associated with these independent variables.
It can be used to find the connection between the GPA of a class of students and their number of study hours and height. Here the dependent variable is GPA, and the number of study hours and the students’ heights are the explanatory variables.
Regression analysis can be used to relate the salaries of a batch of executives in a company to their years of experience and age. Here, the dependent variable is the salary of the executives, and their experience and age are the independent variables.
It is widely used in anticipating trends and future values or events, for example, forecasting rain in the coming days or the price of gold or silver in the coming months.
It can identify the relationship between the distance covered by a cab driver (dependent variable) and the driver’s age and years of experience (independent variables).
Multiple Linear Regression Formula
The equation of multiple linear regression is expressed as:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$$
where
$y_i$ = dependent variable,
$x_{i1}, \dots, x_{ip}$ = explanatory variables; here we have $p$ predictor variables and $p+1$ total regression parameters,
$\beta_0$ = y-intercept, which is a constant term,
$\beta_1, \dots, \beta_p$ = slope coefficients for each explanatory variable, and
$\epsilon_i$ = residuals (the model’s error term), having a normal distribution with mean 0 and constant variance.
In multiple linear regression, the word linear signifies that the model is linear in the parameters $\beta_0$, $\beta_1$, $\beta_2$ and so on.
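To make the formula concrete, here is a minimal sketch of estimating the p + 1 regression parameters by least squares with NumPy. The data is synthetic and the "true" coefficients are made up purely for illustration; they are assumptions of this example, not values from any real study.

```python
import numpy as np

# Synthetic data, assumed purely for illustration: n = 50 observations, p = 2 predictors.
rng = np.random.default_rng(0)
n, p = 50, 2
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -3.0])  # hypothetical beta_0, beta_1, beta_2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Add a column of ones so the intercept beta_0 is estimated along with the slopes.
X_design = np.column_stack([np.ones(n), X])

# Least-squares estimate of the p + 1 regression parameters.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residuals: observed values minus fitted values.
residuals = y - X_design @ beta_hat
print(beta_hat)
```

Each slope in `beta_hat` is read as the expected change in the dependent variable for a one-unit change in that predictor, with the other predictors held constant.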
Assumptions for MLR
When choosing multiple regression to analyze data, part of the process is verifying that the data we want to investigate can actually be analyzed using multiple linear regression. This is done by checking the assumptions listed below:
Relationship between dependent and independent variables
The first assumption is that there should be a linear relationship between the dependent variable and each of the independent variables. The best way to check this is to create a scatter plot of each pair and inspect it for linearity.
If the relationship shown in the scatter plot is non-linear, then either a non-linear regression is run or the data is transformed, which can be done in statistical software such as SPSS.
The independent variables are not highly correlated with each other
The data must not exhibit multicollinearity, which occurs when the explanatory variables are highly correlated with each other. When the independent variables display multicollinearity, it becomes difficult to identify which variable contributes to the variance in the dependent variable. To test this assumption, the Variance Inflation Factor (VIF) is used.
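As a rough sketch of how the Variance Inflation Factor can be computed by hand: each predictor is regressed on all the other predictors, and its VIF is 1 / (1 - R²) of that auxiliary regression. The helper name `vif` and the synthetic data are assumptions of this example. A commonly quoted rule of thumb treats values above roughly 5 or 10 as a warning sign, but that threshold is a convention rather than a strict rule.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X (shape n x p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress predictor j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        # Note: a perfectly collinear predictor gives r2 == 1 and a division by zero.
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic predictors (illustrative only).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                  # unrelated to x1 and x2
vif_values = vif(np.column_stack([x1, x2, x3]))
print(vif_values)
```

In this setup the first two predictors should show large VIF values because they carry almost the same information, while the third should stay close to 1.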
The residual variance is constant
In multiple linear regression, it is assumed that the variance of the residuals is the same at every point of the linear model, a property known as homoscedasticity.
While examining the data, you should plot the standardized residuals against the predicted values to check whether the points are evenly spread across the whole range of predicted values.
To test this assumption, you can inspect this scatter plot, which any statistical software can produce for the entire model.
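The residual plot is the check described above. As a crude numeric stand-in for eyeballing it (not a formal test, and not a replacement for the plot), the sketch below sorts the residuals by fitted value and compares the residual variance in the upper half with that in the lower half. The function names and the synthetic data are assumptions of this example.

```python
import numpy as np

def variance_ratio(fitted, residuals):
    """Residual variance at high fitted values divided by that at low fitted values."""
    fitted = np.asarray(fitted, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    order = np.argsort(fitted)
    half = len(order) // 2
    low = residuals[order[:half]]
    high = residuals[order[half:]]
    return high.var() / low.var()

def fit_and_check(X, y):
    """Fit a least-squares model with intercept and return the variance ratio."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    fitted = X_design @ beta
    return variance_ratio(fitted, y - fitted)

# Synthetic data (illustrative only): two predictors, 400 observations.
rng = np.random.default_rng(2)
X = rng.uniform(1, 10, size=(400, 2))
noise = rng.normal(size=400)
signal = 2 + 3 * X[:, 0] + X[:, 1]

ratio_constant = fit_and_check(X, signal + noise)           # same spread everywhere
ratio_growing = fit_and_check(X, signal + noise * X[:, 0])  # spread grows with the first predictor
print(ratio_constant, ratio_growing)
```

A ratio close to 1 is what homoscedastic residuals would tend to give; a ratio far from 1 suggests the spread changes with the predicted value and that the plot deserves a closer look.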
Independence of observations
The MLR model assumes that all observations are independent of each other, or in other words, that the residual values are independent of one another. The Durbin-Watson statistic is commonly used to test this assumption.
This statistic takes values from 0 to 4: values between 0 and 2 indicate positive autocorrelation, values between 2 and 4 indicate negative autocorrelation, and the midpoint value of 2 indicates no autocorrelation.
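The Durbin-Watson statistic is simple enough to compute directly: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below uses two tiny hand-made residual sequences (assumed for illustration, with the residuals taken to be in observation order) to show both ends of the scale.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals given in observation order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Neighbouring residuals mostly share a sign: positive autocorrelation.
dw_positive = durbin_watson([1, 1, 1, -1, -1, -1])
print(dw_positive)  # about 0.67, below 2

# Neighbouring residuals alternate in sign: negative autocorrelation.
dw_negative = durbin_watson([1, -1, 1, -1])
print(dw_negative)  # 3.0, above 2
```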
Multivariate normality
Multivariate normality holds when the residuals (or errors) are normally distributed. To test this assumption, check how the residual values are distributed using methods such as a histogram with a superimposed normal curve or a normal probability plot.
5-Step Workflow of Multiple Linear Regression
You must analyze and prepare the data before diving into the model: the data is checked for errors, missing values are treated, outliers are inspected, and the validity of the data is established.
Besides that, the above-mentioned assumptions must be satisfied to validate the predictive accuracy of a regression model, and the data should be accurate before carrying out the steps below:
Workflow for multiple linear regression
Selecting the variables
Understanding the dataset and knowing the relevance, quality and adequacy of the data are imperative for picking the right variables. While building the regression model, you choose the predictor variables that have the most direct relationships with the selected response variable.
You want to gain the maximum information from the minimum number of variables, and you can adopt the methods below for the variable selection process:
- Opt for an automatic search procedure and let R, Python or another tool decide which variables are best; stepwise regression analysis can be used to do this.
- Adopt all-possible-regressions to check all subsets of the potential independent variables. With this method you choose numerical criteria for ranking the models, such as:
  - R² (coefficient of determination): models with larger R² values fit the data better, but R² never decreases as more predictors are added to the model, so it cannot by itself compare models of different sizes. R² takes values between 0 and 1, where 0 means that the outcome cannot be predicted by any of the independent variables and 1 means that the outcome can be predicted from the independent variables without error.
  - Adjusted R²: models with a larger adjusted R² are a better fit.
  - PRESSp (predicted sum of squares): the smaller the PRESSp, the better the predictive strength of the model.
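The three ranking criteria can be sketched in a few lines of NumPy. This is an illustrative implementation under the usual textbook definitions: adjusted R² is 1 - (1 - R²)(n - 1)/(n - p - 1), and PRESS is the sum of squared leave-one-out prediction errors, obtained here from the leverages rather than by refitting n times. The function name and the synthetic data are assumptions of this example.

```python
import numpy as np

def selection_criteria(X, y):
    """Return (R^2, adjusted R^2, PRESS) for a least-squares fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    sse = residuals @ residuals
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    # Leverages (diagonal of the hat matrix) from a thin QR decomposition.
    Q, _ = np.linalg.qr(X_design)
    leverage = np.sum(Q ** 2, axis=1)
    # PRESS: sum of squared leave-one-out prediction errors.
    press = np.sum((residuals / (1.0 - leverage)) ** 2)
    return r2, adj_r2, press

# Synthetic data (illustrative only).
rng = np.random.default_rng(3)
n = 60
X = rng.normal(size=(n, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)
X_plus_noise = np.column_stack([X, rng.normal(size=n)])  # adds an irrelevant predictor

r2_small, adj_small, press_small = selection_criteria(X, y)
r2_big, adj_big, press_big = selection_criteria(X_plus_noise, y)
print(r2_small, adj_small, press_small)
print(r2_big, adj_big, press_big)
```

Comparing the two printed rows shows why R² alone is not enough: adding the irrelevant predictor cannot lower R², whereas adjusted R² and PRESS are free to get worse.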
Refining the model
You can check the significance of the model and improve it by examining the following criteria:
- Global F-test: tests whether the independent variables, taken together, are significant in predicting the dependent variable.
- Adjusted R²: measures the proportion of the sample variation of the dependent variable that is explained by the model, after adjusting for the sample size and the number of parameters. It shows how well the predictive equation fits the data; the larger the adjusted R², the better the fit.
- Root mean square error (RMSE): estimates the standard deviation of the random errors. An interval of ±2 standard deviations gives an approximate measure of how accurately the dependent variable can be predicted from a particular subset of independent variables.
- Coefficient of variation (CV): the RMSE expressed as a percentage of the mean of the dependent variable. If a model has a CV of 10% or less, it is likely to give accurate predictions.
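As a sketch of how these refinement criteria can be computed, the function below returns the global F statistic, the RMSE and the CV for a least-squares fit. It assumes the standard definitions: F is the explained mean square divided by the error mean square, and RMSE uses n - p - 1 degrees of freedom. Turning F into a p-value needs the F distribution (for instance from SciPy), which is left out here. The function name and the synthetic data are assumptions of this example.

```python
import numpy as np

def refinement_metrics(X, y):
    """Return (global F statistic, RMSE, CV in percent) for a fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    sse = residuals @ residuals              # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)        # total variation
    df_error = n - p - 1
    f_stat = ((sst - sse) / p) / (sse / df_error)
    rmse = np.sqrt(sse / df_error)
    # CV is only meaningful when the mean of y is positive and not close to zero.
    cv_percent = 100.0 * rmse / y.mean()
    return f_stat, rmse, cv_percent

# Synthetic data (illustrative only): the noise standard deviation is 2.
rng = np.random.default_rng(4)
n = 80
X = rng.normal(size=(n, 2))
y = 50.0 + 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=2.0, size=n)

f_stat, rmse, cv_percent = refinement_metrics(X, y)
print(f_stat, rmse, cv_percent)
```

With this setup the RMSE should land near the noise standard deviation used to generate the data, which is what it is meant to estimate.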
Testing all assumptions of the model
In this step, all the assumptions of the linear regression model are tested, and all must be satisfied to validate the outcomes of multiple linear regression.
For example, the data should be homoscedastic and free of multicollinearity, and the residuals should be normally distributed. The dependent variable should be linearly related to the predictor variables, and there should be no autocorrelation among the residuals.
Addressing significant problems with the model
It often happens that one of the assumptions of the model is violated. In that case, you should fix or minimize the problem, for example:
- If the data is heteroscedastic, the dependent variable can be transformed.
- If the residuals are non-normal, you can check whether the problem is caused by large outliers and, if so, remove them to correct the non-normality of the residuals.
- If independent variables are correlated, you can try removing one of them.
- If a model is yielding errors due to missing values, those values can be treated, or dummy variables can be used to account for them.
Validating the model
The last step is to validate the regression model, for which you can consider the following methods:
- Checking the predicted values by collecting new data and testing it against the outcomes predicted by the model.
- Cross-validating the outcomes by splitting the data into randomly chosen samples, using one half of the data to estimate the model parameters and the other half to test the predictive outcomes of the regression model.
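The split-half validation described above can be sketched as follows: the rows are shuffled, the parameters are estimated on one half, and the prediction error is measured on the other half. The data is synthetic and the helper `add_intercept` is a name made up for this example.

```python
import numpy as np

# Synthetic data (illustrative only): the noise standard deviation is 1.
rng = np.random.default_rng(5)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Randomly split the rows into two halves.
indices = rng.permutation(n)
train_idx, test_idx = indices[: n // 2], indices[n // 2 :]

def add_intercept(features):
    """Prepend a column of ones for the intercept term."""
    return np.column_stack([np.ones(len(features)), features])

# Estimate the parameters on the first half only.
beta_hat, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Predict the held-out half and measure the prediction error.
predictions = add_intercept(X[test_idx]) @ beta_hat
holdout_rmse = np.sqrt(np.mean((y[test_idx] - predictions) ** 2))
print(holdout_rmse)
```

If the held-out RMSE is much larger than the RMSE on the half used for fitting, the model is probably overfitting and will not predict new data as well as its fit suggests.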
Multiple linear regression is an important algorithm in machine learning and one of the most useful approaches for tracking the relationship between continuous variables, and for determining the overall variation explained by the model and the relative contribution of each independent variable to the total variance. It is also simpler than many other statistical analysis methods.
By now, you have seen the theory behind it, along with its applications and workflow, and you may also have looked at various other regression algorithms and models.