This is an era of computers, machines, and artificial intelligence. The field of Data Science is expanding so quickly that however many skills you add to your arsenal, there is always more to learn. Whenever you talk about Data Science, Statistics is a natural area of interest, as almost every Machine Learning technique you use is based on some core statistical concept.
One of the fundamental statistical concepts used in machine learning is Linear Regression. In this article, we will explore the capabilities of Python together while fitting and running a linear regression model in it.
What is Regression?
Regression is a statistical technique that allows us to find relationships among several variables. It lets us estimate the impact of one or more variables on another. For example, you can observe all 12th-grade students in a college and figure out which variables affect their final grades.
Variables on which final grades depend could be the number of hours of study, the number of hours of sleep, the environment a student lives in, the number of hours spent playing, the number of lectures a student skips, etc.
This is a classic regression problem where each student is an observation and factors such as the number of study hours, number of sleep hours, number of lectures skipped, etc. are the inputs used to explain the outcome.
Because their values are not determined by the outcome being modeled, these factors are known as independent variables or regressors. Ideally, they are also not strongly correlated with one another. On the other hand, final grades depend on all these variables, and hence the final grade is considered the dependent variable or regressand.
What is Linear Regression?
Linear regression is a statistical regression technique in which we have one regressand (dependent variable) and one or more regressors. The relationship between them is modeled as linear, hence the name linear regression. If we have one regressor, it is simple linear regression; if we have more than one, it is known as multiple linear regression.
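To make the distinction between simple and multiple linear regression concrete, here is a minimal sketch that uses NumPy's least-squares solver on made-up, noise-free data. The variable names (hours_study, hours_sleep, grade), the helper fit_linear, and the weights 40, 5, and 2 are illustrative assumptions, not values from any real dataset.

```python
import numpy as np

# Hypothetical data: two regressors and a target built from known weights.
hours_study = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
hours_sleep = np.array([8.0, 6.0, 7.0, 5.0, 9.0, 6.0])
grade = 40.0 + 5.0 * hours_study + 2.0 * hours_sleep  # no noise, for clarity


def fit_linear(regressors, target):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k]."""
    # Prepend a column of ones so the first weight acts as the intercept.
    design = np.column_stack([np.ones(len(target))] + list(regressors))
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights


# Simple linear regression: one regressor.
simple_weights = fit_linear([hours_study], grade)

# Multiple linear regression: more than one regressor.
multiple_weights = fit_linear([hours_study, hours_sleep], grade)

print("simple:  ", simple_weights)
print("multiple:", multiple_weights)
```

The only difference between the two fits is the number of regressor columns passed in. With both regressors included, the fit should recover the weights used to build the target.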
The dataset we are going to use in this example is named “Auto MPG Data Set” which is taken from the StatLib library that is maintained by Carnegie Mellon University. The dataset provides technical aspects and specifications of cars.
The data is designed in such a way that we can predict the city-cycle fuel consumption in miles per gallon based on three multi-valued discrete variables and five continuous variables. The data consists of 398 observations with 9 variables.
We have a csv copy of this data, which is very commonly used for regression problems. See the screenshot below:
The Auto-MPG Dataset
Now, let us actually set the regression model up.
There are several libraries we are going to import and use while running a regression model in Python and fitting the regression line to the points. We will import pandas, numpy, the metrics module from sklearn, LinearRegression from sklearn.linear_model, and r2_score from sklearn.metrics. See the code below for your reference.
# importing libraries
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
This code produces no output; it only imports the packages so that we can use them while building our code.
Step 1: Reading the Dataset
We can use the read_csv() function from pandas to read the mpg dataset, which we have in csv format. The code is given below. The file is stored in the folder “C:\Users\lsalunkhe”.
#reading dataset into the python environment
mpg_df = pd.read_csv(r"C:\Users\lsalunkhe\mpg_data.csv")
Note the “r” before the file path inside the function call. It stands for “raw” and tells Python to treat the backslashes as literal characters that are part of the file path. Without it, Python would interpret a sequence such as “\U” as the start of an escape sequence and raise a syntax error.
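The effect of the raw prefix can be checked without reading any file. The sketch below compares the raw string with two equivalent spellings of the same path; the variable names are only illustrative.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept literally.
raw_path = r"C:\Users\lsalunkhe\mpg_data.csv"

# Equivalent regular string: every backslash is escaped by doubling it.
escaped_path = "C:\\Users\\lsalunkhe\\mpg_data.csv"

# Forward slashes describe the same Windows path and need no escaping.
forward_path = "C:/Users/lsalunkhe/mpg_data.csv"

print(raw_path == escaped_path)
print(PureWindowsPath(raw_path) == PureWindowsPath(forward_path))
print(PureWindowsPath(raw_path).name)
```

Any of the three spellings can be passed to pd.read_csv(); the raw string is simply the least error-prone way to paste a Windows path.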
The dataframe should look like the one shown below:
Reading mpg dataset into python as a data frame
Step 2: Setting the target and Regressors up
The target variable for us is mpg. To keep things simple, we will fit a single-variable (simple) linear regression. Our regressor is displacement, as we are interested in checking how much displacement affects mpg. We separate these two variables from the dataframe so that we can work on them.
#Setting target and regressor variables separate from dataframe
part_df = mpg_df[["mpg", "displacement"]].copy()  #copy, so a column can be added later without a pandas warning
#Setting target and regression variables up
y = mpg_df.mpg
X = part_df[["displacement"]]
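The single and double brackets above are deliberate: scikit-learn estimators expect the regressors as a two-dimensional table (rows by features), while the target can be one-dimensional. The sketch below shows the difference on a small made-up dataframe; toy_df and its four rows are hypothetical stand-ins for mpg_df.

```python
import pandas as pd

# Hypothetical stand-in for mpg_df with a few made-up rows.
toy_df = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 27.0],
    "displacement": [307.0, 350.0, 79.0, 97.0],
})

y = toy_df["mpg"]              # single brackets -> 1-D Series
X = toy_df[["displacement"]]   # double brackets -> 2-D DataFrame with one column

print(type(y).__name__, y.shape)
print(type(X).__name__, X.shape)
```

Passing a one-dimensional displacement column as X would make fit() complain about the input shape, which is why the double brackets are kept even with a single regressor.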
Step 3: Fitting Linear Regression Model and Predicting Results
Now comes the important step: we need to see the impact of displacement on mpg. To observe this, we fit a regression model. We will use the LinearRegression class from the sklearn.linear_model module to fit a model on this data. After that, we will make predictions based on the fitted model. See the code below:
#Fitting simple Linear Regression Model
linr_model = LinearRegression().fit(X, y)
This code produces no output either.
The important result of model fitting is the intercept and the coefficient. Together they give us the equation of the regression line, which is then used to make predictions.
#Model Fitting Results
print(linr_model.intercept_, linr_model.coef_)
The equation of linear regression is as below:
y = β₀ + β₁X
y - is the target variable
β₀ - is the intercept (weight predicted by the model). It is often referred to as the mean value of the target variable (y) when the regressor is zero (X = 0).
β₁ - is the regression coefficient or slope (again a predicted weight by model).
X - is the regressor that helps in predicting the target.
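For a single regressor, the intercept and slope of an ordinary least squares fit (the method scikit-learn's LinearRegression uses) have a closed form: the slope is the covariance of X and y divided by the variance of X, and the intercept makes the line pass through the point of means. The sketch below applies these formulas to made-up, noise-free data generated from y = 2 + 3x; the data and variable names are assumptions for illustration.

```python
import numpy as np

# Made-up, noise-free data generated from y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance of x and y divided by the variance of x.
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: forces the line through the point of means.
beta_0 = y_mean - beta_1 * x_mean

print(beta_0, beta_1)
```

Because the data has no noise, the formulas recover the intercept of 2 and the slope of 3 that were used to generate it.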
Printing the fitted model's intercept and coefficient gives the values shown below:
Intercept and slope values
Now, if you would like to use these slope and intercept values to build the linear regression equation, it would be as shown below:
mpg = 35.1748 - 0.0603 * displacement
The model makes all of its predictions from this equation.
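As a quick sanity check, the equation can be applied by hand. The sketch below hard-codes the intercept and slope reported above; the function name predict_mpg and the three displacement values are hypothetical and chosen only for illustration.

```python
# Coefficients as reported above for the fitted model.
INTERCEPT = 35.1748
SLOPE = -0.0603


def predict_mpg(displacement):
    """Apply mpg = intercept + slope * displacement."""
    return INTERCEPT + SLOPE * displacement


# Hypothetical displacement values, chosen only for illustration.
for displacement in (100, 200, 300):
    print(displacement, round(predict_mpg(displacement), 4))
```

Each additional unit of displacement lowers the predicted mpg by about 0.06, so larger engines get lower predictions.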
Let us see the code below which predicts the mpg based on displacement.
#Making Predictions based on the coefficient and intercept
linr_model.predict(part_df[["displacement"]])
Here, we have called the predict() method of the fitted model on the displacement variable from the partial dataframe, and the model predicts an mpg value for each value of displacement using the equation above. The result is an array, as shown in the output below:
An array of predicted mpg for corresponding displacement values
Step 4: Looking at variation Explained by the Regressor
An important measure of how well your model fits the data is the R-squared value. It is a statistical measure of how much of the variability in the dependent variable is explained by the independent variable(s). It is also known as the coefficient of determination.
There is no fixed threshold for R-squared. But generally, the higher the R-squared value, the better the fitted model. Let us compute the R-squared value and see how well the model fits.
r2_score(
    y_true = part_df.mpg,
    y_pred = linr_model.predict(part_df[["displacement"]])
)
Here, the r2_score() is a function that gives you the coefficient of determination value. The actual and predicted values are set under the y_true and y_pred arguments. Now see the output below to figure out how good your model is.
Coefficient of determination to measure the goodness of fit of the model
Here, the coefficient of determination is 0.6467, which means the regressor (displacement) explains 64.67% (almost 65%) of the variability in the target (mpg). The remaining 35% or so comes from factors the model does not include. For a model with a single regressor, this is a reasonably good fit.
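To see what this number measures, the coefficient of determination can be computed by hand as one minus the ratio of the unexplained variation to the total variation, which is the same definition r2_score() uses. The sketch below does this with NumPy on made-up values; the helper name r_squared and the sample numbers are assumptions for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted values, for illustration only.
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

print(round(r_squared(actual, predicted), 4))
```

A model that predicts every value exactly has no unexplained variation and scores 1, while a model that always predicts the mean of the target scores 0.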
Step 5: Plotting the Relationship Between vehicle mpg and the displacement
We are going to use the plotnine library to generate a custom scatter plot of mpg vs displacement with the regression line drawn on it. This chart shows the relationship between these two variables, and plotnine lets us apply custom themes and colors. See the code below:
#making custom visualization of mpg vs displacement
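from plotnine import ggplot, aes, geom_point, geom_line
from plotnine.themes import theme_minimal
#assign() returns a new dataframe with the fitted values as an extra column
part_df = part_df.assign(fitted = linr_model.predict(part_df[["displacement"]]))
#parentheses let the chained layers span several lines without backslashes
(ggplot(part_df, aes("displacement", "mpg"))
 + geom_point(alpha = 0.5, color = "#2c3e50")
 + geom_line(aes(y = "fitted"), color = "blue")
 + theme_minimal())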
If we run this code, we can see the scatter plot with the regression line on it below:
Regression plot for mpg vs displacement
Let us close this article with some points to remember.
The LinearRegression() class from the sklearn.linear_model module is used to fit a linear regression model.
Displacement explains almost 65% of the variability in mpg (R-squared of 0.6467).
The Plotnine library and methods from it are used to plot the custom scatter plot with a regression line on it.
Based on the scatter plot, we can say that the lower the value of displacement, the higher the mpg value will be. That is, there is a negative correlation between these two variables. This was also evident from the negative value of the slope (remember, the slope or beta one is -0.0603).
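The link between the sign of the slope and the sign of the correlation can be checked numerically. The sketch below uses a handful of made-up displacement and mpg values (not rows from the Auto MPG data) and also illustrates that, for a simple linear regression, R-squared equals the square of the Pearson correlation.

```python
import numpy as np

# Made-up values that mimic the pattern in the plot: larger engines, lower mpg.
displacement = np.array([100.0, 150.0, 200.0, 300.0, 400.0])
mpg = np.array([32.0, 27.0, 24.0, 18.0, 14.0])

# Pearson correlation between the two variables.
r = np.corrcoef(displacement, mpg)[0, 1]

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(displacement, mpg, 1)

# R-squared of that line, computed from its residuals.
fitted = intercept + slope * displacement
r2 = 1.0 - np.sum((mpg - fitted) ** 2) / np.sum((mpg - mpg.mean()) ** 2)

print("correlation:", round(r, 4))
print("slope:", round(slope, 4))
print("r squared vs correlation squared:", round(r2, 4), round(r ** 2, 4))
```

Both the correlation and the slope come out negative, and the last line prints two matching numbers.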