There is an old story in which a father hands his sons a bundle of four wooden sticks and asks each of them to break it. Every one of them fails. The father then gives each son a single stick from the bundle and asks him to break it, and this time each of them succeeds. The moral is that what is weak on its own can become strong when combined. This is what boosting is all about.
Boosting combines several weak learners (simple rules or models that cannot classify the problem properly on their own) and then takes a majority vote among them to decide which category the input falls in.
Let’s consider an example. Suppose we need to classify whether a given image shows a horse or a donkey. We could look at various factors such as height, width, tail length, and more. The problem is that none of these factors alone can tell perfectly whether the image shows a donkey or a horse. We therefore consider all of them and take a majority vote, in short, combining all the weak learners to form a strong predictor.
Above are a few weak learners, each with its own output. If we take a majority vote, we find that most of the weak learners say the input is probably a horse. This is the concept of boosting.
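The voting step can be sketched in a few lines of Python. This is only an illustration: the feature names and the vote each weak learner casts are made up, and majority_vote is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical outputs of five weak learners for a single image.
# Each key is the factor the learner looks at, each value is its vote.
weak_learner_votes = {
    "height": "horse",
    "width": "donkey",
    "tail_length": "horse",
    "ear_length": "donkey",
    "mane": "horse",
}


def majority_vote(votes):
    """Return the label that receives the most votes."""
    return Counter(votes).most_common(1)[0][0]


prediction = majority_vote(weak_learner_votes.values())
print(prediction)  # horse (3 votes against 2)
```

No single factor is trusted on its own here; the final label is simply whichever one most of the weak learners agree on.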
What is ensemble learning?
Ensemble learning is used to improve the accuracy of a machine learning model. To do so, it takes the decisions of several models and combines them to reach a better decision. Two simple ways of combining them are max voting, which we discussed earlier, and averaging.
The averaging method is easy to implement. Suppose you have three prediction scores p1, p2, and p3 from different models, such as logistic regression, a decision tree, and K-nearest neighbours (KNN). To take the average, all we need to do is:
(p1 + p2 + p3) / 3
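As a small worked example, assume the three models return the made-up scores below for the same input (the numbers are illustrative only):

```python
# Hypothetical predicted probabilities of the positive class
p1 = 0.9   # logistic regression
p2 = 0.6   # decision tree
p3 = 0.75  # KNN

# The parentheses matter: without them only p3 would be divided by 3.
average = (p1 + p2 + p3) / 3
print(round(average, 2))  # 0.75
```

The averaged score can then be compared against a threshold such as 0.5 to get the final class.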
Two of the more advanced techniques in ensemble learning are bagging and boosting. Let’s discuss both of them briefly:
The term bagging (short for bootstrap aggregating) means that several sub-datasets are drawn from the original dataset, and each one serves as the training set for an individual model. To make each sub-dataset the same size as the original, its rows are sampled with replacement, so some rows appear more than once and others not at all. This sampling process is known as bootstrapping.
Bagging in Ensemble Learning
The image above represents bagging. Here the original dataset is distributed into three subsets, and each sub-dataset is fed to a model, most often a decision tree. Each model then makes its predictions, and at the end we combine all the predictions by max voting or averaging to get a strong prediction. All the weak learners combine to form a strong learner, which helps increase the accuracy of the model.
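Bootstrapping itself takes only a few lines. The sketch below uses a made-up dataset of ten row ids and a hypothetical helper bootstrap_sample; in practice each sample would be used to train one model.

```python
import random


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) rows at random, with replacement."""
    return rng.choices(dataset, k=len(dataset))


rng = random.Random(0)      # fixed seed so the run is repeatable
dataset = list(range(10))   # hypothetical dataset: ten row ids

# Three sub-datasets, one per model, each as large as the original.
subsets = [bootstrap_sample(dataset, rng) for _ in range(3)]
for subset in subsets:
    print(sorted(subset))
```

Because rows are drawn with replacement, a subset usually contains some rows more than once and misses others, which is what makes the individual models differ from each other.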
Boosting adds an extra layer of refinement to the model. You are now aware that in bagging each sub-dataset goes through a model to predict an output. In boosting the models are instead trained one after another, and each subsequent model tries to reduce the errors of the previous ones. A few steps are involved in the working of boosting:
The base algorithm assigns a weight to every data point in the dataset; in AdaBoost all weights start out equal.
A model is trained, and its errors are found by comparing the predicted values with the actual values.
The data points that were predicted wrongly have their weights updated: they are assigned higher weights.
Another model is trained with the updated weights, and it tends to do better on the points the previous model got wrong.
In this way every consecutive model learns from the previous one, and at the end the outputs of all the models are combined (AdaBoost uses a weighted vote) to form the final outcome.
The representation above shows the combination of all the weak learners with their updated data points, and at the end we can see the generalized outcome of combining them. It also shows that a model that does not perform well on the whole dataset can still give good results when trained over a portion of it.
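The steps above can be sketched from scratch in plain Python. This is a minimal illustration of the AdaBoost weight update, not the sklearn implementation: the toy data is made up, the weak learner is a one-threshold "decision stump", and the helper names (stump_predict, fit_stump, adaboost, predict) are hypothetical.

```python
import math

# Toy 1-D dataset (made-up values); labels are +1 or -1.
# No single threshold separates it, so one stump is only a weak learner.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x is below the threshold, else `-polarity`."""
    return polarity if x < threshold else -polarity


def fit_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    thresholds = [x + 0.5 for x in X] + [min(X) - 0.5]
    for threshold in thresholds:
        for polarity in (1, -1):
            # Weighted error: total weight of the misclassified points.
            error = sum(w for xi, yi, w in zip(X, y, weights)
                        if stump_predict(xi, threshold, polarity) != yi)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(X, y, n_rounds):
    n = len(X)
    weights = [1.0 / n] * n   # every point starts with the same weight
    ensemble = []             # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(X, y, weights)
        error = max(error, 1e-10)                    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's say
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points get heavier, correct ones get lighter.
        weights = [w * math.exp(-alpha * yi * stump_predict(xi, threshold, polarity))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]       # renormalise to sum to 1
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


single = adaboost(X, y, n_rounds=1)
boosted = adaboost(X, y, n_rounds=3)
single_errors = sum(predict(single, xi) != yi for xi, yi in zip(X, y))
boosted_errors = sum(predict(boosted, xi) != yi for xi, yi in zip(X, y))
print(single_errors, boosted_errors)
```

On this toy data a single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly, which is the "weak learners combine into a strong one" idea in miniature.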
Some of the boosting algorithms are-:
Let’s implement the AdaBoost algorithm. The principle remains the same as for boosting in ensemble learning.
AdaBoost Algorithm Python Implementation Using Sklearn
Our first step is to pre-process the dataset and split it into training and testing parts.
Step 1: Import the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
We have imported pandas to read the dataset, along with the numpy and matplotlib Python libraries.
ds = pd.read_csv('../datasets/titanic.csv')
This reads the dataset. Printing the first few rows, for example with ds.head(), shows what the dataset looks like.
In our next step, we keep only the columns we need and assign a numerical value to the sex of each passenger: ‘0’ for males and ‘1’ for females.
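def quantify_sex(sex):  # helper assumed from the description: 0 for male, 1 for female
    return 0 if sex == 'male' else 1
ds = ds[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']].copy()
ds['Sex'] = ds['Sex'].apply(quantify_sex)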
In our next step we assign the feature columns (everything except ‘Survived’) to the variable ‘X’ and the target column ‘Survived’ to ‘y’.
X = ds.drop(columns=['Survived'])
y = ds['Survived']
Importing train_test_split from sklearn to split the dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Now, import AdaBoostClassifier from sklearn and create the classifier.
from sklearn.ensemble import AdaBoostClassifier
ac = AdaBoostClassifier(random_state=90)
In the next step we train the model by calling ac.fit(X_train, y_train). As we can see, with the help of sklearn and the AdaBoost classifier the process has become really easy.
Now it’s time to calculate the accuracy, or score, of our AdaBoost classifier, for example with ac.score(X_train, y_train).
On the training dataset, the accuracy is about 85%. Next we calculate the accuracy score on the testing dataset.
On the testing dataset, we got an overall accuracy of 83 percent.
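For readers without the Titanic file at hand, the whole train-and-score flow can be tried on synthetic data. This is a hedged sketch: make_classification is used as a stand-in dataset (an assumption, not the data used above), so the accuracies it prints will differ from the 85% and 83% reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six Titanic feature columns (assumption).
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=4, random_state=0)

# Default split: 75% of the rows for training, 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

ac = AdaBoostClassifier(random_state=90)
ac.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
train_accuracy = ac.score(X_train, y_train)
test_accuracy = ac.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

Comparing the two numbers is a quick check for overfitting: a training accuracy far above the testing accuracy suggests the ensemble has memorised the training rows.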
The AdaBoost algorithm is an effective way to boost the performance of a machine learning model by combining weak learners to form a strong predictor.
However, if the weak learners are too weak, it may lead to overfitting, and because each model is trained on the results of the previous one, boosting is difficult to scale. With these limitations in mind, the algorithm is well worth using when the goal is to increase the performance of a model.