
Introduction to Cross-Validation in Machine Learning

  • Ayush Singh Rawat
  • Jun 14, 2021



Today, machines are expected to deliver fast and reliable results to keep pace with an ever-changing world. To do so, they rely on machine learning models that learn patterns from data and reproduce them on new inputs.


Validating the stability of your machine learning model is always necessary. You can't just fit the model to your training data and expect it to perform correctly with real data it's never seen before.


You need some assurance that your model has correctly identified most of the patterns in the data and is not overly sensitive to noise; in other words, that it has low bias and low variance.



What is Cross-validation?


Cross-validation is a statistical approach for estimating the performance (or accuracy) of a machine learning model. It is a technique in which we train the model on one subset of the data set and then evaluate it on the remaining portion.


It is used to detect and guard against overfitting in a predictive model, particularly when the available data is limited. In cross-validation, the data is divided into a fixed number of folds (or partitions), the analysis is run on each fold, and the resulting error estimates are averaged.


When working on a machine learning task, you must first understand the problem in order to select the algorithm that is likely to give the best result. One way to evaluate a model is to test it on the same dataset that was used to train it, but this is not very informative, because the model has already seen those examples.


Our primary goal is for the model to perform well on real-world data; yet, while the training dataset is also real-world data, it only represents a small portion of all the data points (examples) available.


So, in order to determine the model's true score, it must be tested on data that it has never seen before, which is referred to as a testing set.



What is the procedure of cross-validation?


The following is the general procedure:


  1. Randomly shuffle the dataset.

  2. Split the data into k groups.

  3. For each distinct group:

  •  Use the group as the holdout (test) data set.

  •  Use the remaining groups as the training data set.

  •  Fit a model on the training set and evaluate it on the test set.

  •  Keep the evaluation score and discard the model.

  4. Summarise the model's skill using the sample of evaluation scores.


Importantly, each observation in the data sample is assigned to exactly one group and remains there for the duration of the procedure. This means that each sample is used in the holdout set once and is used to train the model k-1 times.
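The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) and the toy data are assumptions made for this example, and the "model" simply predicts the mean of the training targets.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # step 1: shuffle
    return [indices[i::k] for i in range(k)]    # step 2: k groups

def cross_validate(xs, ys, k, fit, score, seed=0):
    """Return one evaluation score per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):   # step 3: each group is the test set once
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model,
                            [xs[i] for i in held_out],
                            [ys[i] for i in held_out]))
        # The fitted model is discarded here; only its score is kept.
    return scores

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)

def mean_abs_error(model, test_x, test_y):
    return mean(abs(y - model) for y in test_y)

xs = list(range(20))
ys = [2 * x + 1 for x in xs]                    # made-up data
fold_scores = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
print("per-fold MAE:", fold_scores)
print("mean MAE:", mean(fold_scores))           # step 4: summarise
```

Passing `fit` and `score` as functions keeps the splitting logic independent of the model, so the same loop can be reused with any learner.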





Purpose of Cross-validation


When we train a model on the training set, it often tends to overfit, which is why we use regularisation techniques. Because we usually have only a limited number of training instances, we must be careful about shrinking the training set further by setting samples aside for testing.


The easiest way to estimate the system's performance without sacrificing too much data is to validate it on a small portion of the training data, since this gives us an indication of the model's ability to predict unseen data.


K-fold cross-validation is a prominent type of cross-validation approach in which, for example, if k=10, 9 folds are used for training and 1 fold is used for testing, and this process repeats until all folds have a chance to be the test set one by one. 


This gives us a decent indication of the model's generalisation capabilities, which is useful when we only have a limited amount of data and can't afford to separate it into test and training data.





Types of Cross-validation


Five common types of cross-validation are described below.


  1. K-Fold Cross-Validation


There is rarely enough data to train a machine learning model. If we set part of the data aside for validation, the model is at risk of overfitting the smaller training set. It may also fail to detect a dominant pattern if the training phase isn't given enough data.


By reducing the training data, we risk losing accuracy through error induced by bias. To address this, we need a strategy that provides ample data for training while still leaving some data for testing. K-fold cross-validation does this: every fold serves as the test set once and as part of the training set in the remaining rounds.



  2. Stratified K-fold Cross-Validation


This method is a slight variation of k-fold cross-validation. The folds are constructed so that each one contains roughly the same percentage of samples from each target class as the entire set.


For regression problems, the folds are constructed so that the mean response value is roughly equal in all folds. Stratification matters because in some cases the response variable is heavily imbalanced.


Let's look at an example to better grasp this. In a house-pricing problem, the prices of some properties might be much higher than the prices of others.


Furthermore, in classification problems, the negative instances may greatly outnumber the positive instances. We use stratified k-fold cross-validation to address this issue.
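As a rough sketch of the idea for a classification problem, the snippet below deals the indices of each class out to the folds in round-robin fashion, so every fold receives its share of each class. The function name `stratified_k_fold` and the 90/10 label split are assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Assign sample indices to k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal this class out round-robin so each fold gets its share.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds

# Made-up imbalanced labels: 90 negative and 10 positive samples.
labels = ["neg"] * 90 + ["pos"] * 10
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With these labels, each of the five folds ends up with 18 negative and 2 positive samples, matching the 90:10 ratio of the full set. A plain random split gives no such guarantee and could leave a fold with no positive samples at all.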





  3. Holdout Method


Among all the cross-validation methods, this is the simplest. We assign data points at random to two data sets, a training set and a test set. The split ratio is not fixed, although the training set is usually the larger of the two.


The basic idea is to hold out a portion of your data and use it to evaluate a model that has been trained on the rest. Because all of this is done in a single run, the resulting estimate has high variance: it depends heavily on which data points happen to end up in each set. It may therefore produce misleading results.
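The variability of a single holdout estimate is easy to see by repeating the split with different random seeds. The sketch below assumes a hypothetical helper `holdout_split`, made-up noisy data, and a trivial model that predicts the training mean; only the spread of the resulting errors is of interest.

```python
import random
from statistics import mean

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Randomly split indices into a training list and a test list."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]       # train, test

# Made-up data: noisy targets scattered around 10.
data_rng = random.Random(42)
ys = [10 + data_rng.gauss(0, 3) for _ in range(50)]

errors = []
for seed in range(5):
    train, test = holdout_split(len(ys), test_fraction=0.2, seed=seed)
    prediction = mean(ys[i] for i in train)          # trivial "model"
    errors.append(mean(abs(ys[i] - prediction) for i in test))

# Same data and same model, but a different error for each random split.
print([round(e, 2) for e in errors])
```

The model and the data never change between runs, so any difference between the five numbers comes purely from which points landed in the holdout set.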



  4. Leave-p-out Cross-Validation


p data points are left out of the training data in this method. If the data set has m data points, then the training phase will employ m-p data points. 


The validation set is made up of the p data points. Because this procedure is repeated for every possible combination of p points in the original data set, the method is exhaustive.


The error is averaged over all trials to determine the model's overall effectiveness. Because the model must be trained and validated for every possible combination, the method becomes computationally infeasible when p is large.
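To make the cost concrete: with m data points there are "m choose p" ways to pick the validation set, and the model is trained once per choice. The sketch below enumerates the splits for a tiny data set and counts them for larger ones; `leave_p_out_splits` is a name made up for this example.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (train, validation) index lists for every choice of p points."""
    all_indices = range(n_samples)
    for held_out in combinations(all_indices, p):
        held = set(held_out)
        train = [i for i in all_indices if i not in held]
        yield train, list(held_out)

splits = list(leave_p_out_splits(5, 2))
print(len(splits))                      # 10 = C(5, 2)

# Number of models that would have to be trained for m = 100 data points.
for m, p in [(100, 1), (100, 2), (100, 5)]:
    print(m, p, comb(m, p))
```

For m = 100, moving from p = 1 to p = 5 raises the number of training runs from 100 to 75,287,520, which is why the method is rarely used with large p.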





  5. Leave-one-out Cross-Validation


This method is identical to leave-p-out cross-validation with p = 1. With m data points there are only m possible splits, so it saves a significant amount of time compared with larger values of p, which is a significant benefit.


If the data set is very large, it can still take a long time, since the model must be trained m times. However, it is generally faster than leave-p-out cross-validation with p greater than 1.
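A minimal leave-one-out loop might look like the following. The least-squares line fit (`fit_line`) and the six data points are assumptions chosen to keep the example self-contained; any model could take their place.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def leave_one_out_errors(xs, ys):
    """Train on all points but one, test on the one left out, for every point."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(abs(ys[i] - (slope * xs[i] + intercept)))
    return errors

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]     # made-up values, roughly y = 2x
errors = leave_one_out_errors(xs, ys)
print(len(errors), round(mean(errors), 3))
```

The loop fits exactly one model per data point, so the cost grows linearly with the size of the data set.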



Advantages of Cross-Validation 


  1. Use All Your Data


Splitting data into training and test sets when there isn't much of it can leave us with a very small test set. If we only have 100 samples, a simple 80–20 split yields just 20 samples in the test set, which is not enough.


Due to chance, we can achieve practically any performance on this set. When we have a multi-class problem, the situation becomes significantly worse. If we have ten classes and only twenty examples, we will have just two examples per class on average. It's impossible to draw any meaningful conclusions based on simply two cases.


In this situation, we can use cross-validation to build K distinct models (K = 5 for the 80–20 split above), allowing us to make predictions on all of our data. For each sample, the prediction comes from a model that has not seen that sample during training, which gives us a total of 100 samples in our test set.


In the multi-class problem, we now have ten samples per class on average, which is substantially better than only two. After evaluating our learning algorithm, we can train the model on all of our data: if the 5 models performed similarly on their different training sets, we can expect a model trained on all of the data to perform comparably.


By using cross-validation, we can use all 100 samples for both training and testing, while still evaluating our learning algorithm on samples it has never seen before.
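The idea of predicting every one of the 100 samples with a model that never saw it can be sketched as follows. The nearest-centroid classifier (`fit_centroids`, `predict`) and the synthetic one-dimensional data are assumptions made purely for illustration.

```python
import random
from statistics import mean

def fit_centroids(xs, ys):
    """Toy classifier: remember the mean feature value of each class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

rng = random.Random(0)
ys = [i % 2 for i in range(100)]                  # 50 samples per class
xs = [rng.gauss(3.0 * y, 1.0) for y in ys]        # made-up 1-D feature

k = 5
order = list(range(100))
rng.shuffle(order)
folds = [order[i::k] for i in range(k)]

out_of_fold = [None] * 100
for fold in folds:
    held = set(fold)
    train = [i for i in range(100) if i not in held]
    model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
    for i in fold:
        out_of_fold[i] = predict(model, xs[i])    # this model never saw sample i

accuracy = mean(int(p == y) for p, y in zip(out_of_fold, ys))
print("predictions made:", sum(p is not None for p in out_of_fold))
print("cross-validated accuracy:", accuracy)

# After evaluation, train the final model on all 100 samples.
final_model = fit_centroids(xs, ys)
```

The accuracy is computed over all 100 samples instead of a 20-sample test set, and the final model still benefits from the full data set.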





  2. Use Model Stacking


In model stacking, a second model is trained on the predictions of a first model. We can't train both models on the same dataset, since the second model learns from predictions made by the first. On data the first model was trained on, those predictions will almost certainly be overfitted, or at least better than they would be on a different set.


This means that our second model is trained on inputs that differ from those it will be evaluated on, which can have effects on the final evaluation that are hard to understand.


By using cross-validation, we can make predictions on our whole dataset in the same way as before, so the second model's inputs are genuine predictions on data that the first model has never seen.
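The difference between in-sample and out-of-fold predictions is most visible with a first-level model that memorises its training data. The sketch below uses a hypothetical 1-nearest-neighbour model on made-up data; the names `nearest_neighbour_predict` and `meta_features` are assumptions for this example, and the second-level model itself is left out.

```python
import random

def nearest_neighbour_predict(train_x, train_y, x):
    """First-level toy model: 1-nearest-neighbour on a single feature."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

rng = random.Random(1)
ys = [i % 2 for i in range(60)]
xs = [rng.gauss(1.5 * y, 1.0) for y in ys]        # made-up, overlapping classes

# In-sample predictions: every point is its own nearest neighbour.
in_sample = [nearest_neighbour_predict(xs, ys, x) for x in xs]

# Out-of-fold predictions: each point is predicted without its own fold.
k = 5
order = list(range(len(xs)))
rng.shuffle(order)
out_of_fold = [None] * len(xs)
for fold in (order[i::k] for i in range(k)):
    held = set(fold)
    train = [i for i in range(len(xs)) if i not in held]
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    for i in fold:
        out_of_fold[i] = nearest_neighbour_predict(tx, ty, xs[i])

def accuracy(preds):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

print("in-sample accuracy:  ", accuracy(in_sample))
print("out-of-fold accuracy:", accuracy(out_of_fold))

# Inputs for a second-level model: the first model's out-of-fold predictions.
meta_features = [[p] for p in out_of_fold]
```

The in-sample predictions simply reproduce the labels, so a second model trained on them would learn to trust the first model completely. The out-of-fold predictions typically contain the kinds of mistakes the first model makes on unseen data, which is what the second model needs to learn from.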



  3. Parameter Fine-Tuning


Parameter tuning is one of the most common and obvious reasons to use cross-validation. Most learning algorithms require some parameter tuning. It might be the number of trees in a Gradient Boosting classifier, the hidden layer size or activation functions of a Neural Network, the type of kernel in an SVM, and so on.


We want to find the best settings for our problem. We do this by trying different values and selecting the ones that perform best. This can be done in several ways.


We might use a manual search, a grid search, or a more sophisticated optimization. In all of those cases, however, we cannot do the tuning on our training set, and of course not on our test set either. A third set, a validation set, is required.


Splitting our data into three sets instead of two brings back all of the problems discussed earlier, especially if we don't have a lot of data. By using cross-validation, we can carry out all of those steps with a single set.
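A small grid search driven by cross-validation could be sketched like this. The k-nearest-neighbours classifier, the candidate values in `grid`, and the synthetic data are all assumptions for illustration; the point is only that each candidate setting is scored by cross-validation on a single data set.

```python
import random
from collections import Counter
from statistics import mean

def knn_predict(train_x, train_y, x, n_neighbours):
    """Majority vote among the n_neighbours closest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x))[:n_neighbours]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(xs, ys, n_neighbours, k=5, seed=0):
    """Mean k-fold accuracy for one hyperparameter value."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)        # same folds for every candidate
    fold_scores = []
    for fold in (order[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        tx, ty = [xs[i] for i in train], [ys[i] for i in train]
        hits = [int(knn_predict(tx, ty, xs[i], n_neighbours) == ys[i])
                for i in fold]
        fold_scores.append(mean(hits))
    return mean(fold_scores)

rng = random.Random(7)
ys = [i % 2 for i in range(80)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]    # made-up 1-D feature

grid = [1, 3, 5, 9, 15]                        # candidate values to try
results = {n: cv_accuracy(xs, ys, n) for n in grid}
best = max(results, key=results.get)
print(results)
print("selected n_neighbours:", best)
```

Using the same seed for every candidate means all settings are compared on identical folds, so differences in score reflect the parameter and not the split.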







Cross-validation is a systematic approach to evaluating and improving a machine learning model, and it does so using only the data that is already available.




As the advantages and procedures described above show, it is one of the easiest and most effective methods for finding weaknesses in a model and correcting them.
