The performance of a machine learning model is assessed using two key factors: accuracy and generalisation.
Accuracy refers to how well the model predicts the correct target value, whereas generalisation refers to how well the model performs on both known and unknown data.
Approximate a Target Function in Machine Learning
The best way to understand supervised machine learning is as the approximation of a target function (f) that maps input variables (X) to an output variable (Y).
Y = f(X)
This characterisation defines the types of classification and prediction problems that may be addressed, as well as the machine learning methods that can be employed to solve them.
How effectively the model generalises to new data is an essential factor when learning the target function from the training data. Generalisation is critical because the data we obtain is only a sample, and so it is incomplete and noisy.
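As a minimal sketch of this idea, the snippet below draws a noisy sample from a hypothetical target function, fits a straight line to part of it, and compares the error on the training points with the error on held-out points. The function f, the noise level, and the helper names fit_line and mse are assumptions made for illustration; in practice f is unknown.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def f(x):
    # Hypothetical target function, assumed only for this illustration.
    return 2.0 * x + 1.0

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [f(x) + rng.gauss(0, 1.0) for x in xs]  # noisy sample of Y = f(X)

# Hold out the last 50 points to estimate performance on unseen data.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

model = fit_line(train_x, train_y)
train_mse = mse(model, train_x, train_y)
test_mse = mse(model, test_x, test_y)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f}")
print(f"train MSE={train_mse:.2f} test MSE={test_mse:.2f}")
```

If the two errors are close, the learned approximation of f is generalising to data it has not seen; a test error far above the training error would suggest that it is not.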
Definition of generalization
The capacity of a model to adapt to new data is referred to as generalisation. That is, after being trained on a training set, a model can process fresh data and make correct predictions. The capacity of a model to generalise is critical to its success. A model will be unable to generalise if it has been fitted too closely to its training data.
Even if it can make correct predictions for the training data, it will make erroneous predictions when given fresh data, rendering the model worthless.
Consider, for example, training a model to distinguish between dogs and cats. If the model is trained on a dog picture dataset that includes only two breeds, it may perform well on those breeds. However, when it is evaluated on other dog breeds, it may receive a poor classification score. A picture of a dog from an unknown dataset may then be misclassified as a cat.
As a result, data variety is a critical element in making successful predictions. In the example above, the model may get an 85 percent performance score when it is trained and tested on only two dog breeds, and a 70 percent performance score when it is trained on all breeds.
However, if both models are assessed on an unknown dataset containing dogs of all breeds, the first may receive a much lower score (e.g. 45 percent), while the score of the second may remain largely unchanged, given that it was trained on data varied enough to include all conceivable breeds.
When we talk about how well a machine learning model learns and generalises to new data, we use the terms overfitting and underfitting.
Reasons for poor Machine learning modelling
When a model learns the detail and noise in the training data to the point that it degrades the model's performance on fresh data, this is known as overfitting. The model picks up on noise or random fluctuations in the training data and learns them as if they were real concepts. The issue is that these concepts do not apply to fresh data, limiting the model's capacity to generalise.
The more we train our model on the same data, the more likely it is to become overfitted. Overfitting is one of the main problems that occur in supervised learning.
When the method used to create the prediction model is too simple, it is unable to learn complicated patterns from the training data, resulting in underfitting. In that situation, accuracy is reduced on both the seen training data and the unseen test data, because the model cannot capture the dominant trend in the data.
Underfitting occurs when the model is unable to learn enough from the training data, resulting in lower accuracy and inaccurate predictions.
A model that is underfitted has a high bias and a low variance.
Assume that three students are preparing for a social science test.
The first student has only studied geography and has not studied political science and history.
The second pupil has an exceptional memory. As a result, the second student has remembered all of the textbook's questions.
The third student, on the other hand, has studied all of the social sciences and is well prepared for the exam.
Student one will only be able to answer questions about geography on the test and will fail any questions about the other social science subjects.
Student two will only be able to answer questions that appear in the textbook (which he has memorised) and will be unable to answer any other questions.
Student three will be able to reasonably solve all of the exam questions.
Machine learning algorithms behave similarly to our three students. Sometimes the model resembles the first student: it learns only a small part of what the training dataset has to offer, which is known as underfitting.
Sometimes, like the second student, the model memorises the whole training dataset. It performs admirably on known cases, but falls short on unseen data. The model is said to be overfitting in such circumstances.
And it is a good fit when the model, like student three, performs well on both the training dataset and unseen data.
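The three students can be mimicked with three toy models, as a rough sketch. The data here is synthetic (a hypothetical linear trend plus noise), and the three models are illustrative stand-ins chosen for this example: a constant predictor for student one, a nearest-neighbour memoriser for student two, and a least squares line for student three.

```python
import random

rng = random.Random(1)

def sample(n):
    """Draw n points from an assumed trend y = 3x plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0, 2.0) for x in xs]
    return xs, ys

train_x, train_y = sample(300)
test_x, test_y = sample(300)

def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Student one (underfitting): ignores x and always answers the training mean.
mean_y = sum(train_y) / len(train_y)
def underfit(x):
    return mean_y

# Student two (overfitting): memorises every training pair and answers
# with the label of the closest memorised input, noise included.
pairs = list(zip(train_x, train_y))
def memoriser(x):
    return min(pairs, key=lambda p: abs(p[0] - x))[1]

# Student three (good fit): a least squares line through the training data.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x
def good_fit(x):
    return slope * x + intercept

results = {}
for name, model in [("underfit", underfit),
                    ("memoriser", memoriser),
                    ("good fit", good_fit)]:
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:10s} train MSE={results[name][0]:8.2f} "
          f"test MSE={results[name][1]:8.2f}")
```

The memoriser has zero error on the points it has seen but a clearly higher error on new ones, the constant predictor is poor on both, and the fitted line has similar error on both sets.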
Ways to Prevent Overfitting or Underfitting
Detecting overfitting or underfitting is helpful, but it is not a solution to the problem. Fortunately, you have a variety of alternatives to choose from. A handful of the more common options are shown here.
Underfitting can often be remedied by moving on and experimenting with different, more expressive machine learning methods. It is nonetheless a useful counterpoint to the problem of overfitting.
Preventing underfitting and overfitting
There are several techniques to avoid overfitting, and a few of them are included here.
Regularisation is a technique for lowering model variance by imposing a penalty on large model coefficients. There are a variety of approaches for reducing a model's sensitivity to noise and outliers, including L1 (Lasso) regularisation, L2 (Ridge) regularisation, dropout, and so on.
Too much regularisation, on the other hand, can cause underfitting: the features are treated too uniformly, preventing the model from identifying the prevailing trend. When the degree of regularisation is reduced, more complexity and variance are allowed into the model, so that it can be trained successfully.
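The effect of the penalty strength can be sketched with a one-weight model. The example below uses an L2 (ridge) penalty because it has a simple closed form; the data points and the values of lam are made up for illustration.

```python
def ridge_weight(xs, ys, lam):
    """Weight w minimising sum((y - w*x)**2) + lam * w**2 (closed form)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

lams = [0.0, 10.0, 1000.0]
weights = [ridge_weight(xs, ys, lam) for lam in lams]
errors = [mse(w, xs, ys) for w in weights]

for lam, w, err in zip(lams, weights, errors):
    print(f"lam={lam:7.1f} weight={w:.3f} train MSE={err:.3f}")
```

As lam grows, the weight is pulled towards zero. A moderate penalty trades a little training error for lower variance, while a very large one flattens the model so much that it can no longer follow the trend, which is the underfitting case described above.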
Adding more data
When your model performs well on its training data but fails to generalise to fresh data, it is overfitting. Often this means that the data it was trained on is not representative of the data it would encounter in the field. Retraining your algorithm with a larger, richer, and more varied data set should help it perform better.
Unfortunately, obtaining more data might be challenging due to the high cost of collecting it or the fact that only a few samples are created on a regular basis. It could be a good idea to utilise data augmentation in that scenario.
Use Data augmentation
Data augmentation, which is less expensive than collecting extra data, is an alternative to the former. If you are unable to acquire new data on a regular basis, you can make the existing data set look more varied.
Data augmentation alters the appearance of a data sample each time it is processed by the model. Each sample therefore looks slightly different to the model every time, which discourages it from memorising the incidental details of individual samples.
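A minimal sketch of this, assuming a tiny grayscale image stored as a list of pixel rows: each call to the hypothetical augment helper returns a copy that may be mirrored and has its brightness shifted slightly. The particular transformations and their ranges are illustrative choices, not a recommended recipe.

```python
import random

def augment(image, rng):
    """Return a randomly altered copy of a grayscale image (list of rows)."""
    out = [row[:] for row in image]          # copy so the original is untouched
    if rng.random() < 0.5:                   # random horizontal flip
        out = [row[::-1] for row in out]
    shift = rng.randint(-10, 10)             # random brightness shift
    # Clamp every pixel to the valid 0..255 range after shifting.
    return [[min(255, max(0, p + shift)) for p in row] for row in out]

image = [[0, 50, 100],
         [150, 200, 250]]

rng = random.Random(42)
variants = [augment(image, rng) for _ in range(4)]
for v in variants:
    print(v)
```

In a training loop, augment would be applied each time a sample is drawn, so the model rarely sees exactly the same pixels twice.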
Increase the duration of training
Ending training too soon might lead to an underfit model, which can be prevented by prolonging the training period. However, it's critical to avoid overtraining and, as a result, overfitting. The goal is to strike a balance between the two.
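One common way to strike this balance, not named in the text above, is early stopping: keep training while the loss on a held-out validation set improves, and stop once it has failed to improve for a set number of epochs. The sketch below assumes the validation losses are already recorded in a list; the numbers and the patience value are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has not improved
    for `patience` consecutive epochs, or at the last epoch otherwise.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Invented validation losses: they fall at first (underfitting shrinking),
# then rise again (overfitting setting in).
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.56, 0.63]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch={best}, stopped at epoch={stopped}")
```

The model saved at the best epoch is the one to keep; the extra epochs only confirm that the validation loss has really started to rise.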
Select Specific features
Specific features are used to determine a given outcome in any model. If there aren't enough predictive features, new or more informative features should be added.
Similarly, you could increase the capacity of the model itself, for example by adding hidden neurons to a neural network or more trees to a random forest. This adds complexity to the model, resulting in improved training outcomes.
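As a small illustration of how a more informative feature can cure underfitting, the sketch below fits a single weight to made-up data that follows y = x squared, first using x itself as the feature and then using x squared. The data and the helper names are assumptions for this example only.

```python
def fit_weight(features, ys):
    """Least squares weight for y ≈ w * feature (no intercept)."""
    return sum(f * y for f, y in zip(features, ys)) / sum(f * f for f in features)

def mse(w, features, ys):
    return sum((w * f - y) ** 2 for f, y in zip(features, ys)) / len(ys)

# Made-up data with a curved trend: y = x ** 2.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]

# Feature set 1: the raw input only. A straight line cannot follow the curve.
linear_feature = xs
w_linear = fit_weight(linear_feature, ys)
mse_linear = mse(w_linear, linear_feature, ys)

# Feature set 2: a more informative engineered feature, x squared.
squared_feature = [x * x for x in xs]
w_squared = fit_weight(squared_feature, ys)
mse_squared = mse(w_squared, squared_feature, ys)

print(f"raw x:     weight={w_linear:.2f} MSE={mse_linear:.2f}")
print(f"x squared: weight={w_squared:.2f} MSE={mse_squared:.2f}")
```

The learning method is identical in both cases; only the feature changes, and with it the model goes from missing the trend entirely to fitting it.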
Overfitting is a modelling error in which the model is fitted too closely to its training data set. Overfitting limits the model's usefulness to its own data set and renders it useless on other data sets. Ensembling, data augmentation, data simplification, and cross-validation are some of the approaches used to avoid overfitting.
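Of these approaches, cross-validation is the easiest to sketch. The hypothetical helper below splits sample indices into k folds so that every sample is used for validation exactly once; in practice the indices are usually shuffled first, which is omitted here to keep the output easy to read.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: train={train} validation={val}")
```

A model is trained once per fold on the train indices and scored on the validation indices; averaging the k scores gives a steadier estimate of how well it generalises than a single split.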