What Are Different Loss Functions Used as Optimizers in Neural Networks?

  • Rohit Dwivedi
  • Jun 17, 2020
  • Deep Learning
  • Machine Learning

The final goal in machine learning is to maximize or minimize an “objective function”. The loss function is used to measure how well or how badly the model is performing. It estimates how far the model's predictions are from the true values and, on unseen data, how well the model generalizes.


For example, suppose we have to identify dogs in a set of images. The dataset contains more than 100 mixed images of dogs and cats. Each picture with a dog is labeled ‘1’, and each picture with no dog is labeled ‘0’. To solve the problem, the images are fed into a network that returns a floating-point number, which is used to predict which class, 0 or 1, each image belongs to. If the outcome is 1, a dog is present; if it is 0, there is no dog.


But neural networks give us real numbers as outcomes, such as 0.1, 0.7, and 0.8, and from these numbers we have to decide whether, say, 0.1 indicates a dog or not. Evidently, 0.8 is closer to 1, so an output of 0.8 makes a dog more likely than an output of 0.5. But there will be cases where an image with an output of 0.5, or even 0.1, is in fact a dog. Yes, back-propagation is there to tune the parameters. But before that, the predicted result has to be compared with the actual result. This is where loss functions come into the picture.
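To make the dog example concrete, here is a minimal sketch of turning real-valued outputs into class predictions and counting the mistakes. The outputs, the labels, and the 0.5 cut-off are all made up for illustration; the article does not fix a threshold.

```python
# Hypothetical network outputs and true labels (1 = dog, 0 = no dog).
outputs = [0.1, 0.7, 0.8, 0.5]
labels = [1, 1, 1, 0]

THRESHOLD = 0.5  # assumed cut-off, not prescribed by the article


def to_class(score, threshold=THRESHOLD):
    """Map a real-valued output to class 1 (dog) or 0 (no dog)."""
    return 1 if score >= threshold else 0


predicted = [to_class(s) for s in outputs]
mistakes = sum(p != y for p, y in zip(predicted, labels))

print(predicted)  # [0, 1, 1, 1]
print(mistakes)   # 2
```

The first image is a dog even though the network returned 0.1, which is exactly the kind of disagreement a loss function has to quantify.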


Without systematic validation against the true labels, the activation results can become meaningless. Also, there is no single loss function that can be used everywhere. The right choice depends on a variety of factors, such as the type of problem and the presence of outliers in the data.



Different types of Loss Functions


Loss functions are mainly classified into two categories: classification losses and regression losses. A classification loss is used when the aim is to predict one of several categorical values. For example, given a dataset of images of handwritten digits, the task is to predict which digit (0-9) each image shows.


A regression loss is used when the problem is to predict continuous values, for example, forecasting weather conditions or predicting house prices on the basis of some features.


Classification Losses


  • Cross Entropy Loss / Log Loss


It measures the performance of a classification model whose output is a probability between 0 and 1. As the predicted probability diverges from the true label, the cross-entropy loss increases. A perfect model has a log loss of 0. Cross-entropy and log loss are defined slightly differently, but when computing the error for binary labels 0 and 1 they give the same result.



Graph of cross-entropy loss / log loss when the true label is 1.
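As a rough illustration of how this loss grows, here is a minimal pure-Python sketch of binary cross-entropy. The function name and the `eps` clipping value are choices made for this example, not part of any particular library.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy (log loss) over a batch of examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# True label is 1; the loss rises as the predicted probability moves away from 1.
for p in (0.9, 0.5, 0.1):
    print(p, round(binary_cross_entropy([1], [p]), 4))
# 0.9 0.1054
# 0.5 0.6931
# 0.1 2.3026
```

A confident wrong prediction (0.1 for a true label of 1) is penalized far more than an uncertain one (0.5).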


  • Hinge Loss


Another loss for binary classification tasks is the hinge loss, which was originally developed for use with support vector machine models. It is recommended when the target labels are -1 and 1. Hinge loss encourages predictions to have the correct sign, assigning more error when the sign of the predicted value differs from the sign of the true label.



Graph of hinge loss against predicted values.
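The behaviour described above can be sketched in a few lines, assuming the standard definition max(0, 1 - y * score) with labels in {-1, +1}. The function name and the sample scores are illustrative only.

```python
def hinge_loss(y_true, scores):
    """Mean hinge loss; labels must be -1 or +1, scores are raw model outputs."""
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(y_true, scores)]
    return sum(losses) / len(losses)


print(hinge_loss([1], [2.0]))   # 0.0  correct sign, outside the margin
print(hinge_loss([1], [0.5]))   # 0.5  correct sign, but inside the margin
print(hinge_loss([1], [-2.0]))  # 3.0  wrong sign
```

The loss is zero only when the prediction has the right sign and a margin of at least 1, and it grows linearly as the score moves to the wrong side.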

  • Square Loss


Hinge loss has several extensions. A popular one is the squared hinge loss, which simply computes the square of the hinge loss. Squaring smooths the error and makes it numerically easier to work with. If hinge loss does not give good results, the squared hinge loss may perform more reliably.

Graph of square loss against predicted values.
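For comparison, here is a small sketch that puts the hinge loss and its squared variant side by side for a true label of +1. The helper names and the sample scores are made up for this example.

```python
def hinge(y, score):
    """Per-example hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def squared_hinge(y, score):
    """Per-example squared hinge loss: the hinge loss, squared."""
    return hinge(y, score) ** 2


for score in (2.0, 0.5, -2.0):
    print(score, hinge(1, score), squared_hinge(1, score))
# 2.0 0.0 0.0
# 0.5 0.5 0.25
# -2.0 3.0 9.0
```

Small margin violations are penalized less than with the plain hinge loss, while large ones are penalized much more.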

Other classification losses worth reading about include focal loss, logistic loss, and exponential loss.



Regression Losses


  • Mean Square Loss / L2 Loss 


It is the most commonly used regression loss, computed as the average of the squared differences between the actual and predicted observations. It considers only the average magnitude of the error, ignoring its direction. Because of the squaring, predictions that are far from the true values are penalized much more heavily than predictions that deviate less. The mathematical properties of L2 loss also make its gradients easy to compute.

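Formula to compute mean squared error, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$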
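A minimal sketch of the computation, using made-up true and predicted values:

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

# Errors are 1, 0, -2, 0; their squares are 1, 0, 4, 0.
print(mean_squared_error(y_true, y_pred))  # 1.25
```

The single error of size 2 contributes four times as much as the error of size 1, which is the heavier penalty on distant predictions described above.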

  • Mean Absolute Error


It is computed as the average of the absolute differences between the true and predicted values. Like MSE, it measures the magnitude of the error, ignoring its direction. Gradients are harder to compute for MAE, since the absolute value is not differentiable at zero and more complicated tools such as linear programming may be needed. Because MAE does not square the errors, it is more robust to outliers.

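Formula to compute mean absolute error, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$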
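The robustness to outliers can be seen in a small sketch that compares MAE and MSE on two hypothetical sets of predictions, one with small errors everywhere and one with a single large miss. All numbers are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences, included here for comparison."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
close = [1.5, 2.5, 2.5, 4.5]     # every prediction is off by 0.5
outlier = [1.0, 2.0, 3.0, 14.0]  # one prediction is off by 10

print(mean_absolute_error(y_true, close), mean_squared_error(y_true, close))      # 0.5 0.25
print(mean_absolute_error(y_true, outlier), mean_squared_error(y_true, outlier))  # 2.5 25.0
```

With the single outlier, MAE grows by a factor of 5 while MSE grows by a factor of 100.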

  • Mean Bias Error 


It is much less commonly used as a regression loss. MBE is almost the same as MAE; the only difference is that absolute values are not taken, so positive and negative errors can cancel each other out. That makes it a less accurate measure of error, but it can be used to check whether the model has a negative or a positive bias.

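Formula to compute mean bias error, where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations (the error is taken here as true minus predicted, which is an assumed sign convention):

$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$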
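A short sketch of how MBE can reveal the direction of a bias. The sign convention (true minus predicted) is an assumption made for this example, so a negative value here means the model over-predicts on average; the data are made up.

```python
def mean_bias_error(y_true, y_pred):
    """Mean of signed errors, taken as true minus predicted (assumed convention)."""
    return sum(y - p for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences, included here for comparison."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


# A model that consistently predicts too high.
print(mean_bias_error([3.0, 5.0, 2.0, 7.0], [4.0, 6.0, 2.0, 9.0]))  # -1.0

# Errors of +2 and -2 cancel: MBE is zero although every prediction is wrong.
print(mean_bias_error([3.0, 5.0], [1.0, 7.0]))      # 0.0
print(mean_absolute_error([3.0, 5.0], [1.0, 7.0]))  # 2.0
```

The second case shows why MBE is useful as a bias check but unreliable as a measure of overall error.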

You can check the Keras documentation on loss functions, where different probabilistic losses and regression losses are listed along with their explanations.




It is very important to check whether your model is able to generalize. For this purpose we use a loss function to measure how well or how badly the model is performing.


In this blog, I have discussed what loss functions are and the different types of loss functions used for classification as well as regression problems in predictive modelling. There are many other losses used to compute error; they can be found in the Keras documentation, which lists and discusses a variety of loss functions for different scenarios.