A Review of the Paper: A General and Adaptive Robust Loss Function
Loss functions are at the heart of machine learning algorithms: they evaluate how well a model is optimized. A loss function scores model performance. If the model predicts the target variable well, the loss function outputs a low score; if the loss function outputs a high score, the model predicts the target variable poorly. Most machine learning algorithms minimize a loss function during optimization to find the best parameters or weights for the data. For example, in linear regression the goal is to find the best-fit line, the one that minimizes the squared distances between the predicted values y-hat and the ground-truth values y, i.e. the mean squared error, typically using an optimization algorithm such as gradient descent.
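As a minimal illustration of the linear regression example above, the sketch below fits a line by minimizing MSE with plain gradient descent (the data, learning rate, and iteration count are my own choices for demonstration):

```python
# Sketch: fit y = w*x + b by minimizing mean squared error with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)  # ground truth: w = 2, b = 1, plus noise

w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    err = (w * x + b) - y          # y_hat minus y
    # Gradients of MSE = mean(err**2) with respect to w and b
    w -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(w, b)  # close to the true values 2.0 and 1.0
```
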
In this article, I review a paper by Jonathan T. Barron of Google Research titled A General and Adaptive Robust Loss Function.
Abstract of the paper:
We present a generalization of the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By introducing robustness as a continuous parameter, our loss function allows algorithms built around robust loss minimization to be generalized, which improves performance on basic vision tasks such as registration and clustering. Interpreting our loss as the negative log of a univariate density yields a general probability distribution that includes normal and Cauchy distributions as special cases. This probabilistic interpretation enables the training of neural networks in which the robustness of the loss automatically adapts itself during training, which improves performance on learning-based tasks such as generative image synthesis and unsupervised monocular depth estimation, without requiring any manual parameter tuning.
Before discussing this customized loss function, let's briefly review the fundamental loss functions in machine learning.
There are three commonly used regression loss functions: Squared Error Loss, Absolute Error Loss, and Huber Loss.
The Squared Error Loss function is a positive quadratic function that heavily penalizes large errors. It is also known as L2 Loss.
L = (y - f(x))²
It is not the best choice when the data contains many outliers, because the squared error grows rapidly as the error (predicted y minus actual y) grows. The corresponding cost function for Squared Error Loss is the Mean Squared Error (MSE).
The Absolute Error Loss function is a linear function. It is also known as L1 Loss.
L = |y - f(x)|
The corresponding cost function for Absolute Error Loss is the Mean Absolute Error (MAE). The MAE cost is more resilient to outliers than MSE. The challenge with MAE is that the absolute value is not differentiable at zero, which complicates optimization.
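A quick numeric check makes the robustness difference concrete: corrupting a single target with an outlier inflates MSE far more than MAE (the data below is invented for illustration):

```python
# Compare how MSE and MAE react to a single outlier.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 4.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

# Corrupt one target with an outlier and recompute both costs
y_out = y_true.copy()
y_out[-1] = 40.0
mse_out = np.mean((y_out - y_pred) ** 2)
mae_out = np.mean(np.abs(y_out - y_pred))

print(mse_out / mse, mae_out / mae)  # MSE blows up by a much larger factor
```
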
Huber Loss is a combination of MSE and MAE: it is quadratic for small errors and linear for large errors. Its parameter is delta.
If |y - f(x)| ≤ delta: L(delta) = 1/2 (y - f(x))²
If |y - f(x)| > delta: L(delta) = delta * |y - f(x)| - 1/2 delta²
Huber Loss is more robust than MSE for outliers.
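The piecewise definition above translates directly into code; here is a minimal sketch (the function name and default delta are my own):

```python
# Minimal Huber loss following the piecewise definition above.
import numpy as np

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    quad = 0.5 * r ** 2                 # |error| <= delta: quadratic (MSE-like)
    lin = delta * r - 0.5 * delta ** 2  # |error| >  delta: linear (MAE-like)
    return np.where(r <= delta, quad, lin)

print(huber(0.0, 0.5))  # 0.125, quadratic region
print(huber(0.0, 3.0))  # 2.5, linear region: large errors grow only linearly
```
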
There are also two binary classification loss functions: Binary Cross Entropy Loss, and Hinge Loss.
The Binary Cross Entropy Loss function is also called log loss. In decision trees, entropy is introduced as an indicator of disorder or uncertainty: high entropy reflects disordered data with little information content, and hence less predictive power. The Binary Cross Entropy Loss function therefore minimizes the entropy in a binary classification setting.
L = -y * log(p) - (1 - y) * log(1 - p)
y = 0: L = -log(1 - p)
y = 1: L = -log(p)
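The formula above can be written as a small function; clipping the predicted probability away from 0 and 1 (a standard numerical precaution, not part of the formula itself) avoids log(0):

```python
# Binary cross-entropy for a single prediction, matching the formula above.
import numpy as np

def bce(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)  # keep p away from 0 and 1 to avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(bce(1, 0.9))  # small loss: confident and correct
print(bce(1, 0.1))  # large loss: confident but wrong
```
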
Hinge Loss is primarily used in Support Vector Machine (SVM) classifiers with class labels -1 and 1.
L = max(0, 1-y*f(x))
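A short sketch of this formula, where f(x) is the raw classifier score (example scores chosen for illustration):

```python
# Hinge loss with labels in {-1, +1}; score is the raw classifier output f(x).
import numpy as np

def hinge(y, score):
    return np.maximum(0.0, 1.0 - y * score)

print(hinge(1, 2.0))   # 0.0: correct and outside the margin
print(hinge(1, 0.3))   # 0.7: correct side but inside the margin
print(hinge(-1, 0.3))  # 1.3: wrong side of the decision boundary
```
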
There are also two multi-class classification loss functions: Multi-class Cross Entropy Loss and KL-Divergence.
Multi-class Cross Entropy is a generalization of Binary Cross Entropy Loss function.
KL-Divergence measures how one probability distribution differs from another. A KL-Divergence of zero indicates that the two distributions are identical.
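A minimal sketch for discrete distributions, illustrating that KL-Divergence is zero only when the distributions match (the small epsilon is my own guard against log of zero):

```python
# KL divergence between two discrete distributions; zero iff they are identical.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = [0.5, 0.5]
print(kl_divergence(p, p))           # ~0.0: identical distributions
print(kl_divergence(p, [0.9, 0.1]))  # > 0: the distributions differ
```
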
Beyond these seven fundamental loss functions for regression, binary classification, and multi-class classification, another crucial concept to review is robust statistics, for example robust regression. Ordinary Least Squares estimates for regression models are highly sensitive to outliers and heteroscedastic errors; robust regression is applied to handle these challenges. Methods for robust regression include least squares alternatives, parametric alternatives, and unit weights (https://en.wikipedia.org/wiki/Robust_regression).
Based on these basic functions, scientists customize their own loss functions to better optimize their machine learning algorithms. In Barron's paper, a general, adaptive, and robust loss function is presented, and example applications demonstrate its advantages.
The loss function in Barron's paper is:
ρ(x, α, c) = (|α − 2| / α) · [((x/c)² / |α − 2| + 1)^(α/2) − 1]
Here α ∈ R is a shape parameter that controls the robustness of the loss, and c > 0 is a scale parameter that controls the size of the loss's quadratic bowl near x = 0.
When α = 2, the function is undefined, but it approaches L2 (squared error) loss function in the limit.
When α = 1 our loss is a smoothed form of L1 loss.
This smoothed L1 form is often referred to as Charbonnier loss, pseudo-Huber loss, or L1-L2 loss, as it behaves like L2 loss near the origin and like L1 loss elsewhere. Recall the Huber loss reviewed above, which is likewise a combination of L1 and L2.
When α approaches 0, it yields Cauchy (aka Lorentzian) loss.
When α = −2, our loss reproduces Geman-McClure loss.
In the limit as α approaches negative infinity, the loss becomes Welsch (aka Leclerc) loss.
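The special cases above can be verified numerically. The sketch below implements the general loss with the removable singularities at α = 0 and α = 2 handled explicitly; the function and variable names are my own, not the paper's:

```python
# Sketch of the general robust loss, with limit cases handled explicitly.
import numpy as np

def general_loss(x, alpha, c=1.0):
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z                    # limit case: L2 (squared error)
    if alpha == 0.0:
        return np.log(0.5 * z + 1.0)      # limit case: Cauchy / Lorentzian
    if alpha == -np.inf:
        return 1.0 - np.exp(-0.5 * z)     # limit case: Welsch / Leclerc
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# alpha = -2 reproduces Geman-McClure loss: 2*(x/c)^2 / ((x/c)^2 + 4)
x = 3.0
print(general_loss(x, -2.0))  # 2*9/(9+4) = 18/13
```
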
By setting α to different values or limits, this loss function can be used under different conditions; this is why it is general and adaptive. The loss function is well-suited to gradient-based optimization and convenient for graduated non-convexity: users can initialize α such that the loss is convex and then gradually reduce α (and therefore reduce convexity and increase robustness) during optimization, thereby enabling robust estimation that (often) avoids local minima (page 3).
There are a few advanced features of this general adaptive robust loss:
- It is smooth. Therefore, it is suited for general gradient-descent optimization.
- It increases monotonically with respect to α. This is convenient for graduated non-convexity: the user can initialize α such that the loss is convex and then gradually reduce α (and therefore reduce convexity and increase robustness) during optimization, thereby enabling robust estimation that (often) avoids local minima.
The author also discussed using the loss function to construct a general probability distribution, such that the negative log-likelihood (NLL) of its PDF is a shifted version of this loss function.
Just as different α values lead to different loss functions, the distribution includes several common distributions as special cases: when α = 2 it is a normal (Gaussian) distribution, and when α = 0 it is a Cauchy distribution.
In the following section, the author provides experiments using the NLL as the loss to train neural network models, in which the general adaptive robust loss function performs well.
The citation below explains why the NLL loss was used in the experiments. In short, treating α as a free parameter lets the loss function adapt its robustness to outliers.
“Critically, using the NLL allows us to treat α as a free parameter, thereby allowing optimization to automatically determine the degree of robustness that should be imposed by the loss being used during training. To understand why the NLL must be used for this, consider a training procedure in which we simply minimize ρ (·, α, c) with respect to α and our model weights. In this scenario, the monotonicity of our general loss with respect to α (Eq. 12) means that optimization can trivially minimize the cost of outliers by setting α to be as small as possible. Now consider that same training procedure in which we minimize the NLL of our distribution instead of our loss. As can be observed in Figure 2, reducing α will decrease the NLL of outliers but will increase the NLL of inliers. During training, optimization will have to choose between reducing α, thereby getting “discount” on large errors at the cost of paying a penalty for small errors, or increasing α, thereby incurring a higher cost for outliers but a lower cost for inliers. This tradeoff forces optimization to judiciously adapt the robustness of the NLL being minimized. ”
References:
@misc{barron2017general,
title={A General and Adaptive Robust Loss Function},
author={Jonathan T. Barron},
year={2017},
eprint={1701.03077},
archivePrefix={arXiv},
primaryClass={cs.CV}
}