A common problem with machine learning models is that they fit the training data well but fail to predict test data, because they have learned the noise (irrelevant variation) in the training set along with the real patterns. Such a model suffers from overfitting: since it has learned noise and patterns specific to the training data, it fails to generalize to unseen data. Regularization is a standard remedy.
Regularization, in machine learning terms, means making things regular or acceptable. The process shrinks the model's coefficients toward zero. In other words, regularization discourages the model from becoming overly complex or flexible, which reduces the risk of overfitting.
The basic concept of regularization in machine learning is to assign larger loss values to more complex models by adding a complexity term to the loss function.
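As a rough sketch, the idea of adding a complexity term can be written in one line. The symbol Ω for the complexity term is notation assumed here for illustration; α is the tuning parameter that appears in the lasso and ridge formulas later in the article.

```latex
\text{regularized loss}(W) \;=\; \text{loss}(W) \;+\; \alpha \,\Omega(W), \qquad \alpha \ge 0
```

With Ω(W) taken as the sum of absolute weights this gives the lasso penalty, and with the sum of squared weights it gives the ridge penalty.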
Take the simple linear regression relationship below.
$$Y \approx W_0 + W_1 X_1 + W_2 X_2 + \cdots + W_P X_P$$
Here, $Y$ is the value to be predicted, $X_1, \dots, X_P$ are the features that determine it, $W_1, \dots, W_P$ are the weights of those features, and $W_0$ is the bias.
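As a minimal sketch, the relationship above can be computed for a single observation as follows. The function name `predict` and the numbers are hypothetical, chosen only for illustration.

```python
def predict(w0, weights, features):
    """Return W_0 + W_1*X_1 + ... + W_P*X_P for one observation."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return w0 + sum(w * x for w, x in zip(weights, features))


# Hypothetical values: bias 1.0, two weights, two feature values.
y_hat = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(y_hat)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```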
To fit this regression model, one needs a loss function that measures how well a given set of weights and bias predicts the Y values. Linear regression uses the residual sum of squares (RSS) as its loss function, which can be written as
$$\text{RSS} = \sum_{j=1}^{m} \left( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \right)^2$$

Here ∑ denotes summation, taken over the m training observations and the n features. This is also called the objective of linear regression without regularization.
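The RSS can be sketched in a few lines of plain Python. The helper names `predict` and `rss` and the tiny dataset are assumptions made for illustration, not part of any library.

```python
def predict(w0, weights, features):
    """Linear prediction W_0 + sum_i W_i * X_i for one observation."""
    return w0 + sum(w * x for w, x in zip(weights, features))


def rss(w0, weights, X, y):
    """Residual sum of squares over all observations (rows of X)."""
    return sum((y_j - predict(w0, weights, x_j)) ** 2 for x_j, y_j in zip(X, y))


# Hypothetical data: three observations with one feature each.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 7.0]
print(rss(0.0, [2.0], X, y))  # residuals are 0, 0, 1, so the RSS is 1.0
```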
The algorithm learns from the training dataset by adjusting the coefficients (weights) to minimize the loss function. On noisy datasets this can lead to overfitting, and the estimated coefficients do not generalize to unseen data. Regularization penalizes large coefficients and pushes the learned estimates toward zero.
Two common regularization techniques in machine learning are Lasso and Ridge regression, which differ in how they penalize the coefficients: Lasso uses L1 regularization and Ridge uses L2 regularization.
L1 regularization modifies the RSS by adding a shrinkage quantity, or penalty, equal to the sum of the absolute values of the coefficients. The coefficients are then estimated by minimizing the modified loss function below.
$$\sum_{j=1}^{m} \left( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \right)^2 + \alpha \sum_{i=1}^{n} |W_i| = \text{RSS} + \alpha \sum_{i=1}^{n} |W_i|$$
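A minimal sketch of this modified loss in plain Python follows. The names `rss` and `lasso_loss` and the data are hypothetical. As in the formula, the bias W_0 is not penalized.

```python
def rss(w0, weights, X, y):
    """Residual sum of squares for a linear model with bias w0."""
    total = 0.0
    for x_j, y_j in zip(X, y):
        y_hat = w0 + sum(w * x for w, x in zip(weights, x_j))
        total += (y_j - y_hat) ** 2
    return total


def lasso_loss(w0, weights, X, y, alpha):
    """RSS plus alpha times the sum of absolute weights (L1 penalty)."""
    penalty = alpha * sum(abs(w) for w in weights)
    return rss(w0, weights, X, y) + penalty


# Hypothetical data that the weights [1.0, 2.0] fit exactly (RSS = 0).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
print(lasso_loss(0.0, [1.0, 2.0], X, y, alpha=0.5))  # 0 + 0.5 * (1 + 2) = 1.5
```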
Lasso regression therefore penalizes the absolute values of the coefficients, and in this it differs from ridge regression, which penalizes their squares. The optimization algorithm now inflicts a penalty on high coefficients; this penalty is called the L1 norm. The value of α plays the same role as the tuning parameter in ridge regression: it is a tradeoff parameter that balances the RSS against the magnitude of the coefficients.
If α=0, we have simple linear regression. As α→∞, the lasso regression coefficients are driven to zero. If 0<α<∞, the coefficients are shrunk: their overall magnitude lies between zero and that of the simple linear regression solution. This appears very similar to ridge regression, but let's look at both techniques from a different perspective. Ridge regression can be viewed as requiring the sum of squares of the weights to be less than or equal to some value s, so for two parameters the constraint is $W_1^2 + W_2^2 \le s$.
Hence, the ridge regression coefficients are the ones with the smallest loss among all points within the circle given by this constraint. In lasso regression, the sum of the absolute values of the weights is less than or equal to s, which for two parameters gives the constraint $|W_1| + |W_2| \le s$.
The lasso regression coefficients are the ones with the smallest loss among all points within the diamond given by this constraint.
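The effect of α on a lasso coefficient can be illustrated with a deliberately simplified case: one feature and no bias term, where the minimizer of the L1-penalized loss has a closed form (the soft-thresholding rule). The function name and data below are hypothetical, and this is a sketch under those assumptions rather than a general lasso solver.

```python
def lasso_one_feature(x, y, alpha):
    """Minimize sum((y_j - w*x_j)**2) + alpha*|w| for a single weight w.

    Setting the (sub)derivative to zero gives the soft-thresholding rule
    w = sign(rho) * max(|rho| - alpha/2, 0) / z,
    where rho = sum(x_j*y_j) and z = sum(x_j**2).
    """
    rho = sum(x_j * y_j for x_j, y_j in zip(x, y))
    z = sum(x_j ** 2 for x_j in x)
    shrunk = max(abs(rho) - alpha / 2.0, 0.0)
    sign = 1.0 if rho >= 0 else -1.0
    return sign * shrunk / z


# Hypothetical data generated by y = 2x, so the unpenalized weight is 2.0.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for alpha in (0.0, 10.0, 28.0, 100.0):
    print(alpha, lasso_one_feature(x, y, alpha))
```

In this example the weight starts at the simple linear regression value for α=0, shrinks as α grows, and becomes exactly zero once α is large enough. This ability to set coefficients exactly to zero is the practical difference from ridge regression.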
Ridge regression uses L2 regularization: it modifies the RSS by adding a penalty (shrinkage quantity) equal to the sum of the squares of the coefficients, giving the modified loss function below.
$$\sum_{j=1}^{m} \left( Y_j - W_0 - \sum_{i=1}^{n} W_i X_{ji} \right)^2 + \alpha \sum_{i=1}^{n} W_i^2 = \text{RSS} + \alpha \sum_{i=1}^{n} W_i^2$$
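A minimal sketch of the ridge loss in plain Python follows. The names `rss` and `ridge_loss` and the data are hypothetical. As in the formula, the bias W_0 is not penalized.

```python
def rss(w0, weights, X, y):
    """Residual sum of squares for a linear model with bias w0."""
    total = 0.0
    for x_j, y_j in zip(X, y):
        y_hat = w0 + sum(w * x for w, x in zip(weights, x_j))
        total += (y_j - y_hat) ** 2
    return total


def ridge_loss(w0, weights, X, y, alpha):
    """RSS plus alpha times the sum of squared weights (L2 penalty)."""
    penalty = alpha * sum(w ** 2 for w in weights)
    return rss(w0, weights, X, y) + penalty


# Hypothetical data that the weights [1.0, 2.0] fit exactly (RSS = 0).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
print(ridge_loss(0.0, [1.0, 2.0], X, y, alpha=0.5))  # 0 + 0.5 * (1 + 4) = 2.5
```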
Here, α is the tuning parameter for the shrinkage quantity. Its value decides how strongly the model is penalized, that is, how much emphasis is placed on the penalty relative to the RSS.
If α=0, the penalty is zero, and we have simple linear regression. As α→∞, the ridge regression coefficients tend to zero: the penalty dominates the modified loss function, the RSS term is effectively ignored, and minimizing the sum of squared coefficients pushes them toward zero. If 0<α<∞, the coefficients are shrunk: their overall magnitude lies between zero and that of the simple linear regression solution. Hence selecting the value of α is critical. The penalty used in ridge regression is the squared L2 norm of the weights, which is why the technique is called L2 regularization.
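The three cases for α can be checked on a deliberately simplified case: one feature and no bias term, where the ridge solution has a simple closed form. The function name and data below are hypothetical, and this is a sketch under those assumptions rather than a general ridge solver.

```python
def ridge_one_feature(x, y, alpha):
    """Minimize sum((y_j - w*x_j)**2) + alpha*w**2 for a single weight w.

    Setting the derivative to zero gives
    w = sum(x_j*y_j) / (sum(x_j**2) + alpha).
    """
    rho = sum(x_j * y_j for x_j, y_j in zip(x, y))
    z = sum(x_j ** 2 for x_j in x)
    return rho / (z + alpha)


# Hypothetical data generated by y = 2x, so the unpenalized weight is 2.0.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
for alpha in (0.0, 14.0, 126.0, 1e9):
    print(alpha, ridge_one_feature(x, y, alpha))
```

In this example the weight equals the simple linear regression value for α=0 and keeps shrinking as α grows. Even for a very large α it stays slightly above zero rather than becoming exactly zero.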
To improve model performance, overfitting can be prevented using the two types of regularization discussed above: L1 (Lasso) and L2 (Ridge) regularization in regression modeling.