RMSprop: An Understanding In 3 Easy Points

Introduction

RMSprop is an unpublished optimization algorithm for neural networks, credited to Geoff Hinton. It is an adaptive learning rate method: it adapts the step size for each weight as training progresses. It can be seen as an adaptation of the rprop algorithm for mini-batch learning, which is the problem that originally prompted its development. It is also similar to Adagrad, but it replaces Adagrad's ever-growing sum of squared gradients with a moving average, so its learning rates do not keep diminishing. RMSprop is also closely related to the Adam optimizer, and both are widely used in deep learning, neural networks and other artificial intelligence applications.

  1. RPROP
  2. Rprop to RMSprop
  3. Similarity with Adagrad

1. RPROP

Rprop has many versions, so consider a simple version used for full-batch optimization. Rprop was designed to deal with gradients that vary widely in magnitude: some gradients are huge while others are tiny, which makes it hard to choose a single global learning rate. Rprop uses only the sign of the gradient, so the size of a weight update does not depend on how large or small the gradient is. This helps the algorithm deal with tiny gradients, plateaus, saddle points and so on.

Simply increasing the learning rate is not an option, because the steps taken for weights with large gradients would then grow until the optimization diverges. Rprop instead combines the sign of the gradient with a step size maintained individually for each weight. Rather than using the gradient's magnitude, it uses the weight's own step size, which adapts over time, so learning can be accelerated in directions where progress is consistent.

In adjusting the step size for a given weight, the algorithm proceeds as follows (a minimal code sketch is given after the list):

  1. Look at the signs of the last two gradients computed for the weight.
  2. When both signs are the same, the weight is still moving in the same direction, so to accelerate it a little the step size is increased multiplicatively by a factor such as 1.2. When the signs differ, the last step was too large and jumped over a local minimum, so the step size is decreased multiplicatively by a factor of 0.5 to slow the algorithm down.
  3. The step size is then clipped to lie between two limits, which depend on the dataset and application. Typical default limits are 50 at the upper end and one millionth (1e-6) at the lower end.
  4. The weight update is then applied, moving each weight by its own step size in the direction indicated by the gradient's sign.
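As a rough illustration, here is a minimal NumPy sketch of that per-weight rule, using the factors (1.2, 0.5) and step-size limits (1e-6, 50) from the list above. The function and variable names are only illustrative, not taken from any particular library; prev_grad is the gradient from the previous full-batch pass and step holds a separate step size for every weight.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 inc=1.2, dec=0.5, step_min=1e-6, step_max=50.0):
    """One full-batch rprop step: only the sign of the gradient is used."""
    sign_change = np.sign(grad) * np.sign(prev_grad)

    # Same sign as last time: still heading the same way, so accelerate
    # a little by growing the step multiplicatively (factor 1.2).
    step = np.where(sign_change > 0, step * inc, step)
    # Sign flipped: the last step overshot a local minimum, so shrink
    # the step multiplicatively (factor 0.5) to slow down.
    step = np.where(sign_change < 0, step * dec, step)
    # Keep each step size between the two limits mentioned in the text.
    step = np.clip(step, step_min, step_max)

    # Update each weight by its own step size, moving against the gradient.
    return w - np.sign(grad) * step, step
```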

2. Rprop to RMSprop

Rprop does not work well when the weights are updated from mini-batches of a large dataset, because it violates the central idea behind stochastic gradient descent: with a small learning rate, the gradients over successive mini-batches should effectively average out. Suppose a weight gets a gradient of 0.1 on nine mini-batches and a gradient of -0.9 on the tenth. With gradient descent these updates roughly cancel, so the weight stays approximately where it is. With rprop, however, the weight is incremented nine times and decremented only once, so it grows much larger.
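The toy numbers below are the ones from the paragraph above; the step size of 0.01 is arbitrary and only for illustration of why the sign-based update drifts while averaged gradients roughly cancel.

```python
# Toy numbers from the paragraph above: nine mini-batch gradients of 0.1
# for one weight, and one mini-batch gradient of -0.9.
grads = [0.1] * 9 + [-0.9]

# Stochastic gradient descent effectively averages the gradients, so over
# the ten mini-batches the updates cancel out (up to rounding error):
print(sum(grads))                  # approximately 0 -> no net movement

# A sign-based update of fixed step size, as in rprop, steps nine times
# one way and only once the other way, so the weight drifts:
step, w = 0.01, 0.0
for g in grads:
    w += step if g > 0 else -step  # direction convention matches the text
print(w)                           # roughly 0.08 -> net drift after one pass
```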

Rprop is effectively equivalent to dividing the gradient by its own magnitude, so that every update has the same size no matter how large or small that gradient is. With mini-batches, the trouble is that this divides by a different number on every batch. RMSprop instead keeps, for each weight, a moving average of the squared gradients and divides the gradient by the square root of this mean square; the root mean square (RMS) is what gives RMSprop its name. The update equations of the RMSprop optimizer are

E[g²]_t = β · E[g²]_{t−1} + (1 − β) · (∂C/∂w)²

w_t = w_{t−1} − (η / √E[g²]_t) · ∂C/∂w

where E[g²] is the moving average of the squared gradients, ∂C/∂w is the gradient of the cost function with respect to the weight, η is the learning rate, and β is the moving-average parameter, whose default value is 0.9.
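A minimal NumPy sketch of this update is shown below. The small epsilon added to the denominator is a common convention for numerical stability and is an assumption here, not something stated in the text; the names are illustrative only.

```python
import numpy as np

def rmsprop_update(w, grad, avg_sq_grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step for a weight vector w with gradient grad."""
    # E[g^2]_t: moving average of the squared gradients.
    avg_sq_grad = beta * avg_sq_grad + (1.0 - beta) * grad ** 2
    # Divide the gradient by the root of the mean square (the "RMS").
    return w - lr * grad / (np.sqrt(avg_sq_grad) + eps), avg_sq_grad
```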

3. Similarity with Adagrad

Adagrad is very similar to RMSprop; both are adaptive learning rate algorithms. Adagrad scales the gradient element-wise according to the historical sum of squares in each dimension: it keeps a running sum of the squared gradients and divides the learning rate by the square root of that sum.
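For comparison, here is a minimal sketch of the Adagrad rule just described, in the same style as the RMSprop sketch above; again, the epsilon term and the names are conventions assumed for the illustration.

```python
import numpy as np

def adagrad_update(w, grad, sum_sq_grad, lr=0.01, eps=1e-8):
    """One Adagrad step: the denominator is a running sum, not a moving average."""
    # Accumulate the element-wise sum of squared gradients over all steps so far.
    sum_sq_grad = sum_sq_grad + grad ** 2
    # Each dimension gets its own effective learning rate, lr / sqrt(sum).
    return w - lr * grad / (np.sqrt(sum_sq_grad) + eps), sum_sq_grad
```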

What does this scaling do when the problem has a high condition number? If one coordinate has consistently large gradients and another has small ones, the small-gradient coordinate is divided by a small number, which accelerates movement along that direction, while the large-gradient coordinate is divided by a large number, which damps movement along it.

What happens over the course of training? With Adagrad, the accumulated sum of squared gradients keeps growing, so the updates are divided by ever larger numbers and the steps keep shrinking. For convex optimization this is desirable, since one wants to slow down as the minimum is approached. For non-convex optimization, however, learning can grind to a halt near a saddle point. RMSprop addresses this by using a moving-average estimate of the squared gradients instead of accumulating them over the whole of training.
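The difference shows up over long training runs. The toy loop below uses a constant gradient of 1.0 (a purely illustrative assumption) to show Adagrad's denominator growing without bound, and hence its steps shrinking, while RMSprop's moving average levels off.

```python
grad = 1.0                      # pretend the gradient stays at 1.0 every step
sum_sq, avg_sq, beta = 0.0, 0.0, 0.9

for _ in range(1000):
    sum_sq += grad ** 2                              # Adagrad: keeps growing
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2  # RMSprop: settles near 1.0

# The effective step scale is proportional to 1 / sqrt(denominator):
print(1 / sum_sq ** 0.5)   # ~0.03 and still shrinking -> the steps die out
print(1 / avg_sq ** 0.5)   # ~1.0 and stable -> the steps keep a useful size
```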

Conclusion

RMSprop is a popular and fast optimization algorithm. Since there are many variants of this unpublished algorithm, it is worth consulting resources such as Andrej Karpathy's "A Peek at Trends in Machine Learning" to see how widely RMSprop and related optimizers are used in deep learning. To study the optimizer in more depth, including the RMSprop optimizer in TensorFlow, good starting points are fast.ai, Sebastian Ruder's blog and the second course of Andrew Ng's Deep Learning specialization on Coursera. In short, RMSprop is an update of rprop adapted for mini-batch learning, and it is closely related to Adagrad and to the Adam optimizer.
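As a starting point for experiments, TensorFlow ships RMSprop in its Keras API. The sketch below is only a minimal usage example: the model architecture, input shape and loss are placeholders, and the hyperparameter values shown are just the common defaults.

```python
import tensorflow as tf

# rho is TensorFlow's name for the moving-average parameter (the beta of 0.9 above).
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, epsilon=1e-7)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=optimizer, loss="mse")
# model.fit(x_train, y_train, epochs=10)  # supply your own data here
```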

There are no right or wrong ways of learning AI and ML technologies – the more, the better! These valuable resources can be the starting point for your journey into Artificial Intelligence and Machine Learning. Does pursuing AI and ML interest you? If you want to step into the world of emerging tech, you can accelerate your career with the Machine Learning and AI courses by Jigsaw Academy.
