TensorFlow Addons Optimizers: CyclicalLearningRate

Overview

This tutorial demonstrates the use of Cyclical Learning Rate from the Addons package.

Cyclical Learning Rates

It has been shown that it is beneficial to adjust the learning rate as training progresses for a neural network. It has manifold benefits ranging from saddle point recovery to preventing numerical instabilities that may arise during backpropagation. But how does one know how much to adjust at a particular point in training? In 2015, Leslie Smith noticed that you would want to increase the learning rate to traverse faster across the loss landscape, but you would also want to reduce the learning rate when approaching convergence. To realize this idea, he proposed Cyclical Learning Rates (CLR), where you adjust the learning rate with respect to the cycles of a function. For a visual demonstration, you can check out this blog. CLR is now available as a TensorFlow API. For more details, check out the original paper here.
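
To make the idea concrete, here is a minimal NumPy sketch of the CLR rule from the paper, using the triangular2 policy that this tutorial adopts later. The function name clr_triangular2 is illustrative and not part of any API.

import numpy as np

def clr_triangular2(step, initial_lr, maximal_lr, step_size):
    # A full cycle spans 2 * step_size steps; `cycle` is the 1-based cycle index.
    cycle = np.floor(1 + step / (2 * step_size))
    # `x` sweeps 1 -> 0 -> 1 within a cycle (0 at the peak of the triangle).
    x = np.abs(step / step_size - 2 * cycle + 1)
    # Interpolate between the bounds and halve the amplitude after every cycle.
    return initial_lr + (maximal_lr - initial_lr) * np.maximum(0, 1 - x) / (2. ** (cycle - 1))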

Setup

pip install -q -U tensorflow_addons
from tensorflow.keras import layers
import tensorflow_addons as tfa
import tensorflow as tf

import numpy as np
import matplotlib.pyplot as plt

tf.random.set_seed(42)
np.random.seed(42)

Load and prepare dataset

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step

Define hyperparameters

BATCH_SIZE = 64
EPOCHS = 10
INIT_LR = 1e-4
MAX_LR = 1e-2

Define model building and model training utilities

def get_training_model():
    model = tf.keras.Sequential(
        [
            layers.Input((28, 28, 1)),
            layers.experimental.preprocessing.Rescaling(scale=1./255),
            layers.Conv2D(16, (5, 5), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(32, (5, 5), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.SpatialDropout2D(0.2),
            layers.GlobalAvgPool2D(),
            layers.Dense(128, activation="relu"),
            layers.Dense(10, activation="softmax"),
        ]
    )
    return model

def train_model(model, optimizer):
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
                       metrics=["accuracy"])
    history = model.fit(x_train,
        y_train,
        batch_size=BATCH_SIZE,
        validation_data=(x_test, y_test),
        epochs=EPOCHS)
    return history

In the interest of reproducibility, the initial model weights are serialized, and you will use them to conduct the experiments below.

initial_model = get_training_model()
initial_model.save("initial_model")
WARNING:tensorflow:Please add `keras.layers.InputLayer` instead of `keras.Input` to Sequential model. `keras.Input` is intended to be used by Functional model.
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
INFO:tensorflow:Assets written to: initial_model/assets

Train a model without CLR

standard_model = tf.keras.models.load_model("initial_model")
no_clr_history = train_model(standard_model, optimizer="sgd")
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Epoch 1/10
938/938 [==============================] - 11s 12ms/step - loss: 2.2091 - accuracy: 0.2183 - val_loss: 1.7595 - val_accuracy: 0.4123
Epoch 2/10
938/938 [==============================] - 10s 11ms/step - loss: 1.2961 - accuracy: 0.5130 - val_loss: 0.9587 - val_accuracy: 0.6489
Epoch 3/10
938/938 [==============================] - 10s 11ms/step - loss: 1.0104 - accuracy: 0.6182 - val_loss: 0.9155 - val_accuracy: 0.6576
Epoch 4/10
938/938 [==============================] - 10s 11ms/step - loss: 0.9276 - accuracy: 0.6570 - val_loss: 0.8495 - val_accuracy: 0.7016
Epoch 5/10
938/938 [==============================] - 10s 11ms/step - loss: 0.8856 - accuracy: 0.6719 - val_loss: 0.8399 - val_accuracy: 0.6664
Epoch 6/10
938/938 [==============================] - 10s 11ms/step - loss: 0.8482 - accuracy: 0.6850 - val_loss: 0.7982 - val_accuracy: 0.6818
Epoch 7/10
938/938 [==============================] - 10s 11ms/step - loss: 0.8219 - accuracy: 0.6941 - val_loss: 0.7609 - val_accuracy: 0.7008
Epoch 8/10
938/938 [==============================] - 10s 11ms/step - loss: 0.7996 - accuracy: 0.7011 - val_loss: 0.7267 - val_accuracy: 0.7271
Epoch 9/10
938/938 [==============================] - 10s 11ms/step - loss: 0.7833 - accuracy: 0.7064 - val_loss: 0.7157 - val_accuracy: 0.7450
Epoch 10/10
938/938 [==============================] - 10s 11ms/step - loss: 0.7640 - accuracy: 0.7135 - val_loss: 0.7017 - val_accuracy: 0.7465

Define CLR schedule

The tfa.optimizers.CyclicalLearningRate module returns a direct schedule that can be passed to an optimizer. The schedule takes a step as its input and outputs a value calculated using the CLR formula as laid out in the paper.

steps_per_epoch = len(x_train) // BATCH_SIZE
clr = tfa.optimizers.CyclicalLearningRate(initial_learning_rate=INIT_LR,
    maximal_learning_rate=MAX_LR,
    scale_fn=lambda x: 1/(2.**(x-1)),
    step_size=2 * steps_per_epoch
)
optimizer = tf.keras.optimizers.SGD(clr)

Here, you specify the lower and upper bounds of the learning rate and the schedule will oscillate within that range ([1e-4, 1e-2] in this case). scale_fn is used to define the function that scales the learning rate up and down within a given cycle. step_size defines the duration of half a cycle: a step_size of 2 means you need a total of 4 iterations to complete one cycle. The recommended value for step_size is as follows:

factor * steps_per_epoch, where factor lies within the [2, 8] range.
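
As a quick sanity check (not part of the original notebook), you can call the schedule directly at a few steps to confirm it moves between the two bounds; this assumes eager execution so the returned scalars can be converted with float().

print(float(clr(0)))                    # start of cycle 1: the initial learning rate (1e-4)
print(float(clr(steps_per_epoch)))      # halfway up the first ramp
print(float(clr(2 * steps_per_epoch)))  # top of cycle 1: the maximal learning rate (1e-2)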

In the same CLR paper, Leslie also presented a simple and elegant method to choose the bounds for the learning rate, often referred to as the LR range test. You are encouraged to check it out as well. This blog post provides a nice introduction to the method, and a rough sketch of the idea follows below.
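
For illustration only, here is one way you could sketch such a range test with a custom Keras callback that exponentially increases the learning rate after every batch and records the loss. LRRangeTest and its default arguments are hypothetical, not a tfa or Keras API.

class LRRangeTest(tf.keras.callbacks.Callback):
    """Hypothetical sketch of an LR range test: ramp the LR up each batch and log the loss."""

    def __init__(self, start_lr=1e-6, end_lr=1e-1, num_steps=938):
        super().__init__()
        self.start_lr = start_lr
        # Multiplicative factor applied after every batch so the LR reaches end_lr in num_steps.
        self.factor = (end_lr / start_lr) ** (1.0 / num_steps)
        self.lrs, self.losses = [], []

    def on_train_begin(self, logs=None):
        tf.keras.backend.set_value(self.model.optimizer.lr, self.start_lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = float(tf.keras.backend.get_value(self.model.optimizer.lr))
        self.lrs.append(lr)
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(self.model.optimizer.lr, lr * self.factor)

You would train for a single epoch with this callback attached and then plot the recorded losses against the learning rates (on a log scale): a reasonable lower bound is where the loss starts to fall steadily, and a reasonable upper bound is just before the loss blows up.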

Below, you can visualize what the clr schedule looks like.

step = np.arange(0, EPOCHS * steps_per_epoch)
lr = clr(step)
plt.plot(step, lr)
plt.xlabel("Steps")
plt.ylabel("Learning Rate")
plt.show()

[Plot: the CLR schedule over 10 epochs of training steps (x-axis: Steps, y-axis: Learning Rate)]

In order to better visualize the effect of CLR, you can plot the schedule with an increased number of steps.

step = np.arange(0, 100 * steps_per_epoch)
lr = clr(step)
plt.plot(step, lr)
plt.xlabel("Steps")
plt.ylabel("Learning Rate")
plt.show()

[Plot: the CLR schedule over 100 epochs of training steps (x-axis: Steps, y-axis: Learning Rate)]

The function you are using in this tutorial is referred to as the triangular2 method in the CLR paper. Two other functions were also explored there, namely triangular and exp (short for exponential).
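
If you want to experiment with those variants, one option is to reuse the same CyclicalLearningRate class with a different scale_fn, as sketched below. The gamma value is an illustrative choice, and passing scale_mode="iterations" is assumed here to apply the scale function per step rather than per cycle (the default is "cycle").

# triangular: constant amplitude, no decay across cycles.
triangular_clr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=INIT_LR,
    maximal_learning_rate=MAX_LR,
    scale_fn=lambda x: 1.0,
    step_size=2 * steps_per_epoch,
)

# exp: the amplitude decays exponentially with the iteration count.
gamma = 0.9994  # illustrative decay rate
exp_clr = tfa.optimizers.CyclicalLearningRate(
    initial_learning_rate=INIT_LR,
    maximal_learning_rate=MAX_LR,
    scale_fn=lambda x: gamma ** x,
    scale_mode="iterations",
    step_size=2 * steps_per_epoch,
)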

Train a model with CLR

clr_model = tf.keras.models.load_model("initial_model")
clr_history = train_model(clr_model, optimizer=optimizer)
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
Epoch 1/10
938/938 [==============================] - 11s 11ms/step - loss: 2.3005 - accuracy: 0.1165 - val_loss: 2.2852 - val_accuracy: 0.2377
Epoch 2/10
938/938 [==============================] - 10s 11ms/step - loss: 2.1931 - accuracy: 0.2398 - val_loss: 1.7386 - val_accuracy: 0.4523
Epoch 3/10
938/938 [==============================] - 10s 11ms/step - loss: 1.3131 - accuracy: 0.5053 - val_loss: 1.0110 - val_accuracy: 0.6492
Epoch 4/10
938/938 [==============================] - 10s 11ms/step - loss: 1.0746 - accuracy: 0.5934 - val_loss: 0.9492 - val_accuracy: 0.6621
Epoch 5/10
938/938 [==============================] - 10s 11ms/step - loss: 1.0528 - accuracy: 0.6029 - val_loss: 0.9439 - val_accuracy: 0.6518
Epoch 6/10
938/938 [==============================] - 11s 11ms/step - loss: 1.0197 - accuracy: 0.6170 - val_loss: 0.9096 - val_accuracy: 0.6622
Epoch 7/10
938/938 [==============================] - 10s 11ms/step - loss: 0.9778 - accuracy: 0.6337 - val_loss: 0.8784 - val_accuracy: 0.6748
Epoch 8/10
938/938 [==============================] - 10s 11ms/step - loss: 0.9534 - accuracy: 0.6486 - val_loss: 0.8665 - val_accuracy: 0.6901
Epoch 9/10
938/938 [==============================] - 10s 11ms/step - loss: 0.9510 - accuracy: 0.6497 - val_loss: 0.8690 - val_accuracy: 0.6856
Epoch 10/10
938/938 [==============================] - 11s 11ms/step - loss: 0.9424 - accuracy: 0.6529 - val_loss: 0.8570 - val_accuracy: 0.6918

As expected, the loss starts higher than usual and then stabilizes as the cycles progress. You can confirm this visually in the plots below.

Visualize losses

(fig, ax) = plt.subplots(2, 1, figsize=(10, 8))

ax[0].plot(no_clr_history.history["loss"], label="train_loss")
ax[0].plot(no_clr_history.history["val_loss"], label="val_loss")
ax[0].set_title("No CLR")
ax[0].set_xlabel("Epochs")
ax[0].set_ylabel("Loss")
ax[0].set_ylim([0, 2.5])
ax[0].legend()

ax[1].plot(clr_history.history["loss"], label="train_loss")
ax[1].plot(clr_history.history["val_loss"], label="val_loss")
ax[1].set_title("CLR")
ax[1].set_xlabel("Epochs")
ax[1].set_ylabel("Loss")
ax[1].set_ylim([0, 2.5])
ax[1].legend()

fig.tight_layout(pad=3.0)
fig.show()

[Plot: training and validation loss per epoch, without CLR (top) and with CLR (bottom)]

Even though you did not see much of an effect from CLR in this toy example, note that it is one of the main ingredients behind Super Convergence and can have a really good impact when training in large-scale settings.