Transfer learning for video classification with MoViNet


MoViNets (Mobile Video Networks) provide a family of efficient video classification models, supporting inference on streaming video. In this tutorial, you will use a pre-trained MoViNet model to classify videos, specifically for an action recognition task, from the UCF101 dataset. A pre-trained model is a saved network that was previously trained on a larger dataset. You can find more details about MoViNets in the MoViNets: Mobile Video Networks for Efficient Video Recognition paper by Kondratyuk, D. et al. (2021). In this tutorial, you will:

  • Learn how to download a pre-trained MoViNet model
  • Create a new model using a pre-trained model with a new classifier by freezing the convolutional base of the MoViNet model
  • Replace the classifier head with the number of labels of a new dataset
  • Perform transfer learning on the UCF101 dataset

The model downloaded in this tutorial is from official/projects/movinet. This repository contains a collection of MoViNet models that TF Hub uses in the TensorFlow 2 SavedModel format.

This transfer learning tutorial is the third part in a series of TensorFlow video tutorials. Here are the other three tutorials:

  • Load video data: This tutorial explains much of the code used in this document; in particular, how to preprocess and load data through the FrameGenerator class is explained in more detail.
  • Build a 3D CNN model for video classification. Note that this tutorial uses a (2+1)D CNN that decomposes the spatial and temporal aspects of 3D data; if you are using volumetric data such as an MRI scan, consider using a 3D CNN instead of a (2+1)D CNN.
  • MoViNet for streaming action recognition: Get familiar with the MoViNet models that are available on TF Hub.


Begin by installing and importing some necessary libraries, including: remotezip to inspect the contents of a ZIP file, tqdm to use a progress bar, OpenCV to process video files (ensure that opencv-python and opencv-python-headless are the same version), and TensorFlow models (tf-models-official) to download the pre-trained MoViNet model. The TensorFlow Models package is a collection of models that use TensorFlow’s high-level APIs.

pip install remotezip tqdm opencv-python opencv-python-headless tf-models-official
import tqdm
import random
import pathlib
import itertools
import collections

import cv2
import numpy as np
import remotezip as rz
import seaborn as sns
import matplotlib.pyplot as plt

import keras
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

# Import the MoViNet model components from TensorFlow Models (tf-models-official)
from official.projects.movinet.modeling import movinet
from official.projects.movinet.modeling import movinet_model

Load data

The hidden cell below defines helper functions to download a slice of data from the UCF-101 dataset, and load it into a tf.data.Dataset. The Loading video data tutorial provides a detailed walkthrough of this code.

The FrameGenerator class at the end of the hidden block is the most important utility here. It creates an iterable object that can feed data into the TensorFlow data pipeline. Specifically, this class contains a Python generator that loads the video frames along with their encoded labels. The generator (__call__) function yields the frame array produced by frames_from_video_file and the integer-encoded label associated with the set of frames.

URL = ''
download_dir = pathlib.Path('./UCF101_subset/')
subset_paths = download_ufc_101_subset(URL, 
                        num_classes = 10, 
                        splits = {"train": 30, "test": 20}, 
                        download_dir = download_dir)
train :
100%|██████████| 300/300 [00:22<00:00, 13.06it/s]
test :
100%|██████████| 200/200 [00:14<00:00, 13.92it/s]

Create the training and test datasets:

batch_size = 8
num_frames = 8

output_signature = (tf.TensorSpec(shape = (None, None, None, 3), dtype = tf.float32),
                    tf.TensorSpec(shape = (), dtype = tf.int16))

train_ds =['train'], num_frames, training = True),
                                          output_signature = output_signature)
train_ds = train_ds.batch(batch_size)

test_ds =['test'], num_frames),
                                         output_signature = output_signature)
test_ds = test_ds.batch(batch_size)

The labels generated here represent the encoding of the classes. For instance, 'ApplyEyeMakeup' is mapped to an integer. Take a look at the labels of the training data to ensure that the dataset has been sufficiently shuffled.

for frames, labels in train_ds.take(10):
  print(labels)
tf.Tensor([7 4 2 5 2 7 4 3], shape=(8,), dtype=int16)
tf.Tensor([1 1 6 0 0 5 6 9], shape=(8,), dtype=int16)
tf.Tensor([4 4 1 5 6 5 2 7], shape=(8,), dtype=int16)
tf.Tensor([7 4 0 1 5 6 0 7], shape=(8,), dtype=int16)
tf.Tensor([7 4 7 1 4 7 8 9], shape=(8,), dtype=int16)
tf.Tensor([0 1 2 9 4 6 6 3], shape=(8,), dtype=int16)
tf.Tensor([0 8 9 7 1 6 1 7], shape=(8,), dtype=int16)
tf.Tensor([4 2 1 1 8 0 1 9], shape=(8,), dtype=int16)
tf.Tensor([4 0 8 2 7 9 6 0], shape=(8,), dtype=int16)
tf.Tensor([7 0 9 5 4 6 9 1], shape=(8,), dtype=int16)

Take a look at the shape of the data.

print(f"Shape: {frames.shape}")
print(f"Label: {labels.shape}")
Shape: (8, 8, 224, 224, 3)
Label: (8,)
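The five dimensions can be read as (batch, frames, height, width, channels). A quick sketch, using a zero array as a stand-in for one batch from train_ds:

```python
import numpy as np

# Stand-in for one batch from train_ds: 8 clips, each 8 frames of 224x224 RGB.
batch = np.zeros((8, 8, 224, 224, 3), dtype=np.float32)

batch_size, num_frames, height, width, channels = batch.shape
clip = batch[0]    # one video clip: a stack of frames, shape (8, 224, 224, 3)
frame = clip[0]    # one frame: a single RGB image, shape (224, 224, 3)
print(clip.shape, frame.shape)
```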

Download pre-trained MoViNet model

In this section, you will:

  1. Create a MoViNet model using the open source code provided in official/projects/movinet from TensorFlow Models.
  2. Load the pretrained weights.
  3. Freeze the convolutional base (all layers except the final classifier head) in order to speed up fine-tuning.

To build the model, you can start with the a0 configuration because it is the fastest to train when benchmarked against other models. Check out the available models to see what might work for your use-case.

model_id = 'a0'
resolution = 224


backbone = movinet.Movinet(model_id=model_id)
backbone.trainable = False

# Set num_classes=600 to load the pre-trained weights from the original model
model = movinet_model.MovinetClassifier(backbone=backbone, num_classes=600)[None, None, None, None, 3])

# Load pre-trained weights
!wget -O movinet_a0_base.tar.gz -q
!tar -xvf movinet_a0_base.tar.gz

checkpoint_dir = f'movinet_{model_id}_base'
checkpoint_path = tf.train.latest_checkpoint(checkpoint_dir)
checkpoint = tf.train.Checkpoint(model=model)
status = checkpoint.restore(checkpoint_path)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/ Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block.
<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x7f8e5c76f3a0>

To build a classifier, create a function that takes the backbone and the number of classes in a dataset. In this case, the new classifier will have num_classes outputs (10 classes for this subset of UCF101).

def build_classifier(batch_size, num_frames, resolution, backbone, num_classes):
  """Builds a classifier on top of a backbone model."""
  model = movinet_model.MovinetClassifier(
      backbone=backbone,
      num_classes=num_classes)[batch_size, num_frames, resolution, resolution, 3])

  return model

model = build_classifier(batch_size, num_frames, resolution, backbone, 10)

For this tutorial, choose the tf.keras.optimizers.Adam optimizer and the tf.keras.losses.SparseCategoricalCrossentropy loss function. Use the metrics argument to view the accuracy of the model's performance at every step.

num_epochs = 2

loss_obj = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)

model.compile(loss=loss_obj, optimizer=optimizer, metrics=['accuracy'])
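The from_logits=True setting matters here because the classifier outputs raw logits rather than probabilities. As an illustration of what this loss computes, here is sparse categorical crossentropy worked out by hand in NumPy on made-up logits and labels:

```python
import numpy as np

# Hypothetical logits for a batch of 2 clips over 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
labels = np.array([0, 1])  # integer labels, not one-hot vectors

# Softmax converts logits to probabilities; from_logits=True tells the
# Keras loss to perform this step internally.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

# Sparse categorical crossentropy: -log(probability of the true class).
loss = -np.log(probs[np.arange(len(labels)), labels])
print(loss.mean())
```

If the model already emitted probabilities, applying softmax a second time inside the loss would distort the gradients, which is why the flag must match the model's output.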

Train the model. After two epochs, observe a low loss with high accuracy for both the training and test sets.

results =,
                    validation_data=test_ds,
                    epochs=num_epochs)
Epoch 1/2
38/38 [==============================] - 77s 2s/step - loss: 0.8840 - accuracy: 0.8333 - val_loss: 0.2155 - val_accuracy: 0.9150
Epoch 2/2
38/38 [==============================] - 51s 1s/step - loss: 0.1045 - accuracy: 0.9600 - val_loss: 0.1233 - val_accuracy: 0.9650

Evaluate the model

The model achieved high accuracy on the training dataset. Next, use Keras Model.evaluate to evaluate it on the test set.

model.evaluate(test_ds, return_dict=True)
25/25 [==============================] - 20s 790ms/step - loss: 0.1459 - accuracy: 0.9600
{'loss': 0.14587977528572083, 'accuracy': 0.9599999785423279}

To visualize model performance further, use a confusion matrix. The confusion matrix allows you to assess the performance of the classification model beyond accuracy. In order to build the confusion matrix for this multi-class classification problem, get the actual values in the test set and the predicted values.
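To make the matrix layout concrete, here is the same computation done by hand in NumPy on made-up labels; tf.math.confusion_matrix follows the same convention, where row index is the actual class and column index is the predicted class:

```python
import numpy as np

# Hypothetical actual vs. predicted labels for a 3-class problem.
actual    = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0])

# cm[i, j] counts examples whose true class is i and predicted class is j.
num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
for a, p in zip(actual, predicted):
  cm[a, p] += 1

print(cm)
# The diagonal holds correct predictions, so accuracy is trace / total.
print(np.trace(cm) / cm.sum())
```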

def get_actual_predicted_labels(dataset):
  """Create a list of actual ground truth values and the predictions from the model.

  Args:
    dataset: An iterable data structure, such as a TensorFlow Dataset, with features and labels.

  Returns:
    Ground truth and predicted values for a particular dataset.
  """
  actual = [labels for _, labels in dataset.unbatch()]
  predicted = model.predict(dataset)

  actual = tf.stack(actual, axis=0)
  predicted = tf.concat(predicted, axis=0)
  predicted = tf.argmax(predicted, axis=1)

  return actual, predicted
def plot_confusion_matrix(actual, predicted, labels, ds_type):
  cm = tf.math.confusion_matrix(actual, predicted)
  ax = sns.heatmap(cm, annot=True, fmt='g')
  sns.set(rc={'figure.figsize':(12, 12)})
  ax.set_title('Confusion matrix of action recognition for ' + ds_type)
  ax.set_xlabel('Predicted Action')
  ax.set_ylabel('Actual Action')
  ax.xaxis.set_ticklabels(labels)
  ax.yaxis.set_ticklabels(labels)
fg = FrameGenerator(subset_paths['train'], num_frames, training = True)
label_names = list(fg.class_ids_for_name.keys())
actual, predicted = get_actual_predicted_labels(test_ds)
plot_confusion_matrix(actual, predicted, label_names, 'test')
25/25 [==============================] - 24s 747ms/step


Next steps

Now that you have some familiarity with the MoViNet model and how to leverage various TensorFlow APIs (for example, for transfer learning), try using the code in this tutorial with your own dataset. The data does not have to be limited to video data. Volumetric data, such as MRI scans, can also be used with 3D CNNs. The NUSDAT and IMH datasets mentioned in Brain MRI-based 3D Convolutional Neural Networks for Classification of Schizophrenia and Controls could be two such sources for MRI data.

In particular, the FrameGenerator class used in this tutorial and the other video data and classification tutorials will help you load data into your models.

To learn more about working with video data in TensorFlow, check out the following tutorials: