Action Recognition with an Inflated 3D CNN


This Colab demonstrates recognizing actions in video data using the tfhub.dev/deepmind/i3d-kinetics-400/1 module. More models for detecting actions in videos are available on TF Hub.

The underlying model is described in the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" by Joao Carreira and Andrew Zisserman. The paper was posted on arXiv in May 2017, and was published as a CVPR 2017 conference paper. The source code is publicly available on github.

"Quo Vadis" introduced a new architecture for video classification, the Inflated 3D Convnet or I3D. This architecture achieved state-of-the-art results on the UCF101 and HMDB51 datasets from fine-tuning these models. I3D models pre-trained on Kinetics also placed first in the CVPR 2017 Charades challenge.

The original module was trained on the kinetics-400 dataset and can recognize 400 different actions. Labels for these actions can be found in the label map file.

In this Colab we will use it to recognize activities in videos from the UCF101 dataset.

Setup

pip install -q imageio
pip install -q opencv-python
pip install -q git+https://github.com/tensorflow/docs

Import the necessary modules
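This cell is collapsed in the original notebook. A minimal sketch of the imports the rest of the code relies on (tf, hub, cv2, numpy, imageio, and the tensorflow_docs embed helper) might look like this:

# A sketch of the hidden import cell.
from absl import logging
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow_docs.vis import embed

logging.set_verbosity(logging.ERROR)

# Modules for reading the UCF101 dataset.
import os
import re
import ssl
import tempfile
from urllib import request

import cv2
import numpy as np

# imageio renders the animated GIF previews.
import imageio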

Helper functions for the UCF101 dataset
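These helpers are also defined in a hidden cell. Below is a sketch of plausible implementations of list_ucf_videos, fetch_ucf_video, load_video, and to_gif; the UCF_ROOT URL and the index regular expression are assumptions about how the UCF101 file listing is served, while the 224x224 center-crop and [0, 1] scaling match the input the I3D module expects.

# Assumed location of the UCF101 video listing.
UCF_ROOT = "https://www.crcv.ucf.edu/THUMOS14/UCF101/UCF101/"
_VIDEO_LIST = None
_CACHE_DIR = tempfile.mkdtemp()
# The UCF101 server has had certificate issues, so skip verification.
unverified_context = ssl._create_unverified_context()

def list_ucf_videos():
  """Lists the videos available in the UCF101 dataset."""
  global _VIDEO_LIST
  if not _VIDEO_LIST:
    index = request.urlopen(UCF_ROOT, context=unverified_context).read().decode("utf-8")
    videos = re.findall(r"(v_[\w_]+\.avi)", index)
    _VIDEO_LIST = sorted(set(videos))
  return list(_VIDEO_LIST)

def fetch_ucf_video(video):
  """Downloads a video and caches it on the local filesystem."""
  cache_path = os.path.join(_CACHE_DIR, video)
  if not os.path.exists(cache_path):
    urlpath = UCF_ROOT + video
    print("Fetching %s => %s" % (urlpath, cache_path))
    data = request.urlopen(urlpath, context=unverified_context).read()
    open(cache_path, "wb").write(data)
  return cache_path

def crop_center_square(frame):
  """Crops the largest centered square out of a frame."""
  y, x = frame.shape[0:2]
  min_dim = min(y, x)
  start_x = (x // 2) - (min_dim // 2)
  start_y = (y // 2) - (min_dim // 2)
  return frame[start_y : start_y + min_dim, start_x : start_x + min_dim]

def load_video(path, max_frames=0, resize=(224, 224)):
  """Reads a video into a (frames, 224, 224, 3) float array in [0, 1]."""
  cap = cv2.VideoCapture(path)
  frames = []
  try:
    while True:
      ret, frame = cap.read()
      if not ret:
        break
      frame = crop_center_square(frame)
      frame = cv2.resize(frame, resize)
      frame = frame[:, :, [2, 1, 0]]  # OpenCV reads BGR; convert to RGB.
      frames.append(frame)
      if len(frames) == max_frames:
        break
  finally:
    cap.release()
  return np.array(frames) / 255.0

def to_gif(images):
  """Renders a [0, 1] float frame array as an embedded GIF."""
  converted_images = np.clip(images * 255, 0, 255).astype(np.uint8)
  imageio.mimsave("./animation.gif", converted_images, fps=25)
  return embed.embed_file("./animation.gif")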

Get the kinetics-400 labels
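The label-map cell is hidden as well. One way to load the 400 Kinetics labels is to fetch the label_map.txt published with the paper's source code; the URL below points at the deepmind/kinetics-i3d repository:

# Sketch of the hidden cell: download the label map shipped with the
# kinetics-i3d repository and read one label per line.
KINETICS_URL = "https://raw.githubusercontent.com/deepmind/kinetics-i3d/master/data/label_map.txt"
with request.urlopen(KINETICS_URL) as obj:
  labels = [line.decode("utf-8").strip() for line in obj.readlines()]
print("Found %d labels." % len(labels))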

Using the UCF101 dataset

# Get the list of videos in the dataset.
ucf_videos = list_ucf_videos()

categories = {}
for video in ucf_videos:
  # UCF101 file names look like "v_CricketShot_g04_c02.avi"; strip the
  # "v_" prefix and the "_gXX_cYY.avi" suffix to get the category name.
  category = video[2:-12]
  if category not in categories:
    categories[category] = []
  categories[category].append(video)
print("Found %d videos in %d categories." % (len(ucf_videos), len(categories)))

for category, sequences in categories.items():
  summary = ", ".join(sequences[:2])
  print("%-20s %4d videos (%s, ...)" % (category, len(sequences), summary))
# Get a sample cricket video.
video_path = fetch_ucf_video("v_CricketShot_g04_c02.avi")
sample_video = load_video(video_path)
sample_video.shape

i3d = hub.load("https://tfhub.dev/deepmind/i3d-kinetics-400/1").signatures['default']

Run the i3d model and print the top-5 action predictions.

def predict(sample_video):
  # Add a batch axis to the sample video.
  model_input = tf.constant(sample_video, dtype=tf.float32)[tf.newaxis, ...]

  logits = i3d(model_input)['default'][0]
  probabilities = tf.nn.softmax(logits)

  print("Top 5 actions:")
  for i in np.argsort(probabilities)[::-1][:5]:
    print(f"  {labels[i]:22}: {probabilities[i] * 100:5.2f}%")

predict(sample_video)

Now try a new video from https://commons.wikimedia.org/wiki/Category:Videos_of_sports.

How about this video by Patrick Gillett:

curl -O https://upload.wikimedia.org/wikipedia/commons/8/86/End_of_a_jam.ogv
video_path = "End_of_a_jam.ogv"
sample_video = load_video(video_path)[:100]
sample_video.shape
to_gif(sample_video)
predict(sample_video)