Google I/O is a wrap! Catch up on TensorFlow sessions View sessions

Recommending movies: retrieval using a sequential model

View on TensorFlow.org Run in Google Colab View source on GitHub Download notebook

In this tutorial, we are going to build a sequential retrieval model. Sequential recommendation is a popular model that looks at a sequence of items that users have interacted with previously and then predicts the next item. Here the order of the items within each sequence matters, so we are going to use a recurrent neural network to model the sequential relationship. For more details, please refer to this GRU4Rec paper.

Imports

First let's get our dependencies and imports out of the way.

pip install -q tensorflow-recommenders
pip install -q --upgrade tensorflow-datasets
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

Preparing the dataset

Next, we need to prepare our dataset. We are going to leverage the data generation utility in this TensorFlow Lite On-device Recommendation reference app.

MovieLens 1M data contains ratings.dat (columns: UserID, MovieID, Rating, Timestamp), and movies.dat (columns: MovieID, Title, Genres). The example generation script download the 1M dataset, takes both files, only keep ratings higher than 2, form user movie interaction timelines, sample activities as labels and 10 previous user activities as the context for prediction.

wget -nc https://raw.githubusercontent.com/tensorflow/examples/master/lite/examples/recommendation/ml/data/example_generation_movielens.py
python -m example_generation_movielens  --data_dir=data/raw  --output_dir=data/examples  --min_timeline_length=3  --max_context_length=10  --max_context_movie_genre_length=10  --min_rating=2  --train_data_fraction=0.9  --build_vocabs=False
--2021-12-02 12:10:29--  https://raw.githubusercontent.com/tensorflow/examples/master/lite/examples/recommendation/ml/data/example_generation_movielens.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18040 (18K) [text/plain]
Saving to: ‘example_generation_movielens.py’

example_generation_ 100%[===================>]  17.62K  --.-KB/s    in 0s      

2021-12-02 12:10:29 (107 MB/s) - ‘example_generation_movielens.py’ saved [18040/18040]

I1202 12:10:32.036267 140629273970496 example_generation_movielens.py:460] Downloading and extracting data.
Downloading data from http://files.grouplens.org/datasets/movielens/ml-1m.zip
5922816/5917549 [==============================] - 1s 0us/step
5931008/5917549 [==============================] - 1s 0us/step
I1202 12:10:33.549675 140629273970496 example_generation_movielens.py:406] Reading data to dataframes.
/tmpfs/src/tf_docs_env/lib/python3.7/site-packages/pandas/util/_decorators.py:311: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  return func(*args, **kwargs)
I1202 12:10:37.734699 140629273970496 example_generation_movielens.py:408] Generating movie rating user timelines.
I1202 12:10:40.836473 140629273970496 example_generation_movielens.py:410] Generating train and test examples.
6040/6040 [==============================] - 76s 13ms/step
I1202 12:11:57.162662 140629273970496 example_generation_movielens.py:421] Writing generated training examples.
844195/844195 [==============================] - 14s 17us/step
I1202 12:12:11.266682 140629273970496 example_generation_movielens.py:424] Writing generated testing examples.
93799/93799 [==============================] - 2s 17us/step
I1202 12:12:22.758407 140629273970496 example_generation_movielens.py:473] Generated dataset: {'train_size': 844195, 'test_size': 93799, 'train_file': 'data/examples/train_movielens_1m.tfrecord', 'test_file': 'data/examples/test_movielens_1m.tfrecord'}

Here is a sample of the generated dataset.

0 : {
  features: {
    feature: {
      key  : "context_movie_id"
      value: { int64_list: { value: [ 1124, 2240, 3251, ..., 1268 ] } }
    }
    feature: {
      key  : "context_movie_rating"
      value: { float_list: {value: [ 3.0, 3.0, 4.0, ..., 3.0 ] } }
    }
    feature: {
      key  : "context_movie_year"
      value: { int64_list: { value: [ 1981, 1980, 1985, ..., 1990 ] } }
    }
    feature: {
      key  : "context_movie_genre"
      value: { bytes_list: { value: [ "Drama", "Drama", "Mystery", ..., "UNK" ] } }
    }
    feature: {
      key  : "label_movie_id"
      value: { int64_list: { value: [ 3252 ] }  }
    }
  }
}

You can see that it includes a sequence of context movie IDs, and a label movie ID (next movie), plus context features such as movie year, rating and genre.

In our case we will only be using the sequence of context movie IDs and the label movie ID. You can refer to the Leveraging context features tutorial to learn more about adding additional context features.

train_filename = "./data/examples/train_movielens_1m.tfrecord"
train = tf.data.TFRecordDataset(train_filename)

test_filename = "./data/examples/test_movielens_1m.tfrecord"
test = tf.data.TFRecordDataset(test_filename)

feature_description = {
    'context_movie_id': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(0, 10)),
    'context_movie_rating': tf.io.FixedLenFeature([10], tf.float32, default_value=np.repeat(0, 10)),
    'context_movie_year': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(1980, 10)),
    'context_movie_genre': tf.io.FixedLenFeature([10], tf.string, default_value=np.repeat("Drama", 10)),
    'label_movie_id': tf.io.FixedLenFeature([1], tf.int64, default_value=0),
}

def _parse_function(example_proto):
  return tf.io.parse_single_example(example_proto, feature_description)

train_ds = train.map(_parse_function).map(lambda x: {
    "context_movie_id": tf.strings.as_string(x["context_movie_id"]),
    "label_movie_id": tf.strings.as_string(x["label_movie_id"])
})

test_ds = test.map(_parse_function).map(lambda x: {
    "context_movie_id": tf.strings.as_string(x["context_movie_id"]),
    "label_movie_id": tf.strings.as_string(x["label_movie_id"])
})

for x in train_ds.take(1).as_numpy_iterator():
  pprint.pprint(x)
{'context_movie_id': array([b'2589', b'202', b'1038', b'1767', b'951', b'129', b'1256', b'955',
       b'3097', b'3462'], dtype=object),
 'label_movie_id': array([b'3629'], dtype=object)}

Now our train/test datasets include only a sequence of historical movie IDs and a label of next movie ID. Note that we use [10] as the shape of the features during tf.Example parsing because we specify 10 as the length of context features in the example generateion step.

We need one more thing before we can start building the model - the vocabulary for our movie IDs.

movies = tfds.load("movielens/1m-movies", split='train')
movies = movies.map(lambda x: x["movie_id"])
movie_ids = movies.batch(1_000)
unique_movie_ids = np.unique(np.concatenate(list(movie_ids)))

Implementing a sequential model

In our basic retrieval tutorial, we use one query tower for the user, and the candidate tow for the candidate movie. However, the two-tower architecture is generalizble and not limited to pair. You can also use it to do item-to-item recommendation as we note in the basic retrieval tutorial.

Here we are still going to use the two-tower architecture. Specificially, we use the query tower with a Gated Recurrent Unit (GRU) layer to encode the sequence of historical movies, and keep the same candidate tower for the candidate movie.

embedding_dimension = 32

query_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
      vocabulary=unique_movie_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension), 
    tf.keras.layers.GRU(embedding_dimension),
])

candidate_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_movie_ids, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension)
])

The metrics, task and full model are defined similar to the basic retrieval model.

metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(candidate_model)
)

task = tfrs.tasks.Retrieval(
  metrics=metrics
)

class Model(tfrs.Model):

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self._query_model = query_model
        self._candidate_model = candidate_model

        self._task = task

    def compute_loss(self, features, training=False):
        watch_history = features["context_movie_id"]
        watch_next_label = features["label_movie_id"]

        query_embedding = self._query_model(watch_history)       
        candidate_embedding = self._candidate_model(watch_next_label)

        return self._task(query_embedding, candidate_embedding, compute_metrics=not training)

Fitting and evaluating

We can now compile, train and evaluate our sequential retrieval model.

model = Model(query_model, candidate_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
cached_train = train_ds.shuffle(10_000).batch(12800).cache()
cached_test = test_ds.batch(2560).cache()
model.fit(cached_train, epochs=3)
Epoch 1/3
67/67 [==============================] - 25s 291ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 107448.4467 - regularization_loss: 0.0000e+00 - total_loss: 107448.4467
Epoch 2/3
67/67 [==============================] - 2s 25ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 100932.0125 - regularization_loss: 0.0000e+00 - total_loss: 100932.0125
Epoch 3/3
67/67 [==============================] - 2s 25ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_5_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_10_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_50_categorical_accuracy: 0.0000e+00 - factorized_top_k/top_100_categorical_accuracy: 0.0000e+00 - loss: 99336.2015 - regularization_loss: 0.0000e+00 - total_loss: 99336.2015
<keras.callbacks.History at 0x7f0904d5b410>
model.evaluate(cached_test, return_dict=True)
37/37 [==============================] - 10s 235ms/step - factorized_top_k/top_1_categorical_accuracy: 0.0146 - factorized_top_k/top_5_categorical_accuracy: 0.0780 - factorized_top_k/top_10_categorical_accuracy: 0.1358 - factorized_top_k/top_50_categorical_accuracy: 0.3735 - factorized_top_k/top_100_categorical_accuracy: 0.5058 - loss: 15478.0652 - regularization_loss: 0.0000e+00 - total_loss: 15478.0652
{'factorized_top_k/top_1_categorical_accuracy': 0.014605699107050896,
 'factorized_top_k/top_5_categorical_accuracy': 0.07804987579584122,
 'factorized_top_k/top_10_categorical_accuracy': 0.1358330100774765,
 'factorized_top_k/top_50_categorical_accuracy': 0.3735221028327942,
 'factorized_top_k/top_100_categorical_accuracy': 0.5058262944221497,
 'loss': 9413.1240234375,
 'regularization_loss': 0,
 'total_loss': 9413.1240234375}

This concludes the sequential retrieval tutorial.