movie_lens

  • Description:

This dataset contains a set of movie ratings from the MovieLens website, a movie recommendation service. This dataset was collected and maintained by GroupLens, a research group at the University of Minnesota. There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m". In all datasets, the movies data and ratings data are joined on "movieId". The 25m dataset, latest-small dataset, and 20m dataset contain only movie data and rating data. The 1m dataset and 100k dataset contain demographic data in addition to movie and rating data.

  • "25m": This is the latest stable version of the MovieLens dataset. It is recommended for research purposes.
  • "latest-small": This is a small subset of the latest version of the MovieLens dataset. It is changed and updated over time by GroupLens.
  • "100k": This is the oldest version of the MovieLens datasets. It is a small dataset with demographic data.
  • "1m": This is the largest MovieLens dataset that contains demographic data.
  • "20m": This is one of the most used MovieLens datasets in academic papers along with the 1m dataset.

For each version, users can view either only the movies data by adding the "-movies" suffix (e.g. "25m-movies") or the ratings data joined with the movies data (and users data in the 1m and 100k datasets) by adding the "-ratings" suffix (e.g. "25m-ratings").
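The naming scheme above can be sketched as a small helper. This is a minimal illustration, not part of the dataset's own tooling: `config_name` and `load_config` are hypothetical names introduced here, and `load_config` assumes that the `tensorflow_datasets` package is installed and that the config names resolve under the `movie_lens/` prefix used in the section titles of this page.

```python
# The versions and suffixes are taken from the description above.
VERSIONS = ("25m", "latest-small", "100k", "1m", "20m")
SUFFIXES = ("ratings", "movies")


def config_name(version: str, suffix: str = "ratings") -> str:
    """Build a name such as 'movie_lens/25m-ratings' (hypothetical helper)."""
    if version not in VERSIONS:
        raise ValueError(f"unknown version: {version!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return f"movie_lens/{version}-{suffix}"


def load_config(version: str, suffix: str = "ratings"):
    """Load the 'train' split of one config; downloads data on first use."""
    # Imported lazily so that config_name() works without TensorFlow Datasets.
    import tensorflow_datasets as tfds

    return tfds.load(config_name(version, suffix), split="train")


print(config_name("25m", "movies"))
```

Calling `load_config("latest-small", "movies")` would be the cheapest way to try this out, since that config has the smallest download size listed on this page.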

The features below are included in all versions with the "-ratings" suffix.

  • "movie_id": a unique identifier of the rated movie
  • "movie_title": the title of the rated movie with the release year in parentheses
  • "movie_genres": a sequence of genres to which the rated movie belongs
  • "user_id": a unique identifier of the user who made the rating
  • "user_rating": the score of the rating on a five-star scale
  • "timestamp": the timestamp of the ratings, represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
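Two of the features above usually need a little post-processing: "movie_title" carries the release year in parentheses, and "timestamp" is a count of seconds since the Unix epoch. The sketch below shows one possible way to decode both with the standard library. The function names are hypothetical, the sample title is only an illustration, and the code assumes the values are already Python `str` and `int` objects (values read through TensorFlow Datasets may arrive as bytes and would need decoding first).

```python
import re
from datetime import datetime, timezone

# Matches a trailing four-digit year in parentheses, e.g. "Title (1995)".
_TITLE_YEAR = re.compile(r"^(?P<title>.*\S)\s*\((?P<year>\d{4})\)\s*$")


def split_title_year(movie_title: str):
    """Split 'Title (1995)' into ('Title', 1995); year is None if absent."""
    match = _TITLE_YEAR.match(movie_title)
    if match is None:
        return movie_title.strip(), None
    return match.group("title"), int(match.group("year"))


def timestamp_to_utc(timestamp: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


print(split_title_year("Toy Story (1995)"))
print(timestamp_to_utc(0).isoformat())
```

Titles without a parenthesized year are returned unchanged with `None` as the year, so the helper does not raise on unexpected input.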

The "100k-ratings" and "1m-ratings" versions additionally include the following demographic features.

  • "user_gender": gender of the user who made the rating; a true value corresponds to male
  • "bucketized_user_age": the bucketized age of the user who made the rating; the values and their corresponding ranges are:
    • 1: "Under 18"
    • 18: "18-24"
    • 25: "25-34"
    • 35: "35-44"
    • 45: "45-49"
    • 50: "50-55"
    • 56: "56+"
  • "user_occupation_label": the occupation of the user who made the rating represented by an integer-encoded label; labels are preprocessed to be consistent across different versions
  • "user_occupation_text": the occupation of the user who made the rating, as the original string; different versions can have different sets of raw text labels
  • "user_zip_code": the zip code of the user who made the rating

In addition, the "100k-ratings" dataset includes a "raw_user_age" feature, which is the exact age of the user who made the rating.
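The relationship between an exact age and the bucket values listed above can be sketched as follows. This is an illustration under stated assumptions, not the dataset builder's actual code: `bucketize_age` and `age_range_label` are hypothetical helpers, and the mapping rule (each age is replaced by the lowest value of its range, with anything below 18 mapped to 1) is inferred from the ranges in the list.

```python
from bisect import bisect_right

# Bucket value -> age range, copied from the list above.
AGE_BUCKETS = {
    1: "Under 18",
    18: "18-24",
    25: "25-34",
    35: "35-44",
    45: "45-49",
    50: "50-55",
    56: "56+",
}
_LOWER_BOUNDS = sorted(AGE_BUCKETS)  # [1, 18, 25, 35, 45, 50, 56]


def bucketize_age(raw_age: float) -> int:
    """Map an exact age to the lowest value of its range (assumed rule)."""
    index = bisect_right(_LOWER_BOUNDS, raw_age) - 1
    return _LOWER_BOUNDS[max(index, 0)]


def age_range_label(bucketized_user_age: float) -> str:
    """Return the range text for a bucket value such as 25.0."""
    return AGE_BUCKETS[int(bucketized_user_age)]


print(bucketize_age(49), age_range_label(bucketize_age(49)))
```

The `int(...)` conversion in `age_range_label` is there because the bucket values are stored as floating-point numbers in the feature structures on this page.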

Datasets with the "-movies" suffix contain only "movie_id", "movie_title", and "movie_genres" features.
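To illustrate how the "-ratings" features relate to the "-movies" features, the join on the movie identifier could be sketched in plain Python as below. The function name and the sample records are made up for illustration (the genre integers and the rating values are not real data), and skipping ratings with an unknown movie is an assumption rather than documented behavior.

```python
def join_ratings_with_movies(ratings, movies):
    """Attach movie features to each rating by matching on 'movie_id'."""
    movies_by_id = {movie["movie_id"]: movie for movie in movies}
    joined = []
    for rating in ratings:
        movie = movies_by_id.get(rating["movie_id"])
        if movie is None:
            continue  # assumption: drop ratings whose movie is unknown
        joined.append({**movie, **rating})
    return joined


# Hypothetical sample records using the feature names from this page.
sample_movies = [
    {"movie_id": "1", "movie_title": "Example Movie (1995)", "movie_genres": [0, 3]},
]
sample_ratings = [
    {"movie_id": "1", "user_id": "42", "user_rating": 4.0, "timestamp": 86400},
    {"movie_id": "999", "user_id": "42", "user_rating": 2.5, "timestamp": 86401},
]

for row in join_ratings_with_movies(sample_ratings, sample_movies):
    print(sorted(row))
```

Each joined row carries the six features listed for the "-ratings" configs, which is why the "-movies" configs can be seen as the movie side of the same join.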

@article{10.1145/2827872,
author = {Harper, F. Maxwell and Konstan, Joseph A.},
title = {The MovieLens Datasets: History and Context},
year = {2015},
issue_date = {January 2016},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {5},
number = {4},
issn = {2160-6455},
url = {https://doi.org/10.1145/2827872},
doi = {10.1145/2827872},
journal = {ACM Trans. Interact. Intell. Syst.},
month = dec,
articleno = {19},
numpages = {19},
keywords = {Datasets, recommendations, ratings, MovieLens}
}

movie_lens/25m-ratings (default config)

  • Config description: This dataset contains 25,000,095 ratings across 62,423 movies, created by 162,541 users between January 09, 1995 and November 21, 2019. This dataset is the latest stable version of the MovieLens dataset, generated on November 21, 2019.

Each user has rated at least 20 movies. The ratings are in half-star increments. This dataset does not include demographic data.
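The rating increments can be checked with a short helper. This is a minimal sketch: `is_valid_rating` is a hypothetical name, and it assumes that the lowest possible rating equals one increment (0.5 for half-star configs, 1.0 for whole-star configs such as "100k-ratings" and "1m-ratings") and that the highest is 5.0.

```python
def is_valid_rating(user_rating: float, half_star: bool = True) -> bool:
    """Check that a rating lies on the five-star scale at the given increment."""
    step = 0.5 if half_star else 1.0
    value = float(user_rating)  # also accepts NumPy scalars
    if not (step <= value <= 5.0):
        return False
    # A valid rating is a whole multiple of the increment.
    return (value / step).is_integer()


print(is_valid_rating(3.5), is_valid_rating(3.5, half_star=False))
```

Passing `half_star=False` applies the stricter whole-star rule, so the same helper could be reused for the configs that include demographic data.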

  • Download size: 249.84 MiB

  • Dataset size: 3.89 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'train' 25,000,095
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string
timestamp              Tensor                         int64
user_id                Tensor                         string
user_rating            Tensor                         float32

movie_lens/25m-movies

  • Config description: This dataset contains data of 62,423 movies rated in the 25m dataset.

  • Download size: 249.84 MiB

  • Dataset size: 5.71 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 62,423
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string

movie_lens/latest-small-ratings

  • Config description: This dataset contains 100,836 ratings across 9,742 movies, created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018 and is a subset of the full latest version of the MovieLens dataset. This dataset is changed and updated over time.

Each user has rated at least 20 movies. The ratings are in half-star increments. This dataset does not include demographic data.
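The "at least 20 movies per user" property can be verified by counting examples per user. The sketch below is one way to do that over any iterable of rating records; `users_below_minimum` is a hypothetical helper, and it assumes each record is a mapping with a "user_id" key and that each record is a distinct rating.

```python
from collections import Counter


def users_below_minimum(examples, minimum: int = 20):
    """Return {user_id: count} for users with fewer than `minimum` ratings."""
    counts = Counter(example["user_id"] for example in examples)
    return {user: n for user, n in counts.items() if n < minimum}


# Synthetic records: user "a" has 20 ratings, user "b" has only 3.
synthetic = [{"user_id": "a"}] * 20 + [{"user_id": "b"}] * 3
print(users_below_minimum(synthetic))
```

An empty result on the real data would be consistent with the statement above; a non-empty one would point at filtering or loading problems.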

  • Download size: 955.28 KiB

  • Dataset size: 15.82 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 100,836
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string
timestamp              Tensor                         int64
user_id                Tensor                         string
user_rating            Tensor                         float32

movie_lens/latest-small-movies

  • Config description: This dataset contains data of 9,742 movies rated in the latest-small dataset.

  • Download size: 955.28 KiB

  • Dataset size: 910.64 KiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 9,742
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string

movie_lens/100k-ratings

  • Config description: This dataset contains 100,000 ratings from 943 users on 1,682 movies. This dataset is the oldest version of the MovieLens dataset.

Each user has rated at least 20 movies. Ratings are in whole-star increments. This dataset contains demographic data of users in addition to data on movies and ratings.

  • Download size: 4.70 MiB

  • Dataset size: 32.41 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 100,000
  • Feature structure:
FeaturesDict({
    'bucketized_user_age': float32,
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'raw_user_age': float32,
    'timestamp': int64,
    'user_gender': bool,
    'user_id': string,
    'user_occupation_label': ClassLabel(shape=(), dtype=int64, num_classes=22),
    'user_occupation_text': string,
    'user_rating': float32,
    'user_zip_code': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
bucketized_user_age    Tensor                         float32
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string
raw_user_age           Tensor                         float32
timestamp              Tensor                         int64
user_gender            Tensor                         bool
user_id                Tensor                         string
user_occupation_label  ClassLabel                     int64
user_occupation_text   Tensor                         string
user_rating            Tensor                         float32
user_zip_code          Tensor                         string

movie_lens/100k-movies

  • Config description: This dataset contains data of 1,682 movies rated in the 100k dataset.

  • Download size: 4.70 MiB

  • Dataset size: 150.35 KiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 1,682
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string

movie_lens/1m-ratings

  • Config description: This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. This dataset is the largest dataset that includes demographic data.

Each user has rated at least 20 movies. Ratings are in whole-star increments. In demographic data, age values are divided into ranges and the lowest age value for each range is used in the data instead of the actual values.

  • Download size: 5.64 MiB

  • Dataset size: 308.42 MiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'train' 1,000,209
  • Feature structure:
FeaturesDict({
    'bucketized_user_age': float32,
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_gender': bool,
    'user_id': string,
    'user_occupation_label': ClassLabel(shape=(), dtype=int64, num_classes=22),
    'user_occupation_text': string,
    'user_rating': float32,
    'user_zip_code': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
bucketized_user_age    Tensor                         float32
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string
timestamp              Tensor                         int64
user_gender            Tensor                         bool
user_id                Tensor                         string
user_occupation_label  ClassLabel                     int64
user_occupation_text   Tensor                         string
user_rating            Tensor                         float32
user_zip_code          Tensor                         string

movie_lens/1m-movies

  • Config description: This dataset contains data of approximately 3,900 movies rated in the 1m dataset.

  • Download size: 5.64 MiB

  • Dataset size: 351.12 KiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 3,883
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string

movie_lens/20m-ratings

  • Config description: This dataset contains 20,000,263 ratings across 27,278 movies, created by 138,493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Each user has rated at least 20 movies. Ratings are in half-star increments. This dataset does not contain demographic data.

  • Download size: 189.50 MiB

  • Dataset size: 3.10 GiB

  • Auto-cached (documentation): No

  • Splits:

Split Examples
'train' 20,000,263
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string
timestamp              Tensor                         int64
user_id                Tensor                         string
user_rating            Tensor                         float32

movie_lens/20m-movies

  • Config description: This dataset contains data of 27,278 movies rated in the 20m dataset.

  • Download size: 189.50 MiB

  • Dataset size: 2.55 MiB

  • Auto-cached (documentation): Yes

  • Splits:

Split Examples
'train' 27,278
  • Feature structure:
FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})
  • Feature documentation:
Feature                Class                 Shape    Dtype    Description
                       FeaturesDict
movie_genres           Sequence(ClassLabel)  (None,)  int64
movie_id               Tensor                         string
movie_title            Tensor                         string