
movie_lens

Description:

This dataset contains a set of movie ratings from the MovieLens website, a movie recommendation service. This dataset was collected and maintained by GroupLens, a research group at the University of Minnesota. There are 5 versions included: "25m", "latest-small", "100k", "1m", "20m". In all datasets, the movies data and ratings data are joined on "movieId". The 25m dataset, latest-small dataset, and 20m dataset contain only movie data and rating data. The 1m dataset and 100k dataset contain demographic data in addition to movie and rating data.

"25m": This is the latest stable version of the MovieLens dataset. It is recommended for research purposes.
"latest-small": This is a small subset of the latest version of the MovieLens dataset. It is changed and updated over time by GroupLens.
"100k": This is the oldest version of the MovieLens datasets. It is a small dataset with demographic data.
"1m": This is the largest MovieLens dataset that contains demographic data.
"20m": This is one of the most used MovieLens datasets in academic papers along with the 1m dataset.

For each version, users can view either only the movies data by adding the "-movies" suffix (e.g. "25m-movies") or the ratings data joined with the movies data (and users data in the 1m and 100k datasets) by adding the "-ratings" suffix (e.g. "25m-ratings").
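The naming scheme above can be sketched in a few lines of Python. The helper names `config_names` and `load_config` are hypothetical and only illustrate how a version and a suffix combine; `load_config` assumes the `tensorflow-datasets` package is installed and that the data can be downloaded.

```python
# Versions and suffixes as listed in the description above.
VERSIONS = ["25m", "latest-small", "100k", "1m", "20m"]
SUFFIXES = ["ratings", "movies"]


def config_names():
    """Return every '<version>-<suffix>' config name, e.g. '25m-ratings'."""
    return [f"{version}-{suffix}" for version in VERSIONS for suffix in SUFFIXES]


def load_config(config, split="train"):
    """Load one config through TFDS (hypothetical helper, not part of TFDS)."""
    # Deferred import so that listing config names needs no extra packages.
    import tensorflow_datasets as tfds

    return tfds.load(f"movie_lens/{config}", split=split)
```

For example, `load_config("100k-movies")` would request the movies-only view of the 100k version.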

The features below are included in all versions with the "-ratings" suffix.

"movie_id": a unique identifier of the rated movie
"movie_title": the title of the rated movie with the release year in parentheses
"movie_genres": a sequence of genres to which the rated movie belongs
"user_id": a unique identifier of the user who made the rating
"user_rating": the score of the rating on a five-star scale
"timestamp": the timestamp of the ratings, represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

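Two of the features above encode more than one piece of information: "movie_title" carries the release year, and "timestamp" is a count of seconds since the Unix epoch. A minimal sketch of how they might be unpacked follows; the helper names and the sample title are illustrative assumptions, and titles without a trailing year are returned unchanged.

```python
import re
from datetime import datetime, timezone

# Matches a trailing four-digit year in parentheses, e.g. "Toy Story (1995)".
_TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)\s*$")


def split_title(movie_title):
    """Split 'Name (YYYY)' into (name, year); year is None if no year is found."""
    match = _TITLE_YEAR.match(movie_title)
    if match is None:
        return movie_title, None
    return match.group("title"), int(match.group("year"))


def timestamp_to_utc(timestamp):
    """Convert seconds since 1970-01-01 00:00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```
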
The "100k-ratings" and "1m-ratings" versions additionally include the following demographic features.

"user_gender": gender of the user who made the rating; a true value corresponds to male
"bucketized_user_age": bucketized age value of the user who made the rating; the values and the corresponding ranges are:
- 1: "Under 18"
- 18: "18-24"
- 25: "25-34"
- 35: "35-44"
- 45: "45-49"
- 50: "50-55"
- 56: "56+"
"user_occupation_label": the occupation of the user who made the rating represented by an integer-encoded label; labels are preprocessed to be consistent across different versions
"user_occupation_text": the occupation of the user who made the rating as the original string; different versions can have different sets of raw text labels
"user_zip_code": the zip code of the user who made the rating

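The age buckets listed above map an exact age to the lowest age of its range. The following is a sketch of that mapping, not necessarily how the dataset builder computes it; the function names are hypothetical, and clamping ages below 1 into the first bucket is an assumption.

```python
from bisect import bisect_right

# Bucket value (lowest age of the range) and label, as listed above.
AGE_BUCKETS = [
    (1, "Under 18"),
    (18, "18-24"),
    (25, "25-34"),
    (35, "35-44"),
    (45, "45-49"),
    (50, "50-55"),
    (56, "56+"),
]
_LOWER_BOUNDS = [lower for lower, _ in AGE_BUCKETS]
_LABELS = dict(AGE_BUCKETS)


def bucketize_age(raw_age):
    """Return the bucket value for an exact age."""
    # Index of the last lower bound that is <= raw_age; clamp to the first
    # bucket for ages below 1 (an assumption, not stated in the description).
    index = max(bisect_right(_LOWER_BOUNDS, raw_age) - 1, 0)
    return _LOWER_BOUNDS[index]


def bucket_label(bucket_value):
    """Return the range label for a bucket value such as 25."""
    return _LABELS[bucket_value]
```
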
In addition, the "100k-ratings" dataset has a "raw_user_age" feature, which is the exact age of the user who made the rating.

Datasets with the "-movies" suffix contain only "movie_id", "movie_title", and "movie_genres" features.
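The relationship between the two kinds of configs can be illustrated with plain dictionaries. The "-ratings" configs already contain the joined result, so this sketch only shows what joining on the movie identifier means; the function name and the sample records in the usage note are hypothetical.

```python
def join_ratings_with_movies(ratings, movies):
    """Attach movie_title and movie_genres to each rating by matching movie_id."""
    # Index the movies once so each rating is resolved with a dictionary lookup.
    movies_by_id = {movie["movie_id"]: movie for movie in movies}
    joined = []
    for rating in ratings:
        movie = movies_by_id[rating["movie_id"]]
        joined.append(
            {
                **rating,
                "movie_title": movie["movie_title"],
                "movie_genres": movie["movie_genres"],
            }
        )
    return joined
```

Given a movie record with "movie_id" equal to "1" and a rating record that refers to the same "movie_id", the result is a single record that carries the rating fields together with the title and genres of that movie.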

Homepage: https://grouplens.org/datasets/movielens/
Source code: tfds.structured.MovieLens
Versions:
- 0.1.1 (default): No release notes.
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@article{10.1145/2827872,
author = {Harper, F. Maxwell and Konstan, Joseph A.},
title = {The MovieLens Datasets: History and Context},
year = {2015},
issue_date = {January 2016},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {5},
number = {4},
issn = {2160-6455},
url = {https://doi.org/10.1145/2827872},
doi = {10.1145/2827872},
journal = {ACM Trans. Interact. Intell. Syst.},
month = dec,
articleno = {19},
numpages = {19},
keywords = {Datasets, recommendations, ratings, MovieLens}
}

movie_lens/25m-ratings (default config)

Config description: This dataset contains 25,000,095 ratings across 62,423 movies, created by 162,541 users between January 09, 1995 and November 21, 2019. This dataset is the latest stable version of the MovieLens dataset, generated on November 21, 2019.

Each user has rated at least 20 movies. The ratings are in half-star increments. This dataset does not include demographic data.
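The two properties stated above can be checked on loaded data with a short sketch like the one below. The function names are hypothetical, and treating one increment as the lowest possible rating is an assumption that the description does not state.

```python
from collections import Counter


def is_valid_rating(rating, step=0.5, max_stars=5.0):
    """Check that a rating lies on the five-star scale in multiples of `step`."""
    # Assumption: the lowest rating equals one increment (0.5 for half-star data).
    if not step <= rating <= max_stars:
        return False
    return (float(rating) / step).is_integer()


def users_below_minimum(user_ids, minimum=20):
    """Return the users that appear in fewer than `minimum` ratings."""
    counts = Counter(user_ids)
    return sorted(user for user, count in counts.items() if count < minimum)
```

For versions with whole-star ratings, `step=1.0` would be the matching setting.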

Download size: 249.84 MiB
Dataset size: 3.89 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	25,000,095

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string
timestamp	Tensor		int64
user_id	Tensor		string
user_rating	Tensor		float32
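When examples are iterated as NumPy values (for instance through `tfds.as_numpy`), string features such as "movie_id" and "movie_title" typically arrive as `bytes` rather than `str`. The following sketch, with a hypothetical helper name and made-up sample values in the usage note, decodes them and leaves the other features untouched.

```python
def decode_example(example):
    """Return a copy of `example` with bytes values decoded as UTF-8 text."""
    decoded = {}
    for name, value in example.items():
        if isinstance(value, bytes):
            decoded[name] = value.decode("utf-8")
        else:
            # Numeric features and genre sequences are passed through as-is.
            decoded[name] = value
    return decoded
```

Applied to a record such as `{"movie_id": b"1", "user_rating": 4.0}`, the helper returns the same record with "movie_id" as the text "1".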

Examples (tfds.as_dataframe):

movie_lens/25m-movies

Config description: This dataset contains data of 62,423 movies rated in the 25m dataset.
Download size: 249.84 MiB
Dataset size: 5.71 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	62,423

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/latest-small-ratings

Config description: This dataset contains 100,836 ratings across 9,742 movies, created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018 and is a subset of the full latest version of the MovieLens dataset. This dataset is changed and updated over time.

Each user has rated at least 20 movies. The ratings are in half-star increments. This dataset does not include demographic data.

Download size: 955.28 KiB
Dataset size: 15.82 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	100,836

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string
timestamp	Tensor		int64
user_id	Tensor		string
user_rating	Tensor		float32

Examples (tfds.as_dataframe):

movie_lens/latest-small-movies

Config description: This dataset contains data of 9,742 movies rated in the latest-small dataset.
Download size: 955.28 KiB
Dataset size: 910.64 KiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	9,742

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/100k-ratings

Config description: This dataset contains 100,000 ratings from 943 users on 1,682 movies. This dataset is the oldest version of the MovieLens dataset.

Each user has rated at least 20 movies. Ratings are in whole-star increments. This dataset contains demographic data of users in addition to data on movies and ratings.

Download size: 4.70 MiB
Dataset size: 32.41 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	100,000

Feature structure:

FeaturesDict({
    'bucketized_user_age': float32,
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'raw_user_age': float32,
    'timestamp': int64,
    'user_gender': bool,
    'user_id': string,
    'user_occupation_label': ClassLabel(shape=(), dtype=int64, num_classes=22),
    'user_occupation_text': string,
    'user_rating': float32,
    'user_zip_code': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
bucketized_user_age	Tensor		float32
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string
raw_user_age	Tensor		float32
timestamp	Tensor		int64
user_gender	Tensor		bool
user_id	Tensor		string
user_occupation_label	ClassLabel		int64
user_occupation_text	Tensor		string
user_rating	Tensor		float32
user_zip_code	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/100k-movies

Config description: This dataset contains data of 1,682 movies rated in the 100k dataset.
Download size: 4.70 MiB
Dataset size: 150.35 KiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	1,682

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/1m-ratings

Config description: This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in
This dataset is the largest dataset that includes demographic data.

Each user has rated at least 20 movies. Ratings are in whole-star increments. In demographic data, age values are divided into ranges and the lowest age value for each range is used in the data instead of the actual values.

Download size: 5.64 MiB
Dataset size: 308.42 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	1,000,209

Feature structure:

FeaturesDict({
    'bucketized_user_age': float32,
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_gender': bool,
    'user_id': string,
    'user_occupation_label': ClassLabel(shape=(), dtype=int64, num_classes=22),
    'user_occupation_text': string,
    'user_rating': float32,
    'user_zip_code': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
bucketized_user_age	Tensor		float32
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string
timestamp	Tensor		int64
user_gender	Tensor		bool
user_id	Tensor		string
user_occupation_label	ClassLabel		int64
user_occupation_text	Tensor		string
user_rating	Tensor		float32
user_zip_code	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/1m-movies

Config description: This dataset contains data of approximately 3,900 movies rated in the 1m dataset.
Download size: 5.64 MiB
Dataset size: 351.12 KiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	3,883

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string

Examples (tfds.as_dataframe):

movie_lens/20m-ratings

Config description: This dataset contains 20,000,263 ratings across 27,278 movies, created by 138,493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Each user has rated at least 20 movies. Ratings are in half-star increments. This dataset does not contain demographic data.

Download size: 189.50 MiB
Dataset size: 3.10 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	20,000,263

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
    'timestamp': int64,
    'user_id': string,
    'user_rating': float32,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string
timestamp	Tensor		int64
user_id	Tensor		string
user_rating	Tensor		float32

Examples (tfds.as_dataframe):

movie_lens/20m-movies

Config description: This dataset contains data of 27,278 movies rated in the 20m dataset.
Download size: 189.50 MiB
Dataset size: 2.55 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	27,278

Feature structure:

FeaturesDict({
    'movie_genres': Sequence(ClassLabel(shape=(), dtype=int64, num_classes=21)),
    'movie_id': string,
    'movie_title': string,
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
movie_genres	Sequence(ClassLabel)	(None,)	int64
movie_id	Tensor		string
movie_title	Tensor		string

Examples (tfds.as_dataframe):