
tflite_model_maker.recommendation.DataLoader

Recommendation data loader.

dataset tf.data.Dataset for recommendation.
size int, dataset size.
vocab list of dict, each vocab item is described above.

size Returns the size of the dataset.

Note that this function may return None because the exact size of the dataset isn't required to create an instance of this class, and tf.data.Dataset doesn't provide a way to get the length directly, since it is lazily loaded and may be infinite. In most cases, however, when an instance of this class is created by helper functions like 'from_folder', the size of the dataset is computed during preprocessing, and this function can return an int representing the size of the dataset.
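Because size may be None, callers that need a count can fall back to a full pass over the data. A minimal plain-Python sketch of that fallback (the helper name dataset_size is hypothetical, and a generator stands in for a lazily loaded tf.data.Dataset; the fallback is only safe for finite datasets):

```python
def dataset_size(loader_size, dataset):
    """Return the cached size when known, otherwise count the (finite) dataset."""
    if loader_size is not None:
        return loader_size
    # Fall back to iterating once; never do this on an infinite dataset.
    return sum(1 for _ in dataset)

# A generator standing in for a lazily loaded, finite dataset.
examples = (i for i in range(5))
print(dataset_size(None, examples))  # counted lazily: 5
print(dataset_size(1000, []))        # cached size is trusted: 1000
```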

Methods

download_and_extract_movielens

View source

Downloads and extracts the movielens dataset, then returns the extracted directory.

from_movielens

View source

Generates data loader from movielens dataset.

The method downloads and prepares the dataset, then generates examples for train/eval.

For the movielens data format, see:

Args
data_dir str, path to dataset containing (unzipped) text data.
data_tag str, specify dataset in {'train', 'test'}.
input_spec InputSpec, specify data format for input and embedding.
generated_examples_dir str, path to generate preprocessed examples. (default: same as data_dir)
min_timeline_length int, min timeline length to split train/eval set.
max_context_length int, max context length as one input.
max_context_movie_genre_length int, max context length of movie genre as one input.
min_rating int or None, include examples with min rating.
train_data_fraction float, percentage of training data [0.0, 1.0].
build_vocabs boolean, whether to build vocabs.
train_filename str, generated file name for training data.
test_filename str, generated file name for test data.
vocab_filename str, generated file name for vocab data.
meta_filename str, generated file name for meta data.

Returns
Data Loader.
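The train_data_fraction argument controls what share of the generated examples lands in the training set. A hedged sketch of that partitioning in plain Python (split_examples is hypothetical; the real preprocessing's shuffling and file-writing details may differ):

```python
import random

def split_examples(examples, train_data_fraction, seed=0):
    """Shuffle examples and split them into (train, test) by fraction."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    cutoff = int(len(examples) * train_data_fraction)
    return examples[:cutoff], examples[cutoff:]

train, test = split_examples(range(100), train_data_fraction=0.9)
print(len(train), len(test))  # 90 10
```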

gen_dataset

View source

Generates the dataset, and overrides the default drop_remainder = True.
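With drop_remainder = True, a final batch smaller than the batch size is discarded. A minimal plain-Python illustration of that batching behavior (tf.data implements this natively via Dataset.batch; this standalone batch helper is only for illustration):

```python
def batch(items, batch_size, drop_remainder=True):
    """Group items into fixed-size batches, optionally dropping a partial last batch."""
    items = list(items)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # the remainder batch is smaller than batch_size
    return batches

print(batch(range(10), 4))                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(batch(range(10), 4, drop_remainder=False))  # keeps the partial batch [8, 9]
```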

generate_movielens_dataset

View source

Generates the movielens dataset, and returns a dict containing its metadata.

Args
data_dir str, path to dataset containing (unzipped) text data.
generated_examples_dir str, path to generate preprocessed examples. (default: same as data_dir)
train_filename str, generated file name for training data.
test_filename str, generated file name for test data.
vocab_filename str, generated file name for vocab data.
meta_filename str, generated file name for meta data.
min_timeline_length int, min timeline length to split train/eval set.
max_context_length int, max context length as one input.
max_context_movie_genre_length int, max context length of movie genre as one input.
min_rating int or None, include examples with min rating.
train_data_fraction float, percentage of training data [0.0, 1.0].
build_vocabs boolean, whether to build vocabs.

Returns
Dict, metadata for the movielens dataset, containing keys: `train_file`, `train_size`, `test_file`, `test_size`, `vocab_file`, `vocab_size`, etc.

get_num_classes

View source

Gets number of classes.

0 is reserved. The number of classes is max id + 1; e.g., if max id = 100, then the classes are [0, 100], i.e., 101 classes in total.

Args
meta dict, containing meta['vocab_max_id'].

Returns
Number of classes.
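Given the rule above (class ids run from 0 through max id), the computation reduces to a one-liner. A sketch assuming meta carries 'vocab_max_id' as documented:

```python
def get_num_classes(meta):
    """Number of classes is max id + 1, since id 0 is reserved."""
    return meta['vocab_max_id'] + 1

print(get_num_classes({'vocab_max_id': 100}))  # 101
```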

load_vocab

View source

Loads vocab from file.

The vocab file should be in JSON format: a list of lists of size 4, where each inner list's elements are ordered as [id (int), title (str), genres (str joined with '|'), count (int)]. It is generated when preparing the movielens dataset.

Args
vocab_file str, path to vocab file.

Returns
vocab, an OrderedDict mapping id to item. Each item represents a movie: { 'id': int, 'title': str, 'genres': list[str], 'count': int }
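The file format described above can be parsed with the standard json module. A hedged standalone sketch (load_vocab_from_text is hypothetical, not the library's implementation), splitting the genres field on '|':

```python
import collections
import json

def load_vocab_from_text(vocab_json):
    """Parse a JSON list of [id, title, genres, count] rows into an OrderedDict."""
    vocab = collections.OrderedDict()
    for movie_id, title, genres, count in json.loads(vocab_json):
        vocab[movie_id] = {
            'id': movie_id,
            'title': title,
            'genres': genres.split('|'),  # 'Animation|Comedy' -> ['Animation', 'Comedy']
            'count': count,
        }
    return vocab

sample = '[[1, "Toy Story (1995)", "Animation|Comedy", 50]]'
vocab = load_vocab_from_text(sample)
print(vocab[1]['genres'])  # ['Animation', 'Comedy']
```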

split

View source

Splits dataset into two sub-datasets with the given fraction.

Primarily used for splitting the dataset into training and testing sets.

Args
fraction float, the fraction of the original data to include in the first returned sub-dataset.

Returns
The two split sub-datasets.
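For a dataset of known size, the split reduces to taking the first fraction of elements and skipping the rest. A plain-Python sketch of those semantics (a tf.data implementation would use Dataset.take and Dataset.skip; this split helper is only illustrative):

```python
from itertools import islice

def split(items, fraction, size):
    """Split into two sub-lists; the first gets `fraction` of the `size` elements."""
    cutoff = int(size * fraction)
    it = iter(items)
    first = list(islice(it, cutoff))  # analogous to dataset.take(cutoff)
    second = list(it)                 # analogous to dataset.skip(cutoff)
    return first, second

train, test = split(range(10), fraction=0.8, size=10)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```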

__len__

View source