- Description:
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
Additional Documentation: Explore on Papers With Code
Source code: tfds.datasets.imdb_reviews.Builder
Versions:
1.0.0 (default): New split API (https://tensorflow.org/datasets/splits)
Supervised keys (See as_supervised doc): ('text', 'label')
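As a rough illustration of the supervised keys: with `as_supervised=True`, each example comes back as a `(text, label)` tuple instead of a feature dictionary. The sketch below mirrors that mapping in plain Python. `to_supervised` and `load_supervised` are hypothetical helper names, not part of TFDS, and `load_supervised` assumes `tensorflow_datasets` is installed.

```python
SUPERVISED_KEYS = ("text", "label")  # taken from the dataset card


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dict into an (input, target) tuple, in key order."""
    return tuple(example[key] for key in keys)


def load_supervised(split="train"):
    """Load a split as (text, label) pairs. Hypothetical wrapper around tfds.load."""
    # Imported lazily so to_supervised works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load("imdb_reviews", split=split, as_supervised=True)


# Plain-Python illustration with a made-up example record.
pair = to_supervised({"label": 1, "text": b"A wonderful film."})
print(pair)
```

The tuple order follows the supervised keys, so the text comes first and the label second.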
Figure (tfds.show_examples): Not supported.
Citation:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
imdb_reviews/plain_text (default config)
Config description: Plain text
Download size: 80.23 MiB
Dataset size: 129.83 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'test' | 25,000 |
'train' | 25,000 |
'unsupervised' | 50,000 |
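The card lists no validation split, so one common approach is to carve one out of 'train' with the percentage slicing syntax from the split API. The following is a minimal sketch under that assumption. `holdout_specs` and `load_holdout` are hypothetical helper names, and `load_holdout` assumes `tensorflow_datasets` is installed.

```python
def holdout_specs(train_pct=80, split="train"):
    """Return two split strings that partition `split` by percentage."""
    if not 0 < train_pct < 100:
        raise ValueError("train_pct must be strictly between 0 and 100")
    return f"{split}[:{train_pct}%]", f"{split}[{train_pct}%:]"


def load_holdout(train_pct=80):
    """Load a train/validation pair. Hypothetical wrapper around tfds.load."""
    # Imported lazily so holdout_specs works without TFDS installed.
    import tensorflow_datasets as tfds

    train_spec, val_spec = holdout_specs(train_pct)
    return tfds.load(
        "imdb_reviews", split=[train_spec, val_spec], as_supervised=True
    )


train_spec, val_spec = holdout_specs(80)
print(train_spec, val_spec)
```

With 25,000 training examples, an 80/20 partition should yield 20,000 and 5,000 examples.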
- Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'text': Text(shape=(), dtype=string),
})
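The feature structure stores the label as an integer with two classes. The sketch below shows one way to turn those integers into readable names. It rests on two assumptions that the card does not state: that the class names are 'neg' and 'pos' in that order, and that examples in the 'unsupervised' split carry a label of -1. `label_name` is a hypothetical helper; with TFDS installed, the `int2str` method of the `ClassLabel` feature reports the actual names.

```python
# Assumption: class names and their order are not given in the card.
LABEL_NAMES = ("neg", "pos")
# Assumption: unlabeled examples are marked with -1.
UNLABELED = -1


def label_name(label):
    """Map an integer class id to a readable name."""
    if label == UNLABELED:
        return "unlabeled"
    if label not in range(len(LABEL_NAMES)):
        raise ValueError(f"unexpected label id: {label}")
    return LABEL_NAMES[label]


print([label_name(i) for i in (0, 1, -1)])
```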
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
label | ClassLabel | | int64 | |
text | Text | | string | |
imdb_reviews/bytes
Config description: Uses byte-level text encoding with tfds.deprecated.text.ByteTextEncoder
Download size: Unknown size
Dataset size: Unknown size
Auto-cached (documentation): Unknown
Splits:
Split | Examples |
---|---|
- Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'text': Text(shape=(None,), dtype=int64, encoder=<ByteTextEncoder vocab_size=257>),
})
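To make the `vocab_size=257` figure concrete, here is a minimal byte-level encoding sketch. It assumes that the 257 ids are the 256 possible byte values shifted up by one, with id 0 reserved for padding. This is an illustration of the idea in plain Python, not the library's implementation, and `encode_bytes` and `decode_bytes` are hypothetical names.

```python
PAD_ID = 0            # assumption: id 0 is reserved for padding
VOCAB_SIZE = 256 + 1  # 256 byte values plus the padding id


def encode_bytes(text):
    """Encode text as UTF-8 bytes, shifting each byte by 1 to keep 0 free."""
    return [byte + 1 for byte in text.encode("utf-8")]


def decode_bytes(ids):
    """Invert encode_bytes, skipping any padding ids."""
    return bytes(i - 1 for i in ids if i != PAD_ID).decode("utf-8")


ids = encode_bytes("Good!")
print(ids)
print(decode_bytes(ids + [PAD_ID, PAD_ID]))
```

Because every id is a byte value plus one, the largest possible id is 256, which fits a vocabulary of 257 entries.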
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
label | ClassLabel | | int64 | |
text | Text | (None,) | int64 | |
imdb_reviews/subwords8k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 8k vocab size
Download size: Unknown size
Dataset size: Unknown size
Auto-cached (documentation): Unknown
Splits:
Split | Examples |
---|---|
- Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'text': Text(shape=(None,), dtype=int64),
})
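The `shape=(None,)` entry means each encoded review is a variable-length sequence of integer ids, so sequences need padding to a common length before they can be batched. The sketch below shows the idea in plain Python. `pad_batch` is a hypothetical helper, and the use of 0 as the padding id is an assumption; in a `tf.data` pipeline, `Dataset.padded_batch` serves the same purpose.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Made-up token ids standing in for three encoded reviews.
batch = pad_batch([[12, 7, 301], [45], [9, 88]])
print(batch)
```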
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
label | ClassLabel | | int64 | |
text | Text | (None,) | int64 | |
imdb_reviews/subwords32k
Config description: Uses tfds.deprecated.text.SubwordTextEncoder with 32k vocab size
Download size: Unknown size
Dataset size: Unknown size
Auto-cached (documentation): Unknown
Splits:
Split | Examples |
---|---|
- Feature structure:
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'text': Text(shape=(None,), dtype=int64),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
label | ClassLabel | | int64 | |
text | Text | (None,) | int64 | |