- Description:
Reddit dataset, where TIFU denotes the subreddit /r/tifu. As defined in the publication, the "short" style uses the title as the summary and the "long" style uses the tldr as the summary.
Features include:
- documents: post text without tldr.
- tldr: tldr line.
- title: trimmed title without tldr.
- ups: upvotes.
- score: score.
- num_comments: number of comments.
- upvote_ratio: upvote ratio.
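The relationship between style and summary field can be sketched in plain Python. This is a minimal illustration, not part of the dataset's own code: `SUMMARY_FIELD`, `to_pair`, and the example record are hypothetical, and the `documents` key follows the feature structure of this entry.

```python
# Summary field per style, as described above: "short" uses the title,
# "long" uses the tldr line.
SUMMARY_FIELD = {"short": "title", "long": "tldr"}


def to_pair(example, style):
    """Return (document, summary) for one example under the given style."""
    if style not in SUMMARY_FIELD:
        raise ValueError(f"unknown style: {style!r}")
    return example["documents"], example[SUMMARY_FIELD[style]]


# Made-up record for illustration only; it is not taken from the dataset.
example = {
    "documents": "today i locked my keys in the car while it was running.",
    "title": "locking my keys in the car",
    "tldr": "left the engine on and locked myself out.",
}

short_pair = to_pair(example, "short")  # summary is the title
long_pair = to_pair(example, "long")    # summary is the tldr line
```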
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/ctr4si/MMN
Source code: tfds.datasets.reddit_tifu.Builder
Versions:
- 1.1.0: Remove empty document and summary strings.
- 1.1.1: Add train, dev and test (80/10/10) splits, which are used in PEGASUS (https://arxiv.org/abs/1912.08777), in a separate config. These were created randomly using the tfds split function and are being released to ensure that results on Reddit Tifu Long are reproducible and comparable. Also add the id field to the datapoints.
- 1.1.2 (default): Corrected splits uploaded.
Feature structure:
FeaturesDict({
'documents': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'num_comments': float32,
'score': float32,
'title': Text(shape=(), dtype=string),
'tldr': Text(shape=(), dtype=string),
'ups': float32,
'upvote_ratio': float32,
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| documents | Text | | string | |
| id | Text | | string | |
| num_comments | Tensor | | float32 | |
| score | Tensor | | float32 | |
| title | Text | | string | |
| tldr | Text | | string | |
| ups | Tensor | | float32 | |
| upvote_ratio | Tensor | | float32 | |
Figure (tfds.show_examples): Not supported.
Citation:
@misc{kim2018abstractive,
title={Abstractive Summarization of Reddit Posts with Multi-level Memory Networks},
author={Byeongchang Kim and Hyunwoo Kim and Gunhee Kim},
year={2018},
eprint={1811.00783},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
reddit_tifu/short (default config)
Config description: Using title as summary.
Download size: 639.54 MiB
Dataset size: 141.46 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:
| Split | Examples |
|---|---|
| 'train' | 79,740 |

Supervised keys (see the as_supervised doc): ('documents', 'title')
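With supervised keys defined, each config can be loaded as (document, summary) pairs. The following is a sketch assuming the `tensorflow_datasets` package is installed; `CONFIGS`, `dataset_name`, and `load_pairs` are hypothetical helper names introduced here, while `tfds.load` and its `split` and `as_supervised` arguments are the library's own.

```python
# Config names as listed in this catalog entry.
CONFIGS = ("short", "long", "long_split")


def dataset_name(config):
    """Build the 'reddit_tifu/<config>' name that tfds.load expects."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return f"reddit_tifu/{config}"


def load_pairs(config="short", split="train"):
    """Load one split as (document, summary) tensor pairs.

    The import is deferred so that defining this helper does not require
    TensorFlow Datasets; the first call downloads and prepares the data.
    """
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name(config), split=split, as_supervised=True)
```

For example, `for document, summary in load_pairs("short").take(1): print(summary.numpy())` would print one title. Note that the first call fetches the full download listed for the config.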
reddit_tifu/long
Config description: Using TLDR as summary.
Download size: 639.54 MiB
Dataset size: 93.10 MiB
Auto-cached (documentation): Yes
Splits:
| Split | Examples |
|---|---|
| 'train' | 42,139 |

Supervised keys (see the as_supervised doc): ('documents', 'tldr')
reddit_tifu/long_split
Config description: Using TLDR as summary and returning train/test/dev splits.
Download size: 639.94 MiB
Dataset size: 93.10 MiB
Auto-cached (documentation): Yes
Splits:
| Split | Examples |
|---|---|
| 'test' | 4,214 |
| 'train' | 33,711 |
| 'validation' | 4,214 |

Supervised keys (see the as_supervised doc): ('documents', 'tldr')
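The long_split counts are consistent with an 80/10/10 partition of the 42,139 examples in reddit_tifu/long. The sketch below checks the sizes only; the rounding rule (cumulative boundaries rounded to the nearest example) is an assumption made here, and since the released splits were drawn randomly, it says nothing about which examples fall in which split. `split_sizes` is a hypothetical helper.

```python
TOTAL = 42_139  # number of examples in reddit_tifu/long


def split_sizes(total, percents=(80, 10, 10)):
    """Sizes of consecutive percentage slices of `total` examples.

    Assumes each cumulative boundary is rounded to the nearest example.
    """
    sizes, start, cumulative = [], 0, 0
    for percent in percents:
        cumulative += percent
        end = round(total * cumulative / 100)
        sizes.append(end - start)
        start = end
    return sizes


train, validation, test = split_sizes(TOTAL)
print(train, validation, test)
```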