- Description:
This dataset contains ~3M messages from Reddit. Every message is labeled with metadata. The task is to predict the id of each message's parent message in the corresponding thread. Each record contains a list of messages from one thread. Duplicated and broken records have been removed from the dataset.
Features are:
- id - message id
- text - message text
- author - message author
- created_utc - message UTC timestamp
- link_id - id of the post that the comment relates to
Target:
- parent_id - id of the parent message in the current thread
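To make the task concrete, here is a minimal sketch of scoring parent predictions for one thread. The helper `parent_accuracy`, the toy thread, and the "previous message" baseline are illustrative assumptions, not part of the dataset or of the authors' code; the ids are made up.

```python
from typing import Dict, List


def parent_accuracy(thread: List[Dict[str, str]], predicted: Dict[str, str]) -> float:
    """Fraction of messages whose predicted parent id equals the gold `parent_id`."""
    if not thread:
        return 0.0
    correct = sum(
        1 for msg in thread if predicted.get(msg["id"]) == msg["parent_id"]
    )
    return correct / len(thread)


# Made-up thread: ids and texts are illustrative, not real dataset values.
toy_thread = [
    {"id": "m1", "parent_id": "post", "text": "Top-level comment"},
    {"id": "m2", "parent_id": "m1", "text": "Reply to m1"},
    {"id": "m3", "parent_id": "m1", "text": "Another reply to m1"},
    {"id": "m4", "parent_id": "m3", "text": "Reply to m3"},
]

# Naive baseline: each message is assumed to reply to the one just before it.
baseline = {
    msg["id"]: (toy_thread[i - 1]["id"] if i > 0 else "post")
    for i, msg in enumerate(toy_thread)
}

accuracy = parent_accuracy(toy_thread, baseline)
```

In this toy thread the baseline is wrong only for `m3`, which replies to `m1` rather than to the preceding message `m2`.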
Homepage: https://github.com/henghuiz/MaskedHierarchicalTransformer
Source code: tfds.datasets.reddit_disentanglement.Builder
Versions:
2.0.0 (default): No release notes.
Download size: Unknown size
Dataset size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Download https://github.com/henghuiz/MaskedHierarchicalTransformer, decompress raw_data.zip, and run generate_dataset.py with your Reddit API credentials. Then put train.csv, val.csv and test.csv from the output directory into the manual folder.
Auto-cached (documentation): Unknown
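The last step of the manual instructions (moving the three CSV files into the manual folder) can be sketched as follows. The helper names `stage_manual_files` and `load_split` are hypothetical, and the default split name `"train"` is an assumption, since this page does not list the split names. `load_split` assumes `tensorflow_datasets` is installed and that the CSV files were already produced by generate_dataset.py.

```python
import shutil
from pathlib import Path

# File names taken from the manual download instructions.
EXPECTED_FILES = ("train.csv", "val.csv", "test.csv")


def stage_manual_files(output_dir, manual_dir):
    """Copy the three generated CSVs into the manual download folder."""
    output_dir, manual_dir = Path(output_dir), Path(manual_dir)
    missing = [name for name in EXPECTED_FILES if not (output_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {output_dir}: {', '.join(missing)}")
    manual_dir.mkdir(parents=True, exist_ok=True)
    return [
        Path(shutil.copy2(output_dir / name, manual_dir / name))
        for name in EXPECTED_FILES
    ]


def load_split(manual_dir, split="train"):
    """Load one split with tfds.load, pointing TFDS at the manual folder."""
    # Imported lazily so the staging helper works without TFDS installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        "reddit_disentanglement",
        split=split,  # assumption: split names are not listed on this page
        download_and_prepare_kwargs={"download_config": config},
    )
```

If the files are placed in the default location (~/tensorflow_datasets/downloads/manual/), passing a custom `DownloadConfig` should not be necessary.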
Splits:
| Split | Examples |
|---|---|
- Feature structure:
FeaturesDict({
'thread': Sequence({
'author': Text(shape=(), dtype=string),
'created_utc': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'link_id': Text(shape=(), dtype=string),
'parent_id': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
}),
})
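Because `thread` is a Sequence over a dictionary of features, TFDS exposes each example as a dictionary of parallel per-field sequences rather than a list of message records. The sketch below regroups such a dictionary into one record per message. The helper name `thread_to_messages` and the toy values are hypothetical; the byte-string handling assumes examples were read through `tfds.as_numpy`, where Text features arrive as bytes.

```python
from typing import Dict, List, Sequence, Union

# Field names taken from the feature structure above.
FIELDS = ("author", "created_utc", "id", "link_id", "parent_id", "text")


def _to_str(value: Union[bytes, str]) -> str:
    """Decode byte strings (as returned via tfds.as_numpy) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def thread_to_messages(thread: Dict[str, Sequence]) -> List[Dict[str, str]]:
    """Turn a dict of parallel per-field sequences into a list of message dicts."""
    n = len(thread["id"])
    return [
        {field: _to_str(thread[field][i]) for field in FIELDS}
        for i in range(n)
    ]


# Made-up example in the dict-of-sequences layout; values are illustrative only.
toy = {
    "author": [b"alice", b"bob"],
    "created_utc": [b"1500000000", b"1500000060"],
    "id": [b"m1", b"m2"],
    "link_id": [b"post", b"post"],
    "parent_id": [b"post", b"m1"],
    "text": [b"Top-level comment", b"Reply to m1"],
}

messages = thread_to_messages(toy)
```

Note that `created_utc` is stored as a string, so it needs an explicit conversion (for example `int(...)` or `float(...)`, depending on the actual format) before any time arithmetic.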
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| thread | Sequence | | | |
| thread/author | Text | | string | |
| thread/created_utc | Text | | string | |
| thread/id | Text | | string | |
| thread/link_id | Text | | string | |
| thread/parent_id | Text | | string | |
| thread/text | Text | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe): Missing.
Citation:
@article{zhu2019did,
title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
journal={arXiv preprint arXiv:1911.10666},
year={2019}
}