
reddit_disentanglement

Description:

This dataset contains ~3M messages from Reddit, each labeled with metadata. The task is to predict, for every message, the id of its parent message in the corresponding thread. Each record contains the list of messages from one thread. Duplicated and broken records have been removed from the dataset.

Features are:

id - message id
text - message text
author - message author
created_utc - message UTC timestamp
link_id - id of the post that the comment relates to

Target:

parent_id - id of the parent message in the current thread
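
The target can be read as a set of reply edges inside one thread. The following is a minimal sketch in plain Python, assuming each message is a dict carrying the `id` and `parent_id` fields listed above. The ids are made up for illustration, and `build_reply_tree` is a hypothetical helper name, not part of TFDS or of the original repository. Whether a top-level comment's `parent_id` points at the post or at something else is not stated here, so the sketch simply keeps whatever value is present.

```python
from collections import defaultdict


def build_reply_tree(thread):
    """Map each message to its parent, and each parent to its direct replies."""
    parent_of = {m["id"]: m["parent_id"] for m in thread}
    children_of = defaultdict(list)
    for m in thread:
        children_of[m["parent_id"]].append(m["id"])
    return parent_of, dict(children_of)


# Illustrative thread with made-up ids (not taken from the dataset).
thread = [
    {"id": "a", "parent_id": "post"},
    {"id": "b", "parent_id": "a"},
    {"id": "c", "parent_id": "a"},
]
parent_of, children_of = build_reply_tree(thread)
```

A model for this task would predict the values of `parent_of` from the remaining fields; `children_of` is the same information viewed from the parent's side.
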
Homepage: https://github.com/henghuiz/MaskedHierarchicalTransformer
Source code: tfds.datasets.reddit_disentanglement.Builder
Versions:
- 2.0.0 (default): No release notes.
Download size: Unknown size
Dataset size: Unknown size
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
Download the repository at https://github.com/henghuiz/MaskedHierarchicalTransformer, decompress raw_data.zip, and run generate_dataset.py with your Reddit API credentials. Then copy train.csv, val.csv and test.csv from the output directory into the manual folder.
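
Once the three CSV files are in place, loading might look like the sketch below. It first checks that the files named in the instructions are present, then calls `tfds.load` with a `DownloadConfig` pointing at the manual folder. `missing_manual_files` and `load_threads` are hypothetical helper names, and the split name `"train"` is an assumption, since this page does not list the splits.

```python
from pathlib import Path

REQUIRED_FILES = ("train.csv", "val.csv", "test.csv")
DEFAULT_MANUAL_DIR = Path.home() / "tensorflow_datasets" / "downloads" / "manual"


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR):
    """Return the required CSV files that are not yet in manual_dir."""
    manual_dir = Path(manual_dir)
    return [name for name in REQUIRED_FILES if not (manual_dir / name).is_file()]


def load_threads(split="train", manual_dir=DEFAULT_MANUAL_DIR):
    """Load one split; the split name is an assumption, not confirmed here."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {', '.join(missing)}")

    # Imported lazily so the file check above works without TFDS installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    return tfds.load(
        "reddit_disentanglement",
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Calling `load_threads()` before the files are copied raises `FileNotFoundError` with the list of missing files, rather than failing later inside the dataset builder.
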
Auto-cached (documentation): Unknown
Splits:

Split	Examples

Feature structure:

FeaturesDict({
    'thread': Sequence({
        'author': Text(shape=(), dtype=string),
        'created_utc': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
        'link_id': Text(shape=(), dtype=string),
        'parent_id': Text(shape=(), dtype=string),
        'text': Text(shape=(), dtype=string),
    }),
})
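
In TFDS, a `Sequence` wrapping a dict of features is exposed as a dict of equal-length sequences rather than a list of dicts, so `thread['id'][i]` and `thread['text'][i]` describe the same message. The sketch below regroups such an example into one dict per message. It assumes a NumPy-style example (for instance from `tfds.as_numpy`) whose string values arrive as UTF-8 bytes; `thread_to_messages` is a hypothetical helper name and the sample values are made up.

```python
FIELDS = ("author", "created_utc", "id", "link_id", "parent_id", "text")


def _to_str(value):
    """Decode bytes as UTF-8; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def thread_to_messages(thread):
    """Turn a dict of parallel sequences into a list of per-message dicts."""
    n = len(thread["id"])
    return [{f: _to_str(thread[f][i]) for f in FIELDS} for i in range(n)]


# Illustrative example in the columnar layout (values are made up).
example = {
    "thread": {
        "author": [b"alice", b"bob"],
        "created_utc": [b"1500000000", b"1500000060"],
        "id": [b"a", b"b"],
        "link_id": [b"post", b"post"],
        "parent_id": [b"post", b"a"],
        "text": [b"first comment", b"a reply"],
    }
}
messages = thread_to_messages(example["thread"])
```

Note that `created_utc` is stored as a string according to the feature structure, so it needs an explicit conversion before any numeric comparison of timestamps.
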

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
thread	Sequence
thread/author	Text	string
thread/created_utc	Text	string
thread/id	Text	string
thread/link_id	Text	string
thread/parent_id	Text	string
thread/text	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe): Missing.
Citation:

@article{zhu2019did,
  title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
  author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
  journal={arXiv preprint arXiv:1911.10666},
  year={2019}
}