reddit_disentanglement

  • Description:

This dataset contains ~3M messages from Reddit. Every message is labeled with metadata. The task is to predict, for each message, the id of its parent message in the corresponding thread. Each record contains the list of messages from one thread. Duplicated and broken records have been removed from the dataset.
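The task can be framed as choosing, for every message, the position of its parent inside the same thread. The following is a minimal sketch of deriving such labels, not the authors' preprocessing: the helper `parent_positions` and the toy ids are hypothetical, and it assumes that `parent_id` values are directly comparable to `id` values (if the real data carries prefixes on either field, they would need to be normalized first).

```python
from typing import Dict, List, Optional


def parent_positions(messages: List[Dict[str, str]]) -> List[Optional[int]]:
    """For each message, return the index of its parent within the thread.

    Returns None for a message whose parent is not part of the thread
    (for example, a top-level comment that replies to the post itself).
    Assumption: `parent_id` and `id` use the same id format.
    """
    position_by_id = {m["id"]: i for i, m in enumerate(messages)}
    return [position_by_id.get(m["parent_id"]) for m in messages]


# Toy thread with made-up ids, for illustration only.
thread = [
    {"id": "a1", "parent_id": "post0", "text": "First comment"},
    {"id": "a2", "parent_id": "a1", "text": "Reply to the first comment"},
    {"id": "a3", "parent_id": "a1", "text": "Another reply"},
    {"id": "a4", "parent_id": "a3", "text": "Nested reply"},
]

labels = parent_positions(thread)
print(labels)  # [None, 0, 0, 2]
```

A model for this task would be trained to reproduce these positions from the message text and metadata alone.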

Features are:

  • id - message id
  • text - message text
  • author - message author
  • created_utc - message UTC timestamp
  • link_id - id of the post that the comment relates to

Target:

  • parent_id - id of the parent message in the thread

  • Feature structure:
FeaturesDict({
    'thread': Sequence({
        'author': Text(shape=(), dtype=string),
        'created_utc': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
        'link_id': Text(shape=(), dtype=string),
        'parent_id': Text(shape=(), dtype=string),
        'text': Text(shape=(), dtype=string),
    }),
})
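Because `thread` is a `Sequence` over a dictionary of features, a record is typically delivered as a dictionary of parallel arrays (one array per field) rather than as a list of per-message dictionaries. The sketch below shows one way to regroup such a record into messages. The helper `thread_to_messages` and the toy values are hypothetical, and the example assumes string fields arrive as UTF-8 bytes, as they usually do when a dataset is iterated as NumPy.

```python
from typing import Dict, List, Sequence

# Field names taken from the feature structure above.
FIELDS = ("author", "created_utc", "id", "link_id", "parent_id", "text")


def _as_str(value) -> str:
    """Decode bytes to str; leave other values as their string form."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def thread_to_messages(thread: Dict[str, Sequence]) -> List[Dict[str, str]]:
    """Turn a columnar thread (field -> values) into a list of message dicts."""
    lengths = {len(thread[field]) for field in FIELDS}
    if len(lengths) != 1:
        raise ValueError("thread fields have unequal lengths")
    n_messages = lengths.pop()
    return [
        {field: _as_str(thread[field][i]) for field in FIELDS}
        for i in range(n_messages)
    ]


# Toy record with made-up values, shaped like one dataset example.
example = {
    "thread": {
        "author": [b"alice", b"bob"],
        "created_utc": [b"1500000000", b"1500000060"],
        "id": [b"a1", b"a2"],
        "link_id": [b"post0", b"post0"],
        "parent_id": [b"post0", b"a1"],
        "text": [b"First comment", b"Reply"],
    }
}

messages = thread_to_messages(example["thread"])
print(messages[1]["parent_id"])  # a1
```

Keeping every value as a string mirrors the declared dtypes; converting `created_utc` to a number is left to the caller, since its exact format is not stated here.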
  • Feature documentation:
Feature             Class         Shape  Dtype   Description
                    FeaturesDict
thread              Sequence
thread/author       Text                 string
thread/created_utc  Text                 string
thread/id           Text                 string
thread/link_id      Text                 string
thread/parent_id    Text                 string
thread/text         Text                 string
  • Citation:
@article{zhu2019did,
  title={Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer},
  author={Zhu, Henghui and Nan, Feng and Wang, Zhiguo and Nallapati, Ramesh and Xiang, Bing},
  journal={arXiv preprint arXiv:1911.10666},
  year={2019}
}