media_sum

  • Description:

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Please restrict your usage of this dataset to research purposes only.

Please also cite our paper: MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization

Ethics

We have used only publicly available transcript data from the media sources and adhere to their research-only usage guideline.

As media and guests may have biased views, the transcripts and summaries will likely contain them. The content of the transcripts and summaries reflects only the views of the media and guests, and should be viewed with discretion.

  • Homepage: https://github.com/zcgzcgzcg1/MediaSum

  • Source code: tfds.summarization.media_sum.MediaSum

  • Versions:

    • 1.0.0 (default): Initial release.
  • Download size: Unknown size

  • Dataset size: 4.11 GiB

  • Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
    manual_dir should contain the files:

    • news_dialogue.json
    • train_val_test_split.json

The files can be downloaded and extracted from the dataset's GitHub page: https://github.com/zcgzcgzcg1/MediaSum/tree/main/data
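
The manual download step above can be checked before loading. The following is a minimal sketch, not part of the official documentation: the helper names `missing_manual_files` and `load_media_sum` are hypothetical, and the sketch assumes the two JSON files have already been extracted into the manual directory. It uses `tfds.load` with a `tfds.download.DownloadConfig` to point TFDS at that directory.

```python
from pathlib import Path

# File names listed in the manual download instructions.
REQUIRED_FILES = ("news_dialogue.json", "train_val_test_split.json")


def missing_manual_files(manual_dir):
    """Return the required file names that are absent from manual_dir."""
    root = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_media_sum(manual_dir="~/tensorflow_datasets/downloads/manual/", split="train"):
    """Load one split, failing early if the manual files are missing."""
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Missing in {manual_dir}: {', '.join(missing)}")

    # Imported lazily so the file check above works without TensorFlow installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(
        manual_dir=str(Path(manual_dir).expanduser())
    )
    return tfds.load(
        "media_sum",
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Calling `load_media_sum(split="val")` would then return the validation split as a `tf.data.Dataset`, provided both files are in place. Note that the split is named 'val', not 'validation'.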

  • Splits:

    Split      Examples
    'test'     10,000
    'train'    443,596
    'val'      10,000

  • Feature structure:
FeaturesDict({
    'date': Text(shape=(), dtype=tf.string),
    'id': Text(shape=(), dtype=tf.string),
    'program': Text(shape=(), dtype=tf.string),
    'speaker': Sequence(Text(shape=(), dtype=tf.string)),
    'summary': Text(shape=(), dtype=tf.string),
    'url': Text(shape=(), dtype=tf.string),
    'utt': Sequence(Text(shape=(), dtype=tf.string)),
})
  • Feature documentation:

    Feature    Class             Shape      Dtype        Description
               FeaturesDict
    date       Text                         tf.string
    id         Text                         tf.string
    program    Text                         tf.string
    speaker    Sequence(Text)    (None,)    tf.string
    summary    Text                         tf.string
    url        Text                         tf.string
    utt        Sequence(Text)    (None,)    tf.string

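
To illustrate how the two sequence features relate, here is a small sketch that renders one example as a readable dialogue. It assumes that 'speaker' and 'utt' are parallel sequences (the i-th speaker said the i-th utterance), which the feature table does not state explicitly. The helper names `_to_str` and `format_dialogue` are hypothetical, and the record below is a hand-written toy, not dataset content.

```python
def _to_str(value):
    """Decode bytes (as yielded by tfds.as_numpy) to str; pass str through."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def format_dialogue(example):
    """Render one example as 'SPEAKER: utterance' lines."""
    speakers = [_to_str(s) for s in example["speaker"]]
    utterances = [_to_str(u) for u in example["utt"]]
    # Assumption: both sequences are aligned turn by turn.
    if len(speakers) != len(utterances):
        raise ValueError("speaker and utt differ in length")
    return "\n".join(f"{s}: {u}" for s, u in zip(speakers, utterances))


# Hand-written toy record with the same keys as the feature structure.
toy = {
    "speaker": [b"HOST", b"GUEST"],
    "utt": [b"Welcome to the program.", b"Thanks for having me."],
    "summary": b"A host welcomes a guest.",
}
print(format_dialogue(toy))
```

The same function should work on examples obtained by iterating over `tfds.as_numpy(dataset)`, since text features arrive there as bytes.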
  • Citation:
@article{zhu2021mediasum,
  title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
  author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
  journal={arXiv preprint arXiv:2103.06410},
  year={2021}
}