Description:
This large-scale media interview dataset contains 463.6K transcripts with
abstractive summaries, collected from interview transcripts and overview / topic
descriptions from NPR and CNN.
Please restrict your usage of this dataset to research purposes only.
We have used only the publicly available transcript data from the media sources
and adhere to their only-for-research-purpose guideline.
As media and guests may have biased views, the transcripts and summaries will
likely contain them. The content of the transcripts and summaries reflects only
the views of the media and guests, and should be viewed with discretion.
Manual download instructions: This dataset requires you to
download the source data manually into download_config.manual_dir
(defaults to ~/tensorflow_datasets/downloads/manual/):
manual_dir should contain the files:
- news_dialogue.json
- train_val_test_split.json

The files can be downloaded and extracted from the dataset's GitHub page:
<https://github.com/zcgzcgzcg1/MediaSum/tree/main/data>

Please cite our paper:
**[MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization](https://arxiv.org/abs/2103.06410)**

- **Homepage**:
  <https://github.com/zcgzcgzcg1/MediaSum>

- **Source code**:
  [`tfds.datasets.media_sum.Builder`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/media_sum/media_sum_dataset_builder.py)

- **Versions**:

  - **`1.0.0`** (default): Initial release.

- **Download size**: `Unknown size`

- **Dataset size**: `4.11 GiB`

- **Auto-cached**
  ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
  No

- **Splits**:

| Split     | Examples |
|-----------|----------|
| `'test'`  | 10,000   |
| `'train'` | 443,596  |
| `'val'`   | 10,000   |

- **Feature structure**:

      FeaturesDict({
          'date': Text(shape=(), dtype=string),
          'id': Text(shape=(), dtype=string),
          'program': Text(shape=(), dtype=string),
          'speaker': Sequence(Text(shape=(), dtype=string)),
          'summary': Text(shape=(), dtype=string),
          'url': Text(shape=(), dtype=string),
          'utt': Sequence(Text(shape=(), dtype=string)),
      })

- **Feature documentation**:

| Feature | Class          | Shape   | Dtype  | Description |
|---------|----------------|---------|--------|-------------|
|         | FeaturesDict   |         |        |             |
| date    | Text           |         | string |             |
| id      | Text           |         | string |             |
| program | Text           |         | string |             |
| speaker | Sequence(Text) | (None,) | string |             |
| summary | Text           |         | string |             |
| url     | Text           |         | string |             |
| utt     | Sequence(Text) | (None,) | string |             |

- **Supervised keys** (see the
  [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):
  `('utt', 'summary')`

- **Figure**
  ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):
  Not supported.

- **Citation**:

      @article{zhu2021mediasum,
        title={MediaSum: A Large-scale Media Interview Dataset for Dialogue Summarization},
        author={Zhu, Chenguang and Liu, Yang and Mei, Jie and Zeng, Michael},
        journal={arXiv preprint arXiv:2103.06410},
        year={2021}
      }

Last updated 2022-12-14 UTC.
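The manual download step described above can be sketched in Python as follows. This is a minimal, non-authoritative sketch: the helper names `missing_manual_files` and `load_media_sum` are hypothetical and not part of TFDS, and it assumes the two required files (`news_dialogue.json` and `train_val_test_split.json`) sit directly inside `manual_dir`. The only TFDS calls it relies on are `tfds.load` and `tfds.download.DownloadConfig`.

```python
from pathlib import Path

# File names listed in the manual download instructions.
REQUIRED_FILES = ("news_dialogue.json", "train_val_test_split.json")


def missing_manual_files(manual_dir):
    """Return the required file names that are not present in manual_dir.

    Hypothetical helper for illustration; not part of TFDS.
    """
    root = Path(manual_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def load_media_sum(manual_dir="~/tensorflow_datasets/downloads/manual/", split="train"):
    """Load one split, pointing TFDS at the manually downloaded files.

    Hypothetical wrapper for illustration; assumes tensorflow_datasets is installed.
    """
    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"manual_dir is missing: {', '.join(missing)}")

    # Imported lazily so the file check above works without TensorFlow installed.
    import tensorflow_datasets as tfds

    config = tfds.download.DownloadConfig(
        manual_dir=str(Path(manual_dir).expanduser())
    )
    return tfds.load(
        "media_sum",
        split=split,
        download_and_prepare_kwargs={"download_config": config},
    )
```

Once both files are in place, `load_media_sum(split="val")` would return the validation split. Running the file check first surfaces a missing download as a short, explicit error before any dataset preparation starts.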