cnn_dailymail
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of the highlights, with `<s>` and `</s>` around each highlight, which is the target summary
| Split          | Examples |
|----------------|----------|
| `'test'`       | 11,490   |
| `'train'`      | 287,113  |
| `'validation'` | 13,368   |
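As a quick sanity check on the split sizes above, the following sketch totals the three splits and prints each one's share of the whole. The dictionary name `SPLIT_SIZES` is only illustrative; the counts are copied from the table.

```python
# Example counts copied from the splits table.
SPLIT_SIZES = {"train": 287_113, "validation": 13_368, "test": 11_490}

# Total number of examples across all three splits.
total = sum(SPLIT_SIZES.values())

# Fraction of the dataset that each split represents.
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, share in shares.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>7,} {share:6.1%}")
print(f"{'total':<10} {total:>7,}")
```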
FeaturesDict({
'article': Text(shape=(), dtype=string),
'highlights': Text(shape=(), dtype=string),
'id': Text(shape=(), dtype=string),
'publisher': Text(shape=(), dtype=string),
})
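A minimal sketch of reading examples with this feature structure, assuming `tensorflow_datasets` is installed and that the first call is allowed to download and prepare the data. The helper names `decode_example` and `iter_examples` are hypothetical, not part of the TFDS API; `tfds.load` and `tfds.as_numpy` are the library calls used.

```python
from typing import Dict, Iterator

# The four keys listed in the FeaturesDict above.
FEATURE_KEYS = ("article", "highlights", "id", "publisher")


def decode_example(example: Dict[str, bytes]) -> Dict[str, str]:
    """Decode the UTF-8 byte strings of one example into Python strings."""
    return {key: example[key].decode("utf-8") for key in FEATURE_KEYS}


def iter_examples(split: str = "validation") -> Iterator[Dict[str, str]]:
    """Yield decoded examples from one split of cnn_dailymail."""
    # Imported lazily so decode_example can be used without TensorFlow.
    import tensorflow_datasets as tfds

    dataset = tfds.load("cnn_dailymail", split=split)
    # tfds.as_numpy yields dicts whose scalar string features are bytes.
    for example in tfds.as_numpy(dataset):
        yield decode_example(example)
```

For instance, `next(iter_examples("test"))` would return a dictionary with the four string fields of the first test example.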
| Feature    | Class        | Shape | Dtype  | Description |
|------------|--------------|-------|--------|-------------|
|            | FeaturesDict |       |        |             |
| article    | Text         |       | string |             |
| highlights | Text         |       | string |             |
| id         | Text         |       | string |             |
| publisher  | Text         |       | string |             |
@article{DBLP:journals/corr/SeeLM17,
author = {Abigail See and
Peter J. Liu and
Christopher D. Manning},
title = {Get To The Point: Summarization with Pointer-Generator Networks},
journal = {CoRR},
volume = {abs/1704.04368},
year = {2017},
url = {http://arxiv.org/abs/1704.04368},
archivePrefix = {arXiv},
eprint = {1704.04368},
timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
biburl = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
title={Teaching machines to read and comprehend},
author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
booktitle={Advances in neural information processing systems},
pages={1693--1701},
year={2015}
}
Last updated 2023-01-04 UTC.
- **Additional Documentation**:
  [Explore on Papers With Code](https://paperswithcode.com/dataset/cnn-daily-mail-1)

- **Homepage**:
  <https://github.com/abisee/cnn-dailymail>

- **Source code**:
  [`tfds.summarization.CnnDailymail`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/summarization/cnn_dailymail.py)

- **Versions**:

  - `1.0.0`: New split API (<https://tensorflow.org/datasets/splits>)
  - `2.0.0`: Separate target sentences with newline. (Having the model
    predict newline separators makes it easier to evaluate using
    summary-level ROUGE.)
  - `3.0.0`: Using cased version.
  - `3.1.0`: Removed BuilderConfig
  - `3.2.0`: Remove extra space before added sentence period. This shouldn't
    affect ROUGE scores because punctuation is removed.
  - `3.3.0`: Add publisher feature.
  - **`3.4.0`** (default): Add ID feature.

- **Download size**: `558.32 MiB`

- **Dataset size**: `1.29 GiB`

- **Auto-cached**
  ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):
  No

- **Supervised keys** (See
  [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):
  `('article', 'highlights')`

- **Figure**
  ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):
  Not supported.
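The `2.0.0` version note says target sentences are separated by newlines so that summary-level ROUGE is easier to compute. The sketch below assumes that convention holds for the `highlights` string and shows one way to recover the individual sentences. The function names are hypothetical, and no particular ROUGE library is implied.

```python
from typing import List


def split_highlights(highlights: str) -> List[str]:
    """Split a newline-joined highlights string into individual sentences."""
    # Drop empty lines and surrounding whitespace; keep sentence order.
    return [line.strip() for line in highlights.split("\n") if line.strip()]


def join_highlights(sentences: List[str]) -> str:
    """Join sentences back into the newline-separated summary format."""
    return "\n".join(sentences)
```

A summary-level scorer that expects one sentence per line could then be given `join_highlights(split_highlights(text))` for both the reference and the prediction.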