scientific_papers
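The scientific papers dataset contains two sets of long, structured documents.
The documents are obtained from the ArXiv and PubMed OpenAccess repositories.
Both "arxiv" and "pubmed" have three features:
article: the body of the document, paragraphs separated by "\n".
abstract: the abstract of the document, paragraphs separated by "\n".
section_names: titles of sections, separated by "\n".
Feature structure: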
FeaturesDict({
'abstract': Text(shape=(), dtype=string),
'article': Text(shape=(), dtype=string),
'section_names': Text(shape=(), dtype=string),
})
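Because each feature is a single string, a consumer usually splits it back into paragraphs or section titles. The following is a minimal sketch, assuming the separator is the newline character "\n". The `example` record and the `split_fields` helper are hypothetical and not part of TFDS.

```python
def split_fields(example):
    """Split the newline-joined text fields of one record into lists."""
    return {
        # Drop empty strings produced by blank lines or a trailing newline.
        key: [part for part in example[key].split("\n") if part.strip()]
        for key in ("abstract", "article", "section_names")
    }


# Hypothetical record with the same three keys as the FeaturesDict above.
example = {
    "abstract": "we study long documents .",
    "article": "first paragraph .\nsecond paragraph .\n",
    "section_names": "introduction\nmethod",
}

fields = split_fields(example)
print(len(fields["article"]), fields["section_names"])
```

Values read through TFDS typically arrive as byte strings, so they would need `.decode("utf-8")` before being passed to a helper like this one.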
Feature documentation:

| Feature       | Class        | Shape | Dtype  | Description |
|---------------|--------------|-------|--------|-------------|
|               | FeaturesDict |       |        |             |
| abstract      | Text         |       | string |             |
| article       | Text         |       | string |             |
| section_names | Text         |       | string |             |

@article{Cohan_2018,
title={A Discourse-Aware Attention Model for Abstractive Summarization of
Long Documents},
url={http://dx.doi.org/10.18653/v1/n18-2097},
DOI={10.18653/v1/n18-2097},
journal={Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language
Technologies, Volume 2 (Short Papers)},
publisher={Association for Computational Linguistics},
author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
year={2018}
}
scientific_papers/arxiv (default config)

| Split          | Examples |
|----------------|----------|
| 'test'         | 6,440    |
| 'train'        | 203,037  |
| 'validation'   | 6,436    |

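A config is selected by appending its name to the dataset name. The sketch below assumes the `tensorflow_datasets` package is installed and that the data is read with `tfds.load`. The `config_name` and `load_config` helpers are hypothetical names used for illustration. The first load downloads and prepares the data, so the import is deferred until a load is actually requested.

```python
CONFIGS = ("arxiv", "pubmed")


def config_name(config="arxiv"):
    """Build the 'dataset/config' name that selects one config."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    return f"scientific_papers/{config}"


def load_config(config="arxiv", split="validation"):
    """Load one split; triggers download and preparation on first use."""
    import tensorflow_datasets as tfds  # deferred: heavy optional dependency

    return tfds.load(config_name(config), split=split)


print(config_name("arxiv"))   # scientific_papers/arxiv
print(config_name("pubmed"))  # scientific_papers/pubmed
```

Starting with the validation or test split keeps a first experiment small, since each holds only a few thousand examples.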
scientific_papers/pubmed

| Split          | Examples |
|----------------|----------|
| 'test'         | 6,658    |
| 'train'        | 119,924  |
| 'validation'   | 6,633    |

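The split counts above can be turned into totals and proportions with a few lines of arithmetic. This is a small sketch that copies the counts from the two tables; the `SPLITS` dictionary and the `split_shares` helper are hypothetical names, not part of TFDS.

```python
# Example counts copied from the split tables of the two configs.
SPLITS = {
    "arxiv": {"test": 6_440, "train": 203_037, "validation": 6_436},
    "pubmed": {"test": 6_658, "train": 119_924, "validation": 6_633},
}


def split_shares(counts):
    """Return the total example count and each split's share of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


for config, counts in SPLITS.items():
    total, shares = split_shares(counts)
    formatted = {name: f"{share:.1%}" for name, share in shares.items()}
    print(config, total, formatted)
```

The totals come to 215,913 examples for arxiv and 133,215 for pubmed, with roughly 94% and 90% of them in the train split respectively.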
Last updated 2022-12-23 UTC.
Additional documentation: Explore on Papers With Code (https://paperswithcode.com/dataset/arxiv-summarization-dataset)
Homepage: https://github.com/armancohan/long-summarization
Source code: tfds.datasets.scientific_papers.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/scientific_papers/scientific_papers_dataset_builder.py)
Versions: 1.1.0 (no release notes); 1.1.1 (default, no release notes)
Download size: 4.20 GiB
Auto-cached (https://www.tensorflow.org/datasets/performances#auto-caching): No
Supervised keys (see the as_supervised doc, https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args): ('article', 'abstract')
Figure (tfds.show_examples, https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples): Not supported.
scientific_papers/arxiv config: documents from the ArXiv repository. Dataset size: 7.07 GiB.
scientific_papers/pubmed config: documents from the PubMed repository. Dataset size: 2.34 GiB.
Examples for both configs can be displayed with tfds.as_dataframe (https://www.tensorflow.org/datasets/api_docs/python/tfds/as_dataframe).
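The supervised keys `('article', 'abstract')` mean that a supervised view of the data pairs each article body (input) with its abstract (target). The sketch below illustrates that pairing on a made-up record and shows how the same pairs might be requested from TFDS, assuming `tensorflow_datasets` is installed. The `to_supervised` and `load_pairs` helpers and the `record` dictionary are hypothetical.

```python
SUPERVISED_KEYS = ("article", "abstract")


def to_supervised(example):
    """Turn a feature dict into an (input, target) pair."""
    input_key, target_key = SUPERVISED_KEYS
    return example[input_key], example[target_key]


def load_pairs(split="validation"):
    """Return an iterable of (article, abstract) pairs read through TFDS."""
    import tensorflow_datasets as tfds  # deferred: needs the prepared dataset

    ds = tfds.load("scientific_papers/arxiv", split=split, as_supervised=True)
    return tfds.as_numpy(ds)


# Hypothetical record, used only to show the pairing.
record = {"article": "body text", "abstract": "summary", "section_names": "intro"}
article, abstract = to_supervised(record)
print(article, "->", abstract)
```

With `as_supervised=True` the `section_names` feature is not part of the returned pairs, so code that needs section titles should load the full feature dictionary instead.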