salient_span_wikipedia
Wikipedia sentences with labeled salient spans.
Homepage: https://www.tensorflow.org/datasets/catalog/salient_span_wikipedia
Source code: tfds.datasets.salient_span_wikipedia.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/salient_span_wikipedia/salient_span_wikipedia_dataset_builder.py)
Versions: 1.0.0 (default): No release notes.
Download size: Unknown size
Auto-cached: No
Supervised keys: None
Citation:
@article{guu2020realm,
title={REALM: Retrieval-Augmented Language Model Pre-Training},
author={Kelvin Guu and Kenton Lee and Zora Tung and Panupong Pasupat and Ming-Wei Chang},
year={2020},
journal = {arXiv e-prints},
archivePrefix = {arXiv},
eprint={2002.08909},
}
salient_span_wikipedia/sentences (default config)
Config description: Examples are individual sentences containing entities.
Dataset size: 20.57 GiB
| Split | Examples |
|-----------|------------|
| 'train' | 82,291,706 |
FeaturesDict({
    'spans': Sequence({
        'limit': int32,
        'start': int32,
        'type': string,
    }),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
| Feature | Class | Shape | Dtype | Description |
|-------------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| spans | Sequence | | | |
| spans/limit | Tensor | | int32 | |
| spans/start | Tensor | | int32 | |
| spans/type | Tensor | | string | |
| text | Text | | string | |
| title | Text | | string | |
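To illustrate how the `spans` feature relates to `text`, here is a minimal sketch that recovers the labeled span strings from a single example. It rests on assumptions the page does not state: that `start` and `limit` are byte offsets into the UTF-8 encoding of `text`, and that `limit` is exclusive. The helper name `extract_spans`, the sample record, and the type labels in it are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Tuple


def extract_spans(example: Dict) -> List[Tuple[str, str]]:
    """Return (span_text, span_type) pairs for one example.

    Assumption (not confirmed by the catalog page): `start` and `limit`
    are byte offsets into the UTF-8 encoding of `text`, `limit` exclusive.
    """
    text = example["text"]
    # Accept both Python strings and raw bytes.
    raw = text.encode("utf-8") if isinstance(text, str) else text
    spans = example["spans"]
    results = []
    # A Sequence of dict features is exposed as a dict of parallel sequences.
    for start, limit, span_type in zip(spans["start"], spans["limit"], spans["type"]):
        results.append((raw[start:limit].decode("utf-8"), span_type))
    return results


# Hypothetical record shaped like the feature structure above (not real data).
example = {
    "title": "Ada Lovelace",
    "text": "Ada Lovelace was born in London in 1815.",
    "spans": {
        "start": [0, 25, 35],
        "limit": [12, 31, 39],
        "type": ["PERSON", "GPE", "DATE"],
    },
}

for span_text, span_type in extract_spans(example):
    print(span_type, span_text)
```

If the offsets turn out to be character offsets rather than byte offsets, slicing the string directly instead of its encoded bytes would be the corresponding change.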
salient_span_wikipedia/documents
Config description: Examples are full documents.
Dataset size: 16.52 GiB
| Split | Examples |
|-----------|------------|
| 'train' | 13,353,718 |
FeaturesDict({
    'sentences': Sequence({
        'limit': int32,
        'start': int32,
    }),
    'spans': Sequence({
        'limit': int32,
        'start': int32,
        'type': string,
    }),
    'text': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
| Feature | Class | Shape | Dtype | Description |
|-----------------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| sentences | Sequence | | | |
| sentences/limit | Tensor | | int32 | |
| sentences/start | Tensor | | int32 | |
| spans | Sequence | | | |
| spans/limit | Tensor | | int32 | |
| spans/start | Tensor | | int32 | |
| spans/type | Tensor | | string | |
| text | Text | | string | |
| title | Text | | string | |
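In the `documents` config, `sentences` and `spans` both index into the same `text`. The sketch below shows one way to cut a document into sentences and attach to each sentence the spans that fall inside it. As before, it assumes byte offsets into the UTF-8 text with an exclusive `limit`, which the page does not confirm. The function name `spans_by_sentence` and the sample document are hypothetical.

```python
from typing import Dict, List


def spans_by_sentence(doc: Dict) -> List[Dict]:
    """Group labeled spans under the sentence that contains them.

    Assumption: all offsets are byte offsets into UTF-8 `text`,
    with `limit` exclusive.
    """
    text = doc["text"]
    raw = text.encode("utf-8") if isinstance(text, str) else text
    spans = list(
        zip(doc["spans"]["start"], doc["spans"]["limit"], doc["spans"]["type"])
    )
    grouped = []
    for s_start, s_limit in zip(doc["sentences"]["start"], doc["sentences"]["limit"]):
        # Keep only spans fully contained in this sentence.
        inside = [
            {"text": raw[a:b].decode("utf-8"), "type": t}
            for a, b, t in spans
            if s_start <= a and b <= s_limit
        ]
        grouped.append(
            {"sentence": raw[s_start:s_limit].decode("utf-8"), "spans": inside}
        )
    return grouped


# Hypothetical document shaped like the feature structure above (not real data).
doc = {
    "title": "Ada Lovelace",
    "text": "Ada Lovelace was born in London. She died in 1852.",
    "sentences": {"start": [0, 33], "limit": [32, 50]},
    "spans": {
        "start": [0, 25, 45],
        "limit": [12, 31, 49],
        "type": ["PERSON", "GPE", "DATE"],
    },
}

for item in spans_by_sentence(doc):
    print(item["sentence"], item["spans"])
```

The containment check is linear in the number of spans per sentence, which is simple to read; for long documents, a single merged pass over sorted offsets would avoid the repeated scan.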
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License.
Last updated 2022-12-23 UTC.