squad
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset
consisting of questions posed by crowdworkers on a set of Wikipedia articles.
The answer to each question is a segment of text, or span, from the
corresponding reading passage, or the question may be unanswerable.
Homepage: https://rajpurkar.github.io/SQuAD-explorer/
Source code: tfds.datasets.squad.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/squad/squad_dataset_builder.py)
Versions: 3.0.0 (default): Fixes issue with small number of examples (19) where answer spans are misaligned due to context white-space removal.
Supervised keys: None
Citation:
@article{2016arXiv160605250R,
author = { {Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev},
Konstantin and {Liang}, Percy},
title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
journal = {arXiv e-prints},
year = 2016,
eid = {arXiv:1606.05250},
pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
eprint = {1606.05250},
}
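squad/v1.1 (default config)
Config description: Version 1.1.0 of SQUAD
Download size: 33.51 MiB
Dataset size: 94.06 MiB
Auto-cached: Yes
Splits: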
| Split | Examples |
|----------------|----------|
| 'train' | 87,599 |
| 'validation' | 10,570 |
Feature structure:
FeaturesDict({
    'answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'context': Text(shape=(), dtype=string),
    'id': string,
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|----------------------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| answers | Sequence | | | |
| answers/answer_start | Tensor | | int32 | |
| answers/text | Text | | string | |
| context | Text | | string | |
| id | Tensor | | string | |
| question | Text | | string | |
| title | Text | | string | |
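As a minimal sketch of how `answers/answer_start` relates to `context`, the snippet below checks that each answer text can be recovered by slicing the context at its start offset. It assumes `answer_start` is a character offset into the decoded context string, and that a `Sequence` of features arrives as a dict of parallel lists. The record and the helper name `answer_spans` are hypothetical, made up for illustration rather than taken from the dataset.

```python
from typing import Dict, List, Tuple


def answer_spans(example: Dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) per answer, verifying each against the context."""
    context = example["context"]
    answers = example["answers"]
    spans = []
    for start, text in zip(answers["answer_start"], answers["text"]):
        end = start + len(text)
        # A mismatch here means the offset and the context are out of sync.
        if context[start:end] != text:
            raise ValueError(f"misaligned span at {start}: {text!r}")
        spans.append((start, end, text))
    return spans


# Hypothetical record shaped like the squad/v1.1 FeaturesDict,
# already decoded to plain Python str/int values.
example = {
    "id": "example-0",
    "title": "Example",
    "context": "The quick brown fox jumps over the lazy dog.",
    "question": "What does the fox jump over?",
    "answers": {"answer_start": [31], "text": ["the lazy dog"]},
}

print(answer_spans(example))  # [(31, 43, 'the lazy dog')]
```

If the strings are still raw bytes, decode them first; byte offsets and character offsets can differ for non-ASCII text.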
squad/v2.0
Config description: Version 2.0.0 of SQUAD
Download size: 44.34 MiB
Dataset size: 148.54 MiB
Auto-cached: Yes (validation), Only when shuffle_files=False (train)
Splits:
| Split | Examples |
|----------------|----------|
| 'train' | 130,319 |
| 'validation' | 11,873 |
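The configs and splits listed on this page could be loaded roughly as follows. This is a sketch, not an official recipe: `tfds.load` with its `split` and `shuffle_files` arguments is the real TFDS entry point, while `squad_name`, `load_squad`, and `EXPECTED_EXAMPLES` are hypothetical helpers introduced here. `shuffle_files` defaults to `False`, the setting under which the page says the v2.0 train split is auto-cached.

```python
# Split sizes copied from the tables on this page.
EXPECTED_EXAMPLES = {
    "v1.1": {"train": 87_599, "validation": 10_570},
    "v2.0": {"train": 130_319, "validation": 11_873},
}


def squad_name(config: str = "v1.1") -> str:
    """Build the TFDS dataset name for a SQuAD config, e.g. 'squad/v2.0'."""
    if config not in EXPECTED_EXAMPLES:
        raise ValueError(f"unknown config: {config!r}")
    return f"squad/{config}"


def load_squad(config: str = "v1.1", split: str = "train", shuffle_files: bool = False):
    """Load one split; the data is downloaded and prepared on first use."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(squad_name(config), split=split, shuffle_files=shuffle_files)


print(squad_name("v2.0"))  # squad/v2.0
```

A call such as `load_squad("v2.0", "validation")` would return a `tf.data.Dataset` of feature dicts; comparing its length against `EXPECTED_EXAMPLES` is one way to sanity-check a download.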
Feature structure:
FeaturesDict({
    'answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'context': Text(shape=(), dtype=string),
    'id': string,
    'is_impossible': bool,
    'plausible_answers': Sequence({
        'answer_start': int32,
        'text': Text(shape=(), dtype=string),
    }),
    'question': Text(shape=(), dtype=string),
    'title': Text(shape=(), dtype=string),
})
Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|--------------------------------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| answers | Sequence | | | |
| answers/answer_start | Tensor | | int32 | |
| answers/text | Text | | string | |
| context | Text | | string | |
| id | Tensor | | string | |
| is_impossible | Tensor | | bool | |
| plausible_answers | Sequence | | | |
| plausible_answers/answer_start | Tensor | | int32 | |
| plausible_answers/text | Text | | string | |
| question | Text | | string | |
| title | Text | | string | |
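The v2.0 config adds `is_impossible` and `plausible_answers` for unanswerable questions. One way to handle them, sketched below, is to treat an unanswerable question as having no reference answers and to ignore `plausible_answers` when scoring. This assumes that unanswerable records carry an empty `answers` sequence, as in the SQuAD 2.0 convention. The records and the helper name `reference_answers` are hypothetical.

```python
from typing import Dict, List


def reference_answers(example: Dict) -> List[str]:
    """Gold answer texts, or an empty list when the question is unanswerable."""
    if example["is_impossible"]:
        # plausible_answers are distractor spans, not gold answers.
        return []
    return list(example["answers"]["text"])


# Hypothetical records shaped like the squad/v2.0 FeaturesDict
# (context, question and title omitted for brevity).
answerable = {
    "id": "example-0",
    "is_impossible": False,
    "answers": {"answer_start": [31], "text": ["the lazy dog"]},
    "plausible_answers": {"answer_start": [], "text": []},
}
unanswerable = {
    "id": "example-1",
    "is_impossible": True,
    "answers": {"answer_start": [], "text": []},
    "plausible_answers": {"answer_start": [16], "text": ["fox"]},
}

print(reference_answers(answerable))    # ['the lazy dog']
print(reference_answers(unanswerable))  # []
```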
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License.
Last updated 2023-01-13 UTC.