paws_wiki
Existing paraphrase identification datasets lack sentence pairs that have high
lexical overlap without being paraphrases. Models trained on such data fail to
distinguish pairs like *flights from New York to Florida* and *flights from
Florida to New York*. This dataset contains 108,463 human-labeled and 656k
noisily labeled pairs that highlight the importance of modeling structure,
context, and word order information for paraphrase identification.
For further details, see the accompanying paper: PAWS: Paraphrase Adversaries
from Word Scrambling at https://arxiv.org/abs/1904.01130
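To make the adversarial property concrete, the sketch below (plain Python, purely illustrative) shows that the two flight sentences share an identical bag of words, so any model that ignores word order must treat them as equally similar even though they are not paraphrases:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count tokens, ignoring case and word order."""
    return Counter(sentence.lower().split())

s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"

# Identical token multisets: an order-insensitive model cannot
# distinguish these sentences, yet they have different meanings.
assert bag_of_words(s1) == bag_of_words(s2)
```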
This corpus contains pairs generated from Wikipedia pages using both
word-swapping and back-translation methods. All pairs have human judgements on
both paraphrasing and fluency, and they are split into Train/Dev/Test sections.
All files are in TSV format with four columns:

1. `id`: a unique ID for each pair.
2. `sentence1`: the first sentence.
3. `sentence2`: the second sentence.
4. `(noisy_)label`: the (noisy) label for each pair.

Each label has two possible values: 0 indicates that the pair has a different
meaning, while 1 indicates that the pair is a paraphrase.
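As a sketch of the file layout, the four-column TSV can be read with Python's `csv` module. The rows below are invented for illustration and are not taken from the corpus:

```python
import csv
import io

# A tiny invented sample in the documented four-column TSV layout.
sample_tsv = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tFlights from New York to Florida .\tFlights from Florida to New York .\t0\n"
    "2\tHe said the film was great .\tThe film was great , he said .\t1\n"
)

with io.StringIO(sample_tsv) as f:
    reader = csv.DictReader(f, delimiter="\t")
    pairs = [(row["sentence1"], row["sentence2"], int(row["label"]))
             for row in reader]

for s1, s2, label in pairs:
    print(label, "|", s1, "|", s2)
```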
```python
FeaturesDict({
    'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=string),
    'sentence2': Text(shape=(), dtype=string),
})
```
| Feature   | Class        | Shape | Dtype  | Description |
|-----------|--------------|-------|--------|-------------|
|           | FeaturesDict |       |        |             |
| label     | ClassLabel   |       | int64  |             |
| sentence1 | Text         |       | string |             |
| sentence2 | Text         |       | string |             |
```bibtex
@InProceedings{paws2019naacl,
  title     = {{PAWS: Paraphrase Adversaries from Word Scrambling}},
  author    = {Zhang, Yuan and Baldridge, Jason and He, Luheng},
  booktitle = {Proc. of NAACL},
  year      = {2019}
}
```
paws_wiki/labeled_final_tokenized (default config)
Config description: Subset: labeled_final tokenized: True
Dataset size: 17.96 MiB
Auto-cached (documentation): Yes
Splits:

| Split          | Examples |
|----------------|----------|
| `'test'`       | 8,000    |
| `'train'`      | 49,401   |
| `'validation'` | 8,000    |
paws_wiki/labeled_final_raw
Config description: Subset: labeled_final tokenized: False
Dataset size: 17.57 MiB
Auto-cached (documentation): Yes
Splits:

| Split          | Examples |
|----------------|----------|
| `'test'`       | 8,000    |
| `'train'`      | 49,401   |
| `'validation'` | 8,000    |
paws_wiki/labeled_swap_tokenized
Config description: Subset: labeled_swap tokenized: True
Dataset size: 8.79 MiB
Auto-cached (documentation): Yes
Splits:

| Split     | Examples |
|-----------|----------|
| `'train'` | 30,397   |
paws_wiki/labeled_swap_raw
Config description: Subset: labeled_swap tokenized: False
Dataset size: 8.60 MiB
Auto-cached (documentation): Yes
Splits:

| Split     | Examples |
|-----------|----------|
| `'train'` | 30,397   |
paws_wiki/unlabeled_final_tokenized
Config description: Subset: unlabeled_final tokenized: True
Dataset size: 177.89 MiB
Auto-cached (documentation): Yes (validation), only when `shuffle_files=False` (train)
Splits:

| Split          | Examples |
|----------------|----------|
| `'train'`      | 645,652  |
| `'validation'` | 10,000   |
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.
Last updated 2022-12-15 UTC.