paws_x_wiki
This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406
machine-translated training pairs in six typologically distinct languages:
- French
- Spanish
- German
- Chinese
- Japanese
- Korean
For further details, see the accompanying paper: PAWS-X: A Cross-lingual
Adversarial Dataset for Paraphrase Identification at
https://arxiv.org/abs/1908.11828
As in the PAWS dataset, examples are split into train, dev, and test sections.
All files are in TSV format with four columns:
- id: A unique id for each pair.
- sentence1: The first sentence.
- sentence2: The second sentence.
- (noisy_)label: (Noisy) label for each pair.
Each label has two possible values: 0 indicates that the two sentences have
different meanings, while 1 indicates that they are paraphrases.
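The four-column layout described above can be read with Python's standard `csv` module. The following is a minimal sketch, not the dataset's official loader: the sample rows are invented for illustration, and the presence of a header row naming the columns is an assumption about the files.

```python
import csv
import io

# Hypothetical two-row sample in the four-column layout (not real dataset rows).
# Assumption: the file starts with a header row naming the columns.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tThe cat sat on the mat.\tOn the mat sat the cat.\t1\n"
    "2\tAnna called Ben.\tBen called Anna.\t0\n"
)


def read_pairs(handle):
    """Yield one dict per sentence pair, with the label decoded to a bool."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # The label column is named either "label" or "noisy_label".
        label_key = "label" if "label" in row else "noisy_label"
        yield {
            "id": row["id"],
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            # 1 means paraphrase, 0 means different meaning.
            "is_paraphrase": int(row[label_key]) == 1,
        }


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
for pair in pairs:
    print(pair["id"], pair["is_paraphrase"])
```

To read a real file, pass an open file handle (opened with `encoding="utf-8"` and `newline=""`) in place of the `io.StringIO` object.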
FeaturesDict({
'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
'sentence1': Text(shape=(), dtype=string),
'sentence2': Text(shape=(), dtype=string),
})
| Feature   | Class        | Shape | Dtype  | Description |
|-----------|--------------|-------|--------|-------------|
|           | FeaturesDict |       |        |             |
| label     | ClassLabel   |       | int64  |             |
| sentence1 | Text         |       | string |             |
| sentence2 | Text         |       | string |             |
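As an illustration of how the three features above might be consumed, here is a minimal sketch that loads one language config through `tfds.load` and decodes each example. The helper names `config_name` and `load_pairs` are hypothetical, and the code assumes the `tensorflow_datasets` package is installed and that the data can be downloaded on first use.

```python
# Language configs listed on this page.
KNOWN_LANGUAGES = ("de", "en", "es", "fr", "ja", "ko", "zh")


def config_name(language: str) -> str:
    """Build the dataset name for one language config, e.g. 'paws_x_wiki/de'."""
    if language not in KNOWN_LANGUAGES:
        raise ValueError(f"unknown language config: {language!r}")
    return f"paws_x_wiki/{language}"


def load_pairs(language: str = "de", split: str = "validation"):
    """Yield (sentence1, sentence2, label) tuples for one config and split."""
    # Imported lazily so the helper above works without the heavy dependency.
    import tensorflow_datasets as tfds

    dataset = tfds.load(config_name(language), split=split)
    for example in tfds.as_numpy(dataset):
        # Text features arrive as UTF-8 bytes; the label is an integer class id.
        yield (
            example["sentence1"].decode("utf-8"),
            example["sentence2"].decode("utf-8"),
            int(example["label"]),
        )


print(config_name("de"))
```

Calling `next(load_pairs("fr", "test"))` would return the first decoded pair of the French test split, downloading and preparing the data if it is not already cached.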
@InProceedings{pawsx2019emnlp,
title = { {PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification} },
author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
booktitle = {Proc. of EMNLP},
year = {2019}
}
paws_x_wiki/de (default config)
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,380   |
| 'validation' | 2,000    |
paws_x_wiki/en
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,175   |
| 'validation' | 2,000    |
paws_x_wiki/es
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,401   |
| 'validation' | 1,961    |
paws_x_wiki/fr
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,399   |
| 'validation' | 1,988    |
paws_x_wiki/ja
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,401   |
| 'validation' | 2,000    |
paws_x_wiki/ko
| Split        | Examples |
|--------------|----------|
| 'test'       | 1,999    |
| 'train'      | 49,164   |
| 'validation' | 2,000    |
paws_x_wiki/zh
| Split        | Examples |
|--------------|----------|
| 'test'       | 2,000    |
| 'train'      | 49,401   |
| 'validation' | 2,000    |
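The per-config split sizes listed above can be tallied with a few lines of Python. This is only a sketch for cross-checking the tables: the counts are copied from this page, and the function name `totals` and the choice to leave out the `en` config (so that only the six translated languages are counted) are assumptions made for illustration.

```python
# Split sizes copied from the per-config tables on this page.
SPLITS = {
    "de": {"test": 2000, "train": 49380, "validation": 2000},
    "en": {"test": 2000, "train": 49175, "validation": 2000},
    "es": {"test": 2000, "train": 49401, "validation": 1961},
    "fr": {"test": 2000, "train": 49399, "validation": 1988},
    "ja": {"test": 2000, "train": 49401, "validation": 2000},
    "ko": {"test": 1999, "train": 49164, "validation": 2000},
    "zh": {"test": 2000, "train": 49401, "validation": 2000},
}


def totals(splits, exclude=("en",)):
    """Return (training pairs, evaluation pairs) summed over the kept configs."""
    kept = [counts for lang, counts in splits.items() if lang not in exclude]
    train = sum(counts["train"] for counts in kept)
    # Evaluation pairs are the validation and test splits together.
    evaluation = sum(counts["validation"] + counts["test"] for counts in kept)
    return train, evaluation


train_total, eval_total = totals(SPLITS)
print(f"train: {train_total:,}  evaluation: {eval_total:,}")
```

Totals computed this way from the tables need not equal the headline figures in the description; the page does not state how those figures were counted.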
Last updated 2022-12-15 UTC.