paws_x_wiki

  • Description:

This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages:

  • French
  • Spanish
  • German
  • Chinese
  • Japanese
  • Korean

For further details, see the accompanying paper: PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification at https://arxiv.org/abs/1908.11828

As in the PAWS dataset, examples are split into train, dev, and test sections. All files are in TSV format with four columns:

  1. id: A unique id for each pair.
  2. sentence1: The first sentence.
  3. sentence2: The second sentence.
  4. (noisy_)label: (Noisy) label for each pair.

Each label takes one of two values: 0 indicates that the two sentences have different meanings, while 1 indicates that they are paraphrases.
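The four-column layout described above can be read with Python's standard `csv` module. This is a minimal sketch, not the dataset's official loader: the names `PawsPair`, `read_pairs`, and `SAMPLE` are hypothetical, the presence of a header row is an assumption, and the sample rows are made up for illustration rather than taken from the dataset.

```python
import csv
import io
from typing import Iterator, NamedTuple


class PawsPair(NamedTuple):
    pair_id: str
    sentence1: str
    sentence2: str
    label: int  # 0 = different meaning, 1 = paraphrase


def read_pairs(tsv_file) -> Iterator[PawsPair]:
    """Yield one PawsPair per well-formed data row of a four-column TSV file."""
    # Assumption: quote characters inside sentences are literal text,
    # so CSV-style quoting is switched off.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # Assumption: the first row is a header and is skipped.
    for row in reader:
        if len(row) != 4:
            continue  # Skip malformed rows rather than guess at their content.
        pair_id, sentence1, sentence2, label = row
        yield PawsPair(pair_id, sentence1, sentence2, int(label))


# Made-up rows for illustration only; not taken from the dataset.
SAMPLE = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tThe cat chased the dog.\tThe dog chased the cat.\t0\n"
    "2\tShe arrived before noon.\tBefore noon, she arrived.\t1\n"
)

pairs = list(read_pairs(io.StringIO(SAMPLE)))
paraphrases = [p for p in pairs if p.label == 1]
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `open(path, encoding="utf-8", newline="")`, where `path` points at one of the TSV files.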

  • Feature structure:

FeaturesDict({
    'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=string),
    'sentence2': Text(shape=(), dtype=string),
})
  • Feature documentation:

Feature      Class         Shape  Dtype   Description
             FeaturesDict
label        ClassLabel           int64
sentence1    Text                 string
sentence2    Text                 string
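As a rough illustration of how these features might be consumed, the sketch below loads one config with `tfds.load` and prints a few pairs. It assumes the `tensorflow_datasets` package is installed and that the data can be downloaded; the helper names `config_name`, `load_split`, and `preview` are hypothetical and not part of the library.

```python
LANGS = ("de", "en", "es", "fr", "ja", "ko", "zh")


def config_name(lang: str) -> str:
    """Build a 'paws_x_wiki/<lang>' config name for one of the listed configs."""
    if lang not in LANGS:
        raise ValueError(f"unknown language code: {lang!r}")
    return f"paws_x_wiki/{lang}"


def load_split(lang: str = "de", split: str = "train"):
    """Load one split as a tf.data.Dataset of feature dictionaries."""
    # Imported lazily so config_name() works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(config_name(lang), split=split)


def preview(lang: str = "de", split: str = "validation", n: int = 3) -> None:
    """Print the label and both sentences of the first n examples."""
    for example in load_split(lang, split).take(n):
        label = int(example["label"].numpy())
        # Text features come back as byte strings and need decoding.
        sentence1 = example["sentence1"].numpy().decode("utf-8")
        sentence2 = example["sentence2"].numpy().decode("utf-8")
        print(label, sentence1, sentence2, sep="\t")
```

Calling `preview("fr")` would download the French config on first use and print three validation pairs. Nothing is loaded until one of the functions is called.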
  • Citation:

@InProceedings{pawsx2019emnlp,
  title = { {PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification} },
  author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
  booktitle = {Proc. of EMNLP},
  year = {2019}
}

paws_x_wiki/de (default config)

  • Config description: Translated to de

  • Dataset size: 15.27 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,380
'validation'   2,000

paws_x_wiki/en

  • Config description: Translated to en

  • Dataset size: 14.59 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,175
'validation'   2,000

paws_x_wiki/es

  • Config description: Translated to es

  • Dataset size: 15.27 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,401
'validation'   1,961

paws_x_wiki/fr

  • Config description: Translated to fr

  • Dataset size: 15.79 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,399
'validation'   1,988

paws_x_wiki/ja

  • Config description: Translated to ja

  • Dataset size: 17.77 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,401
'validation'   2,000

paws_x_wiki/ko

  • Config description: Translated to ko

  • Dataset size: 16.42 MiB

  • Splits:

Split          Examples
'test'         1,999
'train'        49,164
'validation'   2,000

paws_x_wiki/zh

  • Config description: Translated to zh

  • Dataset size: 13.20 MiB

  • Splits:

Split          Examples
'test'         2,000
'train'        49,401
'validation'   2,000
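The per-config split sizes listed above can be tallied with a few lines of Python. The counts are copied from the listings on this page. The names `SPLITS`, `totals`, and `translated_train` are hypothetical, and treating `en` as the source language rather than one of the six translated languages is an assumption based on the description.

```python
# Split sizes copied from the per-config listings on this page.
SPLITS = {
    "de": {"test": 2000, "train": 49380, "validation": 2000},
    "en": {"test": 2000, "train": 49175, "validation": 2000},
    "es": {"test": 2000, "train": 49401, "validation": 1961},
    "fr": {"test": 2000, "train": 49399, "validation": 1988},
    "ja": {"test": 2000, "train": 49401, "validation": 2000},
    "ko": {"test": 1999, "train": 49164, "validation": 2000},
    "zh": {"test": 2000, "train": 49401, "validation": 2000},
}

# Total number of examples in each config, across all three splits.
totals = {lang: sum(counts.values()) for lang, counts in SPLITS.items()}

# Training examples summed over the six translated configs
# (assumption: "en" is the source language and is excluded).
translated_train = sum(
    counts["train"] for lang, counts in SPLITS.items() if lang != "en"
)

for lang, total in totals.items():
    print(f"paws_x_wiki/{lang}: {total:,} examples")
print(f"translated training examples: {translated_train:,}")
```

Summed this way, the six translated configs contain 296,146 training examples, slightly fewer than the 296,406 quoted in the description; the page does not explain the difference.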