tatoeba

  • Description:

This data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.

For each language, we have selected 1,000 English sentences and their translations, where available. Please check the paper cited below (Artetxe and Schwenk, 2018) for a description of the languages, their families and scripts, as well as baseline results.

Please note that the English sentences are not identical for all language pairs. This means that the results are not directly comparable across languages.
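To see how far two language pairs actually share their English side, one can compare the sets of English sentences directly. The following is a minimal sketch, not part of the dataset tooling: the helper names (`english_sentences`, `english_overlap`) are hypothetical, the language code `"en"` for English is an assumption, and the toy records are invented for illustration rather than taken from the corpus.

```python
from typing import Dict, Iterable, Set


def english_sentences(examples: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect the English side of each pair, whichever field holds it.

    Assumes English is tagged with the language code "en".
    """
    sentences = set()
    for ex in examples:
        if ex["source_language"] == "en":
            sentences.add(ex["source_sentence"])
        elif ex["target_language"] == "en":
            sentences.add(ex["target_sentence"])
    return sentences


def english_overlap(a: Iterable[Dict[str, str]], b: Iterable[Dict[str, str]]) -> float:
    """Jaccard overlap (intersection over union) of two English sentence sets."""
    set_a, set_b = english_sentences(a), english_sentences(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy records, invented for illustration only.
pairs_de = [
    {"source_language": "de", "source_sentence": "Hallo.",
     "target_language": "en", "target_sentence": "Hello."},
    {"source_language": "de", "source_sentence": "Danke.",
     "target_language": "en", "target_sentence": "Thank you."},
]
pairs_fr = [
    {"source_language": "fr", "source_sentence": "Bonjour.",
     "target_language": "en", "target_sentence": "Hello."},
    {"source_language": "fr", "source_sentence": "Au revoir.",
     "target_language": "en", "target_sentence": "Goodbye."},
]

overlap = english_overlap(pairs_de, pairs_fr)  # one shared sentence out of three distinct
```

An overlap well below 1.0 between two configs is a reminder that scores for those two languages were measured on different English sentences.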

  • Feature structure:

FeaturesDict({
    'source_language': Text(shape=(), dtype=string),
    'source_sentence': Text(shape=(), dtype=string),
    'target_language': Text(shape=(), dtype=string),
    'target_sentence': Text(shape=(), dtype=string),
})
  • Feature documentation:

Feature          Class         Shape  Dtype   Description
                 FeaturesDict
source_language  Text                 string
source_sentence  Text                 string
target_language  Text                 string
target_sentence  Text                 string
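The four string features above can be read into plain Python dictionaries. The following is a minimal sketch, assuming this entry is consumed through TensorFlow Datasets, that the `tensorflow-datasets` package is installed, and that config names follow the `tatoeba/tatoeba_<lang>` pattern used in this listing. The helper names (`config_name`, `decode_example`, `load_pairs`) are hypothetical and not part of any library.

```python
def config_name(lang: str) -> str:
    """Build the config path for a language code, e.g. 'af' -> 'tatoeba/tatoeba_af'."""
    return f"tatoeba/tatoeba_{lang}"


def decode_example(example: dict) -> dict:
    """Decode byte-string feature values to Python str; leave other values untouched."""
    return {
        key: value.decode("utf-8") if isinstance(value, bytes) else value
        for key, value in example.items()
    }


def load_pairs(lang: str):
    """Yield decoded examples for one language config.

    The import is lazy so the helpers above work without TensorFlow Datasets;
    iterating this generator downloads the data on first use.
    """
    import tensorflow_datasets as tfds

    dataset = tfds.load(config_name(lang), split="train")
    for example in tfds.as_numpy(dataset):
        yield decode_example(example)
```

A typical use would be `for pair in load_pairs("af"): ...`, with each `pair` holding the four keys `source_language`, `source_sentence`, `target_language`, and `target_sentence`.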
  • Citation:

@article{tatoeba,
          title={Massively Multilingual Sentence Embeddings for Zero-Shot
                   Cross-Lingual Transfer and Beyond},
          author={Artetxe, Mikel and Schwenk, Holger},
          journal={arXiv:1812.10464v2},
          year={2018}
}

@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}

tatoeba/tatoeba_af (default config)

  • Download size: 58.24 KiB

  • Dataset size: 162.74 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_ar

  • Download size: 70.95 KiB

  • Dataset size: 175.46 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_bg

  • Download size: 99.88 KiB

  • Dataset size: 204.64 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_bn

  • Download size: 89.55 KiB

  • Dataset size: 194.24 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_de

  • Download size: 103.09 KiB

  • Dataset size: 207.93 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_el

  • Download size: 77.11 KiB

  • Dataset size: 181.65 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_es

  • Download size: 70.57 KiB

  • Dataset size: 175.12 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_et

  • Download size: 58.33 KiB

  • Dataset size: 162.85 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_eu

  • Download size: 64.52 KiB

  • Dataset size: 169.02 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_fa

  • Download size: 91.52 KiB

  • Dataset size: 196.15 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_fi

  • Download size: 73.90 KiB

  • Dataset size: 178.47 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_fr

  • Download size: 78.14 KiB

  • Dataset size: 182.68 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_he

  • Download size: 81.54 KiB

  • Dataset size: 186.15 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_hi

  • Download size: 119.69 KiB

  • Dataset size: 224.89 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_hu

  • Download size: 67.27 KiB

  • Dataset size: 171.78 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_id

  • Download size: 73.09 KiB

  • Dataset size: 177.61 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_it

  • Download size: 64.29 KiB

  • Dataset size: 168.81 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_ja

  • Download size: 90.90 KiB

  • Dataset size: 195.53 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_jv

  • Download size: 13.59 KiB

  • Dataset size: 35.01 KiB

  • Splits:

Split Examples
'train' 205

tatoeba/tatoeba_ka

  • Download size: 70.47 KiB

  • Dataset size: 148.67 KiB

  • Splits:

Split Examples
'train' 746

tatoeba/tatoeba_kk

  • Download size: 46.07 KiB

  • Dataset size: 106.25 KiB

  • Splits:

Split Examples
'train' 575

tatoeba/tatoeba_ko

  • Download size: 77.28 KiB

  • Dataset size: 181.88 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_ml

  • Download size: 92.50 KiB

  • Dataset size: 165.14 KiB

  • Splits:

Split Examples
'train' 687

tatoeba/tatoeba_mr

  • Download size: 98.19 KiB

  • Dataset size: 202.96 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_nl

  • Download size: 71.55 KiB

  • Dataset size: 176.10 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_pt

  • Download size: 73.42 KiB

  • Dataset size: 177.95 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_ru

  • Download size: 90.30 KiB

  • Dataset size: 194.92 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_sw

  • Download size: 19.99 KiB

  • Dataset size: 60.75 KiB

  • Splits:

Split Examples
'train' 390

tatoeba/tatoeba_ta

  • Download size: 38.52 KiB

  • Dataset size: 70.93 KiB

  • Splits:

Split Examples
'train' 307

tatoeba/tatoeba_te

  • Download size: 24.55 KiB

  • Dataset size: 49.07 KiB

  • Splits:

Split Examples
'train' 234

tatoeba/tatoeba_th

  • Download size: 61.72 KiB

  • Dataset size: 119.32 KiB

  • Splits:

Split Examples
'train' 548

tatoeba/tatoeba_tl

  • Download size: 66.54 KiB

  • Dataset size: 171.04 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_tr

  • Download size: 70.20 KiB

  • Dataset size: 174.70 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_ur

  • Download size: 86.63 KiB

  • Dataset size: 191.20 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_vi

  • Download size: 89.26 KiB

  • Dataset size: 193.89 KiB

  • Splits:

Split Examples
'train' 1,000

tatoeba/tatoeba_zh

  • Download size: 67.32 KiB

  • Dataset size: 171.85 KiB

  • Splits:

Split Examples
'train' 1,000
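
Most configs listed above contain 1,000 examples, but eight contain fewer, which matters when averaging results over languages. The following sketch tallies the 'train' split sizes from this listing; the constant and function names are hypothetical, and the counts are copied from the per-config entries above.

```python
FULL_SIZE = 1000      # size of a complete 'train' split
TOTAL_CONFIGS = 36    # number of tatoeba/tatoeba_* configs in the listing

# Configs whose 'train' split has fewer than 1,000 examples.
SHORT_CONFIGS = {
    "jv": 205, "ka": 746, "kk": 575, "ml": 687,
    "sw": 390, "ta": 307, "te": 234, "th": 548,
}


def total_examples() -> int:
    """Sum the 'train' examples over all configs."""
    full_configs = TOTAL_CONFIGS - len(SHORT_CONFIGS)
    return full_configs * FULL_SIZE + sum(SHORT_CONFIGS.values())


def smallest_config() -> str:
    """Return the language code with the fewest examples."""
    return min(SHORT_CONFIGS, key=SHORT_CONFIGS.get)
```

Scores on the smaller configs rest on far fewer sentence pairs, so a plain mean over languages weights each example in those configs more heavily than an example in a 1,000-pair config.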