- Description:
This data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each language, we have selected 1,000 English sentences and their translations, if available. Please see the paper cited below (Artetxe and Schwenk, 2018) for a description of the languages, their families and scripts, as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are not directly comparable across languages.
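To see why scores are not directly comparable, it can help to measure how much of the English side two language pairs actually share. The following is a minimal sketch, not part of the dataset tooling: it assumes the English side is marked with the language code `en`, and the rows are invented toy data rather than real corpus entries.

```python
from typing import Dict, Iterable, Set

Record = Dict[str, str]


def english_sentences(records: Iterable[Record]) -> Set[str]:
    """Collect the English side of each pair, whichever field holds it."""
    found: Set[str] = set()
    for rec in records:
        # Assumption: English is tagged with the language code "en".
        if rec["source_language"] == "en":
            found.add(rec["source_sentence"])
        elif rec["target_language"] == "en":
            found.add(rec["target_sentence"])
    return found


def english_overlap(a: Iterable[Record], b: Iterable[Record]) -> float:
    """Jaccard overlap between the English sentence sets of two pairs."""
    ea, eb = english_sentences(a), english_sentences(b)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0


# Invented toy rows, for illustration only.
pair_de = [
    {"source_language": "de", "source_sentence": "Hallo.",
     "target_language": "en", "target_sentence": "Hello."},
    {"source_language": "de", "source_sentence": "Danke.",
     "target_language": "en", "target_sentence": "Thank you."},
]
pair_fr = [
    {"source_language": "fr", "source_sentence": "Bonjour.",
     "target_language": "en", "target_sentence": "Hello."},
    {"source_language": "fr", "source_sentence": "Au revoir.",
     "target_language": "en", "target_sentence": "Goodbye."},
]

overlap = english_overlap(pair_de, pair_fr)
```

An overlap well below 1.0 between two configs means their evaluation sets differ, so a gap in scores may reflect the sentences as much as the language.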
Homepage: http://opus.nlpl.eu/Tatoeba.php
Source code: tfds.datasets.tatoeba.Builder
Versions:
1.0.0 (default): Initial release.
Auto-cached (documentation): Yes
Feature structure:
FeaturesDict({
'source_language': Text(shape=(), dtype=string),
'source_sentence': Text(shape=(), dtype=string),
'target_language': Text(shape=(), dtype=string),
'target_sentence': Text(shape=(), dtype=string),
})
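As a rough illustration of the structure above, the sketch below loads one config and turns each record into a plain dictionary of Python strings. `decode_record` and `iter_records` are hypothetical helper names, not part of TFDS; the sample row is invented; and the loader assumes `tensorflow_datasets` is installed and that the config names follow the `tatoeba/tatoeba_<code>` pattern listed on this page.

```python
from typing import Dict, Iterator, Mapping, Union

# The four text features declared in the FeaturesDict.
FIELDS = (
    "source_language",
    "source_sentence",
    "target_language",
    "target_sentence",
)


def decode_record(raw: Mapping[str, Union[bytes, str]]) -> Dict[str, str]:
    """Convert one record to str values (string tensors arrive as UTF-8 bytes)."""
    out: Dict[str, str] = {}
    for name in FIELDS:
        value = raw[name]
        out[name] = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return out


def iter_records(config: str = "tatoeba_af") -> Iterator[Dict[str, str]]:
    """Yield decoded records from the single 'train' split of one config."""
    # Imported lazily so the helper above stays usable without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(f"tatoeba/{config}", split="train")
    for raw in tfds.as_numpy(ds):
        yield decode_record(raw)


# Invented sample row, shaped like a record from tfds.as_numpy.
sample = decode_record({
    "source_language": b"af",
    "source_sentence": b"Dankie.",
    "target_language": b"en",
    "target_sentence": b"Thank you.",
})
```

Calling `next(iter_records("tatoeba_de"))` would download the German config on first use and return one such dictionary.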
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| source_language | Text | | string | |
| source_sentence | Text | | string | |
| target_language | Text | | string | |
| target_sentence | Text | | string | |
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
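Since no supervised keys are defined, loading with `as_supervised=True` should not be expected to work for this dataset; (input, target) pairs have to be built from the feature dictionary by hand. A minimal sketch, assuming records have already been decoded to Python strings and using invented toy rows (`to_pairs` is a hypothetical helper name):

```python
from typing import Iterable, Iterator, Mapping, Tuple


def to_pairs(records: Iterable[Mapping[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (source_sentence, target_sentence) tuples from feature dicts."""
    for rec in records:
        yield rec["source_sentence"], rec["target_sentence"]


# Invented toy rows, for illustration only.
rows = [
    {"source_language": "es", "source_sentence": "Hola.",
     "target_language": "en", "target_sentence": "Hello."},
    {"source_language": "es", "source_sentence": "Gracias.",
     "target_language": "en", "target_sentence": "Thank you."},
]

pairs = list(to_pairs(rows))
```

On a `tf.data.Dataset`, the same projection can be written as `ds.map(lambda ex: (ex["source_sentence"], ex["target_sentence"]))`.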
Citation:
@article{tatoeba,
title={Massively Multilingual Sentence Embeddings for Zero-Shot
Cross-Lingual Transfer and Beyond},
author={Artetxe, Mikel and Schwenk, Holger},
journal={arXiv:1812.10464v2},
year={2018}
}
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
tatoeba/tatoeba_af (default config)
Download size: 58.24 KiB
Dataset size: 162.74 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_ar
Download size: 70.95 KiB
Dataset size: 175.46 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_bg
Download size: 99.88 KiB
Dataset size: 204.64 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_bn
Download size: 89.55 KiB
Dataset size: 194.24 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_de
Download size: 103.09 KiB
Dataset size: 207.93 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_el
Download size: 77.11 KiB
Dataset size: 181.65 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_es
Download size: 70.57 KiB
Dataset size: 175.12 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_et
Download size: 58.33 KiB
Dataset size: 162.85 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_eu
Download size: 64.52 KiB
Dataset size: 169.02 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_fa
Download size: 91.52 KiB
Dataset size: 196.15 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_fi
Download size: 73.90 KiB
Dataset size: 178.47 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_fr
Download size: 78.14 KiB
Dataset size: 182.68 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_he
Download size: 81.54 KiB
Dataset size: 186.15 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_hi
Download size: 119.69 KiB
Dataset size: 224.89 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_hu
Download size: 67.27 KiB
Dataset size: 171.78 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_id
Download size: 73.09 KiB
Dataset size: 177.61 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_it
Download size: 64.29 KiB
Dataset size: 168.81 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_ja
Download size: 90.90 KiB
Dataset size: 195.53 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_jv
Download size: 13.59 KiB
Dataset size: 35.01 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 205 |
tatoeba/tatoeba_ka
Download size: 70.47 KiB
Dataset size: 148.67 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 746 |
tatoeba/tatoeba_kk
Download size: 46.07 KiB
Dataset size: 106.25 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 575 |
tatoeba/tatoeba_ko
Download size: 77.28 KiB
Dataset size: 181.88 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_ml
Download size: 92.50 KiB
Dataset size: 165.14 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 687 |
tatoeba/tatoeba_mr
Download size: 98.19 KiB
Dataset size: 202.96 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_nl
Download size: 71.55 KiB
Dataset size: 176.10 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_pt
Download size: 73.42 KiB
Dataset size: 177.95 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_ru
Download size: 90.30 KiB
Dataset size: 194.92 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_sw
Download size: 19.99 KiB
Dataset size: 60.75 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 390 |
tatoeba/tatoeba_ta
Download size: 38.52 KiB
Dataset size: 70.93 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 307 |
tatoeba/tatoeba_te
Download size: 24.55 KiB
Dataset size: 49.07 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 234 |
tatoeba/tatoeba_th
Download size: 61.72 KiB
Dataset size: 119.32 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 548 |
tatoeba/tatoeba_tl
Download size: 66.54 KiB
Dataset size: 171.04 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_tr
Download size: 70.20 KiB
Dataset size: 174.70 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_ur
Download size: 86.63 KiB
Dataset size: 191.20 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_vi
Download size: 89.26 KiB
Dataset size: 193.89 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |
tatoeba/tatoeba_zh
Download size: 67.32 KiB
Dataset size: 171.85 KiB
Splits:
| Split | Examples |
|---|---|
| 'train' | 1,000 |