- Description:
This data is extracted from the Tatoeba corpus, dated Saturday 2018/11/17.
For each language, we selected 1,000 English sentences and their translations, if available. See the Artetxe and Schwenk paper cited below for a description of the languages, their families and scripts, as well as baseline results.
Please note that the English sentences are not identical for all language pairs. This means that the results are not directly comparable across languages.
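To make this caveat concrete, the following sketch measures how many English sentences two language pairs share. It assumes the examples have already been decoded to Python strings and that English is marked with the language code `en`. Neither the code nor the side holding the English sentence is stated here, so the helper checks both sides. The function names and the toy data are hypothetical.

```python
def english_sentences(examples):
    """Collect the English side of each example, whichever field holds it."""
    english = set()
    for ex in examples:
        if ex["source_language"] == "en":  # assumption: English is coded as "en"
            english.add(ex["source_sentence"])
        elif ex["target_language"] == "en":
            english.add(ex["target_sentence"])
    return english


def shared_fraction(examples_a, examples_b):
    """Fraction of the smaller English set that also appears in the other set."""
    a = english_sentences(examples_a)
    b = english_sentences(examples_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Toy data for illustration only; these are not sentences from the dataset.
pair_a = [
    {"source_language": "af", "source_sentence": "Hallo",
     "target_language": "en", "target_sentence": "Hello"},
    {"source_language": "af", "source_sentence": "Dankie",
     "target_language": "en", "target_sentence": "Thanks"},
]
pair_b = [
    {"source_language": "de", "source_sentence": "Hallo",
     "target_language": "en", "target_sentence": "Hello"},
    {"source_language": "de", "source_sentence": "Tschüss",
     "target_language": "en", "target_sentence": "Goodbye"},
]
print(shared_fraction(pair_a, pair_b))
```

A fraction well below 1.0 for two configs would indicate that their scores are computed on largely different English sentences.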
Homepage: http://opus.nlpl.eu/Tatoeba.php
Source code:
tfds.datasets.tatoeba.Builder
Versions:
1.0.0 (default): Initial release.
Auto-cached (documentation): Yes
Feature structure:
FeaturesDict({
'source_language': Text(shape=(), dtype=string),
'source_sentence': Text(shape=(), dtype=string),
'target_language': Text(shape=(), dtype=string),
'target_sentence': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | | | | |
source_language | Text | | string | |
source_sentence | Text | | string | |
target_language | Text | | string | |
target_sentence | Text | | string | |
Supervised keys (See as_supervised doc): None
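Since no supervised keys are defined, each example arrives as a dictionary with the four text features listed above. The sketch below shows one way to turn such examples into (source, target) string pairs. `tfds.load` and `tfds.as_numpy` are existing TensorFlow Datasets functions; the helper names `to_pair` and `load_pairs` are hypothetical, and the loader assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
def to_pair(example):
    """Decode one example dict (bytes or str values) into (source, target) strings."""
    def as_text(value):
        # tfds.as_numpy yields scalar string features as bytes objects.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    return as_text(example["source_sentence"]), as_text(example["target_sentence"])


def load_pairs(config="tatoeba_af"):
    """Load one config's 'train' split and return a list of (source, target) pairs.

    Requires tensorflow_datasets and, on first use, network access.
    """
    import tensorflow_datasets as tfds  # lazy import so to_pair works without it

    ds = tfds.load(f"tatoeba/{config}", split="train")
    return [to_pair(ex) for ex in tfds.as_numpy(ds)]
```

Calling `load_pairs("tatoeba_de")` would, under these assumptions, return the sentence pairs of the German config as plain Python strings.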
Figure (tfds.show_examples): Not supported.
Citation:
@article{tatoeba,
title={Massively Multilingual Sentence Embeddings for Zero-Shot
Cross-Lingual Transfer and Beyond},
author={Artetxe, Mikel and Schwenk, Holger},
journal={arXiv:1812.10464v2},
year={2018}
}
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
tatoeba/tatoeba_af (default config)
Download size:
58.24 KiB
Dataset size:
162.74 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_ar
Download size:
70.95 KiB
Dataset size:
175.46 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_bg
Download size:
99.88 KiB
Dataset size:
204.64 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_bn
Download size:
89.55 KiB
Dataset size:
194.24 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_de
Download size:
103.09 KiB
Dataset size:
207.93 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_el
Download size:
77.11 KiB
Dataset size:
181.65 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_es
Download size:
70.57 KiB
Dataset size:
175.12 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_et
Download size:
58.33 KiB
Dataset size:
162.85 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_eu
Download size:
64.52 KiB
Dataset size:
169.02 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_fa
Download size:
91.52 KiB
Dataset size:
196.15 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_fi
Download size:
73.90 KiB
Dataset size:
178.47 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_fr
Download size:
78.14 KiB
Dataset size:
182.68 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_he
Download size:
81.54 KiB
Dataset size:
186.15 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_hi
Download size:
119.69 KiB
Dataset size:
224.89 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_hu
Download size:
67.27 KiB
Dataset size:
171.78 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_id
Download size:
73.09 KiB
Dataset size:
177.61 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_it
Download size:
64.29 KiB
Dataset size:
168.81 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_ja
Download size:
90.90 KiB
Dataset size:
195.53 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_jv
Download size:
13.59 KiB
Dataset size:
35.01 KiB
Splits:
Split | Examples |
---|---|
'train' | 205 |
tatoeba/tatoeba_ka
Download size:
70.47 KiB
Dataset size:
148.67 KiB
Splits:
Split | Examples |
---|---|
'train' | 746 |
tatoeba/tatoeba_kk
Download size:
46.07 KiB
Dataset size:
106.25 KiB
Splits:
Split | Examples |
---|---|
'train' | 575 |
tatoeba/tatoeba_ko
Download size:
77.28 KiB
Dataset size:
181.88 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_ml
Download size:
92.50 KiB
Dataset size:
165.14 KiB
Splits:
Split | Examples |
---|---|
'train' | 687 |
tatoeba/tatoeba_mr
Download size:
98.19 KiB
Dataset size:
202.96 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_nl
Download size:
71.55 KiB
Dataset size:
176.10 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_pt
Download size:
73.42 KiB
Dataset size:
177.95 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_ru
Download size:
90.30 KiB
Dataset size:
194.92 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_sw
Download size:
19.99 KiB
Dataset size:
60.75 KiB
Splits:
Split | Examples |
---|---|
'train' | 390 |
tatoeba/tatoeba_ta
Download size:
38.52 KiB
Dataset size:
70.93 KiB
Splits:
Split | Examples |
---|---|
'train' | 307 |
tatoeba/tatoeba_te
Download size:
24.55 KiB
Dataset size:
49.07 KiB
Splits:
Split | Examples |
---|---|
'train' | 234 |
tatoeba/tatoeba_th
Download size:
61.72 KiB
Dataset size:
119.32 KiB
Splits:
Split | Examples |
---|---|
'train' | 548 |
tatoeba/tatoeba_tl
Download size:
66.54 KiB
Dataset size:
171.04 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_tr
Download size:
70.20 KiB
Dataset size:
174.70 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_ur
Download size:
86.63 KiB
Dataset size:
191.20 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_vi
Download size:
89.26 KiB
Dataset size:
193.89 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |
tatoeba/tatoeba_zh
Download size:
67.32 KiB
Dataset size:
171.85 KiB
Splits:
Split | Examples |
---|---|
'train' | 1,000 |