- Description:
OPUS is a collection of translated texts from the web.
Create your own config to choose which data / language pair to load.
import tensorflow_datasets as tfds

config = tfds.translate.opus.OpusConfig(
    version=tfds.core.Version('0.1.0'),
    language_pair=("de", "en"),
    subsets=["GNOME", "EMEA"]
)
builder = tfds.builder("opus", config=config)
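As a rough sketch of what can follow once the builder exists, the function below prepares the data and reads a few sentence pairs. It reuses the `OpusConfig` call from the description as-is (the module path may differ between TFDS releases). The function name `load_opus_pairs` and its `limit` parameter are hypothetical, introduced only for this illustration.

```python
def load_opus_pairs(language_pair=("de", "en"), subsets=("GNOME", "EMEA"), limit=3):
    """Build a custom OPUS config and return the first `limit` sentence pairs.

    Hypothetical helper for illustration; it is not part of TFDS.
    """
    # Imported lazily so that defining the function does not require TFDS.
    import tensorflow_datasets as tfds

    # Same config call as in the description above.
    config = tfds.translate.opus.OpusConfig(
        version=tfds.core.Version("0.1.0"),
        language_pair=tuple(language_pair),
        subsets=list(subsets),
    )
    builder = tfds.builder("opus", config=config)
    builder.download_and_prepare()  # downloads the selected subsets on first use
    ds = builder.as_dataset(split="train")

    src, tgt = language_pair
    pairs = []
    # tfds.as_numpy yields dicts whose text values are UTF-8 encoded bytes.
    for example in tfds.as_numpy(ds.take(limit)):
        pairs.append((example[src].decode("utf-8"), example[tgt].decode("utf-8")))
    return pairs
```

Calling `load_opus_pairs()` needs network access the first time, since the selected subsets are downloaded before the split can be read.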
Additional Documentation: Explore on Papers With Code
Homepage: http://opus.nlpl.eu/
Source code: tfds.datasets.opus.Builder
Versions:
0.1.0 (default): No release notes.
Feature structure:
Translation({
    'de': Text(shape=(), dtype=string),
    'en': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | Translation | | | |
de | Text | | string | |
en | Text | | string | |
Supervised keys (see the as_supervised doc): ('de', 'en')
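To make the supervised keys concrete: with `as_supervised=True` (for example `tfds.load("opus", split="train", as_supervised=True)`), each example is returned as an `(input, target)` tuple instead of a feature dictionary. The plain-Python sketch below mimics that mapping only; `to_supervised` is a hypothetical helper and the sentence pair is made up.

```python
def to_supervised(example, supervised_keys=("de", "en")):
    """Map a feature dict to an (input, target) tuple using the supervised keys."""
    input_key, target_key = supervised_keys
    return example[input_key], example[target_key]


# Made-up sample in the shape of the Translation feature shown above.
example = {"de": "Guten Morgen.", "en": "Good morning."}
source, target = to_supervised(example)
print(source, "->", target)
```

With the keys `('de', 'en')`, German is the input and English is the target.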
Figure (tfds.show_examples): Not supported.
Citation:
@inproceedings{Tiedemann2012ParallelData,
  author = {Tiedemann, J},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {LREC},
  year = {2012}
}
opus/medical (default config)
Config description: medical documents
Download size: 34.29 MiB
Dataset size: 188.85 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'train' | 1,108,752 |
opus/law
Config description: law documents
Download size: 46.99 MiB
Dataset size: 214.44 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:
Split | Examples |
---|---|
'train' | 719,372 |
opus/koran
Config description: koran documents
Download size: 35.42 MiB
Dataset size: 117.54 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 537,128 |
opus/IT
Config description: IT documents
Download size: 10.33 MiB
Dataset size: 42.51 MiB
Auto-cached (documentation): Yes
Splits:
Split | Examples |
---|---|
'train' | 347,817 |
opus/subtitles
Config description: subtitles documents
Download size: 677.64 MiB
Dataset size: 2.01 GiB
Auto-cached (documentation): No
Splits:
Split | Examples |
---|---|
'train' | 22,512,639 |
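The three auto-cached values listed for the configs ("Yes", "Only when shuffle_files=False", "No") appear to follow a simple size-and-shard rule. The sketch below is an assumption based on a recollection of the TFDS performance guide: a dataset is auto-cached when it is under 250 MiB and either a single shard is read or `shuffle_files` is disabled. The function name, the threshold constant, and the shard counts are hypothetical and not taken from this page.

```python
AUTO_CACHE_LIMIT_MIB = 250  # assumed threshold, not stated on this page


def auto_cache_status(dataset_size_mib, num_shards):
    """Return the auto-cache label a dataset would get under the assumed rule."""
    if dataset_size_mib >= AUTO_CACHE_LIMIT_MIB:
        return "No"
    if num_shards == 1:
        # A single shard has no file order to shuffle, so caching is always safe.
        return "Yes"
    return "Only when shuffle_files=False"


# Dataset sizes are from the listing above; shard counts are hypothetical.
configs = {
    "opus/medical": (188.85, 2),
    "opus/law": (214.44, 2),
    "opus/koran": (117.54, 1),
    "opus/IT": (42.51, 1),
    "opus/subtitles": (2.01 * 1024, 16),
}
statuses = {name: auto_cache_status(size, shards) for name, (size, shards) in configs.items()}
for name, status in statuses.items():
    print(f"{name}: {status}")
```

Under these assumptions the labels match the ones listed for each config, which suggests that only opus/subtitles is too large to be cached automatically.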