opus

  • Description:

OPUS is a collection of translated texts from the web.

To choose which subsets and language pair to load, create your own config:

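import tensorflow_datasets as tfds

# Select the German-English pair from the GNOME and EMEA subsets.
config = tfds.translate.opus.OpusConfig(
    version=tfds.core.Version('0.1.0'),
    language_pair=("de", "en"),
    subsets=["GNOME", "EMEA"],
)
builder = tfds.builder("opus", config=config)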

  • Feature structure:

Translation({
    'de': Text(shape=(), dtype=string),
    'en': Text(shape=(), dtype=string),
})
  • Feature documentation:

Feature  Class        Shape  Dtype   Description
         Translation
de       Text                string
en       Text                string
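
The feature structure above can be read back as plain Python strings. The following is a minimal sketch, not the dataset's official usage: `decode_pair` and `iter_opus_pairs` are hypothetical helper names, and the sketch assumes the `opus/medical` config exposes the same `de` and `en` keys shown above. `tfds.as_numpy` returns text features as bytes, so each field is decoded as UTF-8.

```python
from typing import Dict, Iterator, Tuple


def decode_pair(example: Dict[str, bytes], src: str = "de", tgt: str = "en") -> Tuple[str, str]:
    """Turn one Translation example into a (source, target) pair of strings."""

    def to_text(value) -> str:
        # Text features arrive as bytes when iterating with tfds.as_numpy.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    return to_text(example[src]), to_text(example[tgt])


def iter_opus_pairs(name: str = "opus/medical", split: str = "train") -> Iterator[Tuple[str, str]]:
    """Yield decoded (de, en) pairs from an OPUS config (hypothetical helper)."""
    # Imported lazily so decode_pair stays usable without TensorFlow installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(name, split=split)  # downloads and prepares data on first use
    for example in tfds.as_numpy(ds):
        yield decode_pair(example)
```

Calling `next(iter_opus_pairs())` should give the first sentence pair once the data has been downloaded; the first call can take a while because the dataset is prepared locally.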
  • Citation:

@inproceedings{Tiedemann2012ParallelData,
  author = {Tiedemann, J},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {LREC},
  year = {2012}
}

opus/medical (default config)

  • Config description: medical documents

  • Download size: 34.29 MiB

  • Dataset size: 188.85 MiB

  • Auto-cached: Only when shuffle_files=False (train)

  • Splits:

Split Examples
'train' 1,108,752

opus/law

  • Config description: law documents

  • Download size: 46.99 MiB

  • Dataset size: 214.44 MiB

  • Auto-cached: Only when shuffle_files=False (train)

  • Splits:

Split Examples
'train' 719,372

opus/koran

  • Config description: koran documents

  • Download size: 35.42 MiB

  • Dataset size: 117.54 MiB

  • Auto-cached: Yes

  • Splits:

Split Examples
'train' 537,128

opus/IT

  • Config description: IT documents

  • Download size: 10.33 MiB

  • Dataset size: 42.51 MiB

  • Auto-cached: Yes

  • Splits:

Split Examples
'train' 347,817

opus/subtitles

  • Config description: subtitles documents

  • Download size: 677.64 MiB

  • Dataset size: 2.01 GiB

  • Auto-cached: No

  • Splits:

Split Examples
'train' 22,512,639
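
The per-config figures above can be collected into one table for quick comparisons. The sketch below copies the dataset sizes and example counts from this page and adds a hypothetical `auto_cache_status` helper that mimics the "Auto-cached" labels. The 250 MiB threshold reflects my understanding of the TFDS auto-caching rule, and the `single_shard` flag is an assumption, since shard counts are not listed here.

```python
from typing import Dict, Tuple

# Assumption: TFDS auto-caches only datasets smaller than this size.
AUTO_CACHE_LIMIT_MIB = 250.0

# config name -> (dataset size in MiB, number of 'train' examples), from this page
CONFIGS: Dict[str, Tuple[float, int]] = {
    "medical": (188.85, 1_108_752),
    "law": (214.44, 719_372),
    "koran": (117.54, 537_128),
    "IT": (42.51, 347_817),
    "subtitles": (2.01 * 1024, 22_512_639),  # 2.01 GiB converted to MiB
}


def auto_cache_status(size_mib: float, single_shard: bool) -> str:
    """Return the auto-cache label under the assumed rule (hypothetical helper)."""
    if size_mib >= AUTO_CACHE_LIMIT_MIB:
        return "No"
    # Small datasets are cached unconditionally only when a single shard is read.
    return "Yes" if single_shard else "Only when shuffle_files=False"


def total_examples(configs: Dict[str, Tuple[float, int]] = CONFIGS) -> int:
    """Sum the 'train' example counts across all listed configs."""
    return sum(count for _, count in configs.values())
```

For example, `total_examples()` adds up the five `'train'` splits, and `auto_cache_status(CONFIGS["subtitles"][0], single_shard=False)` returns `"No"` because the dataset is far above the assumed limit.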