opus

Description:

OPUS is a collection of translated texts from the web.

To choose which subsets and language pair to load, create your own config:

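import tensorflow_datasets as tfds

# Select the German-English pair from the GNOME and EMEA subsets.
config = tfds.translate.opus.OpusConfig(
    version=tfds.core.Version('0.1.0'),
    language_pair=("de", "en"),
    subsets=["GNOME", "EMEA"],
)
builder = tfds.builder("opus", config=config)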
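
The predefined configs listed further down this page (medical, law, koran, IT, subtitles) can presumably be loaded by name instead of building a custom config. The sketch below assumes the usual TFDS `"<dataset>/<config>"` naming; the helpers `opus_dataset_name` and `load_opus_train` are hypothetical names introduced here for illustration and are not part of TFDS.

```python
# Config names as listed on this page.
PREDEFINED_CONFIGS = ("medical", "law", "koran", "IT", "subtitles")


def opus_dataset_name(config_name: str) -> str:
    """Return the 'opus/<config>' name string for a predefined config."""
    if config_name not in PREDEFINED_CONFIGS:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {PREDEFINED_CONFIGS}"
        )
    return f"opus/{config_name}"


def load_opus_train(config_name: str = "medical"):
    """Download the data if needed and return the 'train' split.

    The import is done lazily so the name helper above can be used
    without tensorflow-datasets installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(opus_dataset_name(config_name), split="train")
```

Calling `load_opus_train("IT")` would trigger a download of the IT config on first use; every config on this page exposes only a `'train'` split.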

Additional Documentation: Explore on Papers With Code
Homepage: http://opus.nlpl.eu/
Source code: tfds.datasets.opus.Builder
Versions:
- 0.1.0 (default): No release notes.
Feature structure:

Translation({
    'de': Text(shape=(), dtype=string),
    'en': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Shape	Dtype	Description
	Translation
de	Text		string
en	Text		string

Supervised keys (See as_supervised doc): ('de', 'en')
Figure (tfds.show_examples): Not supported.
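
To illustrate what the supervised keys `('de', 'en')` mean: with `as_supervised=True`, each example is returned as an `(input, target)` tuple rather than a feature dictionary. The plain-Python sketch below mimics that mapping on a made-up record; `to_supervised` is a hypothetical helper, not a TFDS function, and the sentence pair is invented for illustration.

```python
def to_supervised(example: dict, supervised_keys=("de", "en")) -> tuple:
    """Turn a feature dict into an (input, target) tuple.

    This mirrors, conceptually, what requesting as_supervised=True
    does for a dataset whose supervised keys are ('de', 'en').
    """
    input_key, target_key = supervised_keys
    return example[input_key], example[target_key]


# Invented record with the same feature structure as the dataset.
record = {"de": "Guten Morgen", "en": "Good morning"}
source, target = to_supervised(record)

# With TFDS itself, the equivalent request would look like:
#   tfds.load("opus", split="train", as_supervised=True)
```

Here `source` holds the German text and `target` the English text, matching the key order above.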
Citation:

@inproceedings{Tiedemann2012ParallelData,
  author = {Tiedemann, J},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {LREC},
  year = {2012}
}

opus/medical (default config)

Config description: medical documents
Download size: 34.29 MiB
Dataset size: 188.85 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:

Split	Examples
`'train'`	1,108,752

opus/law

Config description: law documents
Download size: 46.99 MiB
Dataset size: 214.44 MiB
Auto-cached (documentation): Only when shuffle_files=False (train)
Splits:

Split	Examples
`'train'`	719,372

opus/koran

Config description: koran documents
Download size: 35.42 MiB
Dataset size: 117.54 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	537,128

opus/IT

Config description: IT documents
Download size: 10.33 MiB
Dataset size: 42.51 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	347,817

opus/subtitles

Config description: subtitles documents
Download size: 677.64 MiB
Dataset size: 2.01 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	22,512,639

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License.

Last updated 2022-12-15 UTC.