bucc

Description:

Identifying parallel sentences in comparable corpora. Given two sentence-split monolingual corpora, participant systems are expected to identify pairs of sentences that are translations of each other.

The BUCC mining task is a shared task, available since 2016, on extracting parallel sentences from two monolingual corpora, a subset of which is assumed to be parallel. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list of true translation pairs, which serves as the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is compared against the ground truth and evaluated by F1 score.
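The evaluation described above can be sketched as a set comparison between mined pairs and the gold mapping. This is a minimal illustration, not the official BUCC scorer: the function name `score_pairs` and the ID strings in the usage example are hypothetical, and the sketch assumes each pair is identified by its (source_id, target_id) tuple.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (source_id, target_id); an assumed representation


def score_pairs(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted pairs against gold pairs."""
    predicted_set = set(predicted)
    gold_set = set(gold)
    true_positives = len(predicted_set & gold_set)

    # Guard against empty inputs so the ratios are always defined.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IDs, for illustration only.
gold = {("src-1", "tgt-1"), ("src-2", "tgt-2"), ("src-3", "tgt-3")}
mined = {("src-1", "tgt-1"), ("src-2", "tgt-9")}
precision, recall, f1 = score_pairs(mined, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here one of the two mined pairs is correct and one of the three gold pairs is recovered, so precision is 1/2 and recall is 1/3.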

Homepage: https://comparable.limsi.fr/bucc2018/
Source code: tfds.datasets.bucc.Builder
Versions:
- 1.0.0 (default): Initial release.
Auto-cached (documentation): Yes
Feature structure:

FeaturesDict({
    'source_id': Text(shape=(), dtype=string),
    'source_sentence': Text(shape=(), dtype=string),
    'target_id': Text(shape=(), dtype=string),
    'target_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
source_id	Text	string
source_sentence	Text	string
target_id	Text	string
target_sentence	Text	string
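As a rough sketch of how the four text features might be read, the snippet below loads one config with `tfds.load`, iterates it with `tfds.as_numpy`, and decodes each field to a Python string. The helper names `decode_example`, `iter_bucc`, and `gold_pairs` are illustrative and not part of TFDS. The sketch assumes `tensorflow_datasets` is installed and that string features arrive as UTF-8 bytes. Because the dataset defines no supervised keys, examples are read as dictionaries rather than (input, label) tuples.

```python
from typing import Dict, Iterable, Iterator, Mapping, Set, Tuple

FIELDS = ("source_id", "source_sentence", "target_id", "target_sentence")


def decode_example(example: Mapping[str, object]) -> Dict[str, str]:
    """Convert one example's UTF-8 byte strings into Python str values."""
    decoded = {}
    for name in FIELDS:
        value = example[name]
        decoded[name] = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return decoded


def iter_bucc(config: str = "bucc_de", split: str = "validation") -> Iterator[Dict[str, str]]:
    """Yield decoded examples from one BUCC config (downloads data on first use)."""
    import tensorflow_datasets as tfds  # imported lazily; requires tensorflow_datasets

    dataset = tfds.load(f"bucc/{config}", split=split)
    for example in tfds.as_numpy(dataset):
        yield decode_example(example)


def gold_pairs(examples: Iterable[Mapping[str, str]]) -> Set[Tuple[str, str]]:
    """Collect the (source_id, target_id) pairs from decoded examples."""
    return {(ex["source_id"], ex["target_id"]) for ex in examples}
```

Calling `gold_pairs(iter_bucc("bucc_de", "validation"))` would then give the set of ID pairs for that split, which can serve as the ground truth when scoring a mining system.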

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@inproceedings{zweigenbaum2018overview,
  title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},
  author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
  booktitle={Proceedings of 11th Workshop on Building and Using Comparable Corpora},
  pages={39--42},
  year={2018}
}

bucc/bucc_de (default config)

Download size: 29.30 MiB
Dataset size: 3.21 MiB
Splits:

Split	Examples
`'test'`	9,580
`'validation'`	1,038

bucc/bucc_fr

Download size: 21.65 MiB
Dataset size: 2.90 MiB
Splits:

Split	Examples
`'test'`	9,086
`'validation'`	929

bucc/bucc_zh

Download size: 6.79 MiB
Dataset size: 615.20 KiB
Splits:

Split	Examples
`'test'`	1,899
`'validation'`	257

bucc/bucc_ru

Download size: 39.44 MiB
Dataset size: 6.36 MiB
Splits:

Split	Examples
`'test'`	14,435
`'validation'`	2,374
