• Description:

Identifying parallel sentences in comparable corpora. Given two sentence-split monolingual corpora, participant systems are expected to identify pairs of sentences that are translations of each other.

The BUCC mining task is a shared task on parallel sentence extraction from two monolingual corpora, a subset of which is assumed to be parallel. It has been available since 2016. For each language pair, the shared task provides a monolingual corpus for each language and a gold mapping list containing the true translation pairs, which serve as the ground truth. The task is to construct a list of translation pairs from the monolingual corpora. The constructed list is then compared to the ground truth and evaluated in terms of the F1 measure.
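The evaluation described above can be illustrated with a minimal sketch. It assumes that both the constructed list and the gold mapping are represented as (source_id, target_id) pairs and that F1 is the usual harmonic mean of precision and recall over those pairs. The function name and the example IDs are hypothetical, and this is not the official scoring script.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source_id, target_id); this representation is an assumption


def precision_recall_f1(predicted: Iterable[Pair],
                        gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Score a constructed pair list against the gold mapping list."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)

    # A predicted pair counts as correct only if it appears in the gold list.
    true_positives = len(predicted_set & gold_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IDs, for illustration only.
gold = [("de-1", "en-7"), ("de-2", "en-3"), ("de-5", "en-9")]
predicted = [("de-1", "en-7"), ("de-2", "en-4"), ("de-5", "en-9"), ("de-8", "en-1")]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

In this toy case two of the four predicted pairs are correct and two of the three gold pairs are found, so precision is 0.5, recall is 2/3, and F1 is 4/7.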

  • Feature structure:

FeaturesDict({
    'source_id': Text(shape=(), dtype=tf.string),
    'source_sentence': Text(shape=(), dtype=tf.string),
    'target_id': Text(shape=(), dtype=tf.string),
    'target_sentence': Text(shape=(), dtype=tf.string),
})
  • Feature documentation:
Feature          Class  Shape  Dtype      Description
source_id        Text   ()     tf.string
source_sentence  Text   ()     tf.string
target_id        Text   ()     tf.string
target_sentence  Text   ()     tf.string
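
As an illustration of how the four features fit together, the sketch below collects records into a set of gold (source_id, target_id) pairs plus ID-to-sentence lookups. It assumes each record is a mapping with exactly these four keys, and that values may arrive either as `str` or as UTF-8 `bytes` (a common form for `tf.string` values converted to NumPy). The helper names and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Mapping, Set, Tuple, Union

TextLike = Union[str, bytes]


def as_text(value: TextLike) -> str:
    """Return str unchanged; decode bytes as UTF-8 (assumed encoding)."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def index_examples(
    examples: Iterable[Mapping[str, TextLike]],
) -> Tuple[Set[Tuple[str, str]], Dict[str, str], Dict[str, str]]:
    """Build gold pairs and per-side sentence lookups from records."""
    gold_pairs: Set[Tuple[str, str]] = set()
    source_sentences: Dict[str, str] = {}
    target_sentences: Dict[str, str] = {}

    for example in examples:
        source_id = as_text(example["source_id"])
        target_id = as_text(example["target_id"])
        gold_pairs.add((source_id, target_id))
        source_sentences[source_id] = as_text(example["source_sentence"])
        target_sentences[target_id] = as_text(example["target_sentence"])

    return gold_pairs, source_sentences, target_sentences


# Hypothetical records, mixing str and bytes values on purpose.
records = [
    {"source_id": b"de-1", "source_sentence": b"Guten Morgen.",
     "target_id": b"en-7", "target_sentence": b"Good morning."},
    {"source_id": "de-2", "source_sentence": "Danke.",
     "target_id": "en-3", "target_sentence": "Thank you."},
]

gold_pairs, source_sentences, target_sentences = index_examples(records)
print(sorted(gold_pairs))
```

Keeping the source and target lookups separate avoids relying on the two ID spaces being disjoint, which the feature description does not state.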

  • Citation:

  title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},
  author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
  booktitle={Proceedings of 11th Workshop on Building and Using Comparable Corpora},

bucc/bucc_de (default config)

  • Download size: 29.30 MiB

  • Dataset size: 3.21 MiB

  • Splits:

Split Examples
'test' 9,580
'validation' 1,038

bucc/bucc_fr

  • Download size: 21.65 MiB

  • Dataset size: 2.90 MiB

  • Splits:

Split Examples
'test' 9,086
'validation' 929

bucc/bucc_zh

  • Download size: 6.79 MiB

  • Dataset size: 615.20 KiB

  • Splits:

Split Examples
'test' 1,899
'validation' 257

bucc/bucc_ru

  • Download size: 39.44 MiB

  • Dataset size: 6.36 MiB

  • Splits:

Split Examples
'test' 14,435
'validation' 2,374
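
A config such as bucc/bucc_de could be loaded roughly as sketched below. This assumes the `tensorflow_datasets` package is installed and that the dataset can be downloaded automatically. The helper names `config_name` and `load_bucc` are hypothetical, and the `bucc/bucc_<language>` naming pattern is inferred from the default config name rather than stated for every language.

```python
from typing import Dict

# Split sizes listed for bucc/bucc_de, usable as a sanity check after loading.
EXPECTED_DE_EXAMPLES: Dict[str, int] = {"test": 9580, "validation": 1038}


def config_name(language: str) -> str:
    """Build a config name following the 'bucc/bucc_<language>' pattern (assumed)."""
    return f"bucc/bucc_{language}"


def load_bucc(language: str = "de", split: str = "validation"):
    """Load one split of one config as a tf.data.Dataset.

    The import is done lazily so that this module can be imported
    without TensorFlow Datasets being installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(config_name(language), split=split)
```

Calling `load_bucc("de", "validation")` would trigger the download on first use. Each element of the returned dataset is expected to be a dictionary with the four string features described in the feature structure.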