bucc
Identifying parallel sentences in comparable corpora. Given two sentence-split
monolingual corpora, participant systems are expected to identify pairs of
sentences that are translations of each other.
The BUCC mining task is a shared task on extracting parallel sentences from two
monolingual corpora, a subset of whose sentences is assumed to be parallel. It
has been available since 2016. For each language pair, the shared task provides a
monolingual corpus for each language and a gold mapping list containing true
translation pairs. These pairs are the ground truth. The task is to construct a
list of translation pairs from the monolingual corpora. The constructed list is
compared to the ground truth, and evaluated in terms of the F1 measure.
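The evaluation described above can be sketched as follows. This is a minimal illustration, not the official BUCC scorer: it assumes pairs are compared by exact match on (source_id, target_id), and the function name `score_pairs` and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source_id, target_id)


def score_pairs(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Dict[str, float]:
    """Return precision, recall and F1 of predicted pairs against gold pairs."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    # A predicted pair counts as correct only if it appears in the gold list.
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of precision and recall (0 when nothing matches).
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical IDs, for illustration only (not taken from the dataset).
gold = [("de-1", "en-7"), ("de-2", "en-3"), ("de-5", "en-9"), ("de-8", "en-4")]
predicted = [("de-1", "en-7"), ("de-2", "en-3"), ("de-6", "en-1")]
scores = score_pairs(predicted, gold)
print(scores)
```

Here two of the three predicted pairs are in the gold list of four, so precision is 2/3, recall is 1/2, and F1 is 4/7.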
FeaturesDict({
    'source_id': Text(shape=(), dtype=string),
    'source_sentence': Text(shape=(), dtype=string),
    'target_id': Text(shape=(), dtype=string),
    'target_sentence': Text(shape=(), dtype=string),
})
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| source_id | Text | | string | |
| source_sentence | Text | | string | |
| target_id | Text | | string | |
| target_sentence | Text | | string | |
@inproceedings{zweigenbaum2018overview,
title={Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora},
author={Zweigenbaum, Pierre and Sharoff, Serge and Rapp, Reinhard},
booktitle={Proceedings of 11th Workshop on Building and Using Comparable Corpora},
pages={39--42},
year={2018}
}
bucc/bucc_de (default config)
Download size: 29.30 MiB
Dataset size: 3.21 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 9,580 |
| 'validation' | 1,038 |
bucc/bucc_fr
Download size: 21.65 MiB
Dataset size: 2.90 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 9,086 |
| 'validation' | 929 |
bucc/bucc_zh
Download size: 6.79 MiB
Dataset size: 615.20 KiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 1,899 |
| 'validation' | 257 |
bucc/bucc_ru
Download size: 39.44 MiB
Dataset size: 6.36 MiB
Splits:
| Split | Examples |
|---|---|
| 'test' | 14,435 |
| 'validation' | 2,374 |
Last updated 2022-12-06 UTC.
Homepage: https://comparable.limsi.fr/bucc2018/
Source code: tfds.datasets.bucc.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/bucc/bucc_dataset_builder.py)
Versions: 1.0.0 (default): Initial release.
Auto-cached: Yes
Supervised keys: None
Figure (tfds.show_examples): Not supported.