References:
best2009
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:best2009/best2009')
- Description:
`best2009` is a Thai word-tokenization dataset drawn from encyclopedias, novels, news, and articles by
[NECTEC](https://www.nectec.or.th/) (148,995 training lines and 2,252 test lines). It was created for the
[BEST 2010: Word Tokenization Competition](https://thailang.nectec.or.th/archive/indexa290.html?q=node/10).
The test set answers are not publicly available.
- License: CC-BY-NC-SA 3.0
- Version: 1.0.0
- Splits:
Split | Examples |
---|---|
'test' | 2252 |
'train' | 148995 |
- Features:
{
"fname": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"char": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"char_type": {
"feature": {
"num_classes": 12,
"names": [
"b_e",
"c",
"d",
"n",
"o",
"p",
"q",
"s",
"s_e",
"t",
"v",
"w"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"is_beginning": {
"feature": {
"num_classes": 2,
"names": [
"neg",
"pos"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
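
The feature spec above describes three parallel per-character sequences (`char`, `char_type`, `is_beginning`). As a minimal sketch of how one example might be decoded, the snippet below maps `char_type` ids back to their label names and regroups characters into words. It assumes the sequences have already been converted to plain Python lists of strings and integers, and that the `pos` label of `is_beginning` marks the first character of a word; neither assumption is stated in the card itself. The helper names `decode_char_types` and `chars_to_words` and the sample input are hypothetical, not part of the dataset or of TFDS.

```python
from typing import List, Sequence

# Label names copied from the `char_type` and `is_beginning` feature specs.
CHAR_TYPE_NAMES = ["b_e", "c", "d", "n", "o", "p", "q", "s", "s_e", "t", "v", "w"]
IS_BEGINNING_NAMES = ["neg", "pos"]


def decode_char_types(char_type_ids: Sequence[int]) -> List[str]:
    """Map integer ClassLabel ids to their string names."""
    return [CHAR_TYPE_NAMES[i] for i in char_type_ids]


def chars_to_words(chars: Sequence[str], is_beginning: Sequence[int]) -> List[str]:
    """Group characters into words, opening a new word at each "pos" flag.

    Assumption: the "pos" label marks the first character of a word.
    """
    if len(chars) != len(is_beginning):
        raise ValueError("char and is_beginning must have the same length")
    pos = IS_BEGINNING_NAMES.index("pos")
    words: List[str] = []
    for ch, flag in zip(chars, is_beginning):
        # Also open a word if the sequence does not start with a "pos" flag.
        if flag == pos or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Hypothetical input, not taken from the dataset: "ไปกิน" split as "ไป" + "กิน".
example_chars = list("ไปกิน")
example_flags = [1, 0, 1, 0, 0]
example_words = chars_to_words(example_chars, example_flags)
```

Raising an error when the two sequences differ in length is a design choice of this sketch; it surfaces misaligned inputs early instead of silently truncating them, as `zip` would on its own.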