参考:
ner
使用以下命令在 TFDS 中加载此数据集:
ds = tfds.load('huggingface:id_nergrit_corpus/ner')
- 说明:
Nergrit Corpus is a dataset collection for Indonesian Named Entity Recognition, Statement Extraction, and Sentiment
Analysis. id_nergrit_corpus is the Named Entity Recognition of this dataset collection which contains 18 entities as
follow:
'CRD': Cardinal
'DAT': Date
'EVT': Event
'FAC': Facility
'GPE': Geopolitical Entity
'LAW': Law Entity (such as Undang-Undang)
'LOC': Location
'MON': Money
'NOR': Political Organization
'ORD': Ordinal
'ORG': Organization
'PER': Person
'PRC': Percent
'PRD': Product
'QTY': Quantity
'REG': Religion
'TIM': Time
'WOA': Work of Art
'LAN': Language
- 许可:无已知许可
- 版本:1.0.0
- 拆分:
拆分 | 样本 |
---|---|
'test' |
2399 |
'train' |
12532 |
'validation' |
2521 |
- 特征:
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 39,
"names": [
"B-CRD",
"B-DAT",
"B-EVT",
"B-FAC",
"B-GPE",
"B-LAN",
"B-LAW",
"B-LOC",
"B-MON",
"B-NOR",
"B-ORD",
"B-ORG",
"B-PER",
"B-PRC",
"B-PRD",
"B-QTY",
"B-REG",
"B-TIM",
"B-WOA",
"I-CRD",
"I-DAT",
"I-EVT",
"I-FAC",
"I-GPE",
"I-LAN",
"I-LAW",
"I-LOC",
"I-MON",
"I-NOR",
"I-ORD",
"I-ORG",
"I-PER",
"I-PRC",
"I-PRD",
"I-QTY",
"I-REG",
"I-TIM",
"I-WOA",
"O"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
sentiment
使用以下命令在 TFDS 中加载此数据集:
ds = tfds.load('huggingface:id_nergrit_corpus/sentiment')
- 说明:
Nergrit Corpus is a dataset collection for Indonesian Named Entity Recognition, Statement Extraction, and Sentiment
Analysis. id_nergrit_corpus is the Named Entity Recognition of this dataset collection which contains 18 entities as
follow:
'CRD': Cardinal
'DAT': Date
'EVT': Event
'FAC': Facility
'GPE': Geopolitical Entity
'LAW': Law Entity (such as Undang-Undang)
'LOC': Location
'MON': Money
'NOR': Political Organization
'ORD': Ordinal
'ORG': Organization
'PER': Person
'PRC': Percent
'PRD': Product
'QTY': Quantity
'REG': Religion
'TIM': Time
'WOA': Work of Art
'LAN': Language
- 许可:无已知许可
- 版本:1.0.0
- 拆分:
拆分 | 样本 |
---|---|
'test' |
2317 |
'train' |
7485 |
'validation' |
782 |
- 特征:
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 7,
"names": [
"B-NEG",
"B-NET",
"B-POS",
"I-NEG",
"I-NET",
"I-POS",
"O"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}
statement
使用以下命令在 TFDS 中加载此数据集:
ds = tfds.load('huggingface:id_nergrit_corpus/statement')
- 说明:
Nergrit Corpus is a dataset collection for Indonesian Named Entity Recognition, Statement Extraction, and Sentiment
Analysis. id_nergrit_corpus is the Named Entity Recognition of this dataset collection which contains 18 entities as
follow:
'CRD': Cardinal
'DAT': Date
'EVT': Event
'FAC': Facility
'GPE': Geopolitical Entity
'LAW': Law Entity (such as Undang-Undang)
'LOC': Location
'MON': Money
'NOR': Political Organization
'ORD': Ordinal
'ORG': Organization
'PER': Person
'PRC': Percent
'PRD': Product
'QTY': Quantity
'REG': Religion
'TIM': Time
'WOA': Work of Art
'LAN': Language
- 许可:无已知许可
- 版本:1.0.0
- 拆分:
拆分 | 样本 |
---|---|
'test' |
335 |
'train' |
2405 |
'validation' |
176 |
- 特征:
{
"id": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"tokens": {
"feature": {
"dtype": "string",
"id": null,
"_type": "Value"
},
"length": -1,
"id": null,
"_type": "Sequence"
},
"ner_tags": {
"feature": {
"num_classes": 9,
"names": [
"B-BREL",
"B-FREL",
"B-STAT",
"B-WHO",
"I-BREL",
"I-FREL",
"I-STAT",
"I-WHO",
"O"
],
"names_file": null,
"id": null,
"_type": "ClassLabel"
},
"length": -1,
"id": null,
"_type": "Sequence"
}
}