wiki_bio

References:

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:wiki_bio')

Description:

This dataset gathers 728,321 biographies from Wikipedia. It aims at evaluating text generation
algorithms. For each article, we provide the first paragraph (text) and the infobox (structured
data), both tokenized. Each infobox is encoded as a list of (field name, field value) pairs.
We used Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) to preprocess the data, i.e.
we broke the text into sentences and tokenized both the text and the field values. The dataset
was randomly split into three subsets: train (80%), valid (10%), test (10%).

License: CC BY-SA 3.0
Version: 1.2.0
Splits:

Split	Examples
`'test'`	72831
`'train'`	582659
`'val'`	72831
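
As a quick sanity check, the split sizes in the table can be compared with the 728,321 total and the 80/10/10 proportions given in the description. The sketch below only does arithmetic on the numbers listed above; the variable names are illustrative. Note that the table names the validation split `'val'`, while the description calls it "valid".

```python
# Split sizes copied from the table above.
split_sizes = {"train": 582659, "val": 72831, "test": 72831}

# Total number of examples across all three splits.
total = sum(split_sizes.values())

# Share of each split, as a percentage of the whole dataset.
shares = {name: 100 * n / total for name, n in split_sizes.items()}

for name, pct in shares.items():
    print(f"{name}: {split_sizes[name]} examples ({pct:.1f}%)")
print(f"total: {total}")
```

The three sizes sum to the 728,321 biographies mentioned in the description, and the shares round to 80%, 10% and 10%.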

Features:

{
    "input_text": {
        "table": {
            "feature": {
                "column_header": {
                    "dtype": "string",
                    "id": null,
                    "_type": "Value"
                },
                "row_number": {
                    "dtype": "int16",
                    "id": null,
                    "_type": "Value"
                },
                "content": {
                    "dtype": "string",
                    "id": null,
                    "_type": "Value"
                }
            },
            "length": -1,
            "id": null,
            "_type": "Sequence"
        },
        "context": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    },
    "target_text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
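
The schema above can be read as follows: `table` is a variable-length sequence whose entries each carry a `column_header`, a `row_number` and a `content`, which corresponds to the (field name, field value) pairs from the description. The sketch below is a minimal illustration, assuming the sequence is materialized as a dict of parallel lists (the usual Hugging Face layout for a `Sequence` of named fields). The record is hypothetical and invented for illustration, and `infobox_pairs` is an illustrative helper, not part of any library.

```python
from typing import Dict, List, Tuple

# Hypothetical record, invented for illustration only; it is not taken from the dataset.
# Assumption: the "table" sequence is stored as a dict of parallel lists.
example = {
    "input_text": {
        "table": {
            "column_header": ["name", "birth_date", "occupation"],
            "row_number": [1, 1, 1],
            "content": ["jane doe", "1 january 1900", "writer"],
        },
        "context": "jane doe\n",
    },
    "target_text": "jane doe ( born 1 january 1900 ) was a writer .\n",
}


def infobox_pairs(table: Dict[str, List]) -> List[Tuple[str, str]]:
    """Zip the parallel lists back into (field name, field value) pairs."""
    return list(zip(table["column_header"], table["content"]))


pairs = infobox_pairs(example["input_text"]["table"])
for field, value in pairs:
    print(f"{field}: {value}")
```

If the loader yields the table as a list of per-entry dicts instead, the same pairs can be built by iterating over the entries directly.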