brwac

Người giới thiệu:

Sử dụng lệnh sau để tải tập dữ liệu này trong TFDS:

ds = tfds.load('huggingface:brwac')
  • Mô tả :
The BrWaC (Brazilian Portuguese Web as Corpus) is a large corpus constructed following the Wacky framework,
which was made public for research purposes. The current corpus version, released in January 2017, is composed by
3.53 million documents, 2.68 billion tokens and 5.79 million types. Please note that this resource is available
solely for academic research purposes, and you agreed not to use it for any commercial applications.
Manually download at https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
  • Giấy phép : Không có giấy phép đã biết
  • Phiên bản : 1.0.0
  • Tách :
Tách ra Các ví dụ
'train' 3530796
  • Các tính năng :
{
    "doc_id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "title": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "uri": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "feature": {
            "paragraphs": {
                "feature": {
                    "dtype": "string",
                    "id": null,
                    "_type": "Value"
                },
                "length": -1,
                "id": null,
                "_type": "Sequence"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}