lccc

References:

large

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:lccc/large')

Description:

The Large-scale Cleaned Chinese Conversation corpus (LCCC) is a large corpus of Chinese conversations.
A rigorous data cleaning pipeline is designed to ensure the quality of the corpus.
This pipeline involves a set of rules and several classifier-based filters.
Noise such as offensive or sensitive words, special symbols, emojis,
grammatically incorrect sentences, and incoherent conversations is filtered out.

License: MIT
Version: 1.0.0
Splits:

Split	Examples
`'train'`	12,007,759

Features:

{
    "dialog": [
        {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    ]
}
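
The schema above describes each record as a single `dialog` field holding a list of strings, presumably one string per utterance. As a minimal sketch, assuming a record has already been decoded into a plain Python dict, the following shows one way to split a dialog into context/response pairs. The helper name `to_context_response_pairs` is hypothetical, and the sample utterances are placeholders rather than data from the corpus.

```python
from typing import Dict, List, Tuple


def to_context_response_pairs(
    record: Dict[str, List[str]],
) -> List[Tuple[List[str], str]]:
    """Split one dialog into (context, response) pairs.

    For a dialog [u0, u1, u2] this yields ([u0], u1) and ([u0, u1], u2).
    Dialogs with fewer than two utterances yield no pairs.
    """
    dialog = record["dialog"]
    return [(dialog[:i], dialog[i]) for i in range(1, len(dialog))]


# Placeholder record shaped like the schema; not real corpus data.
example = {"dialog": ["utterance A", "utterance B", "utterance C"]}
pairs = to_context_response_pairs(example)
for context, response in pairs:
    print(context, "->", response)
```

Pairing every utterance with its full preceding context is only one possible convention; a response-generation setup might instead cap the context length or keep only the last turn.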

base

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:lccc/base')

Description:

The Large-scale Cleaned Chinese Conversation corpus (LCCC) is a large corpus of Chinese conversations.
A rigorous data cleaning pipeline is designed to ensure the quality of the corpus.
This pipeline involves a set of rules and several classifier-based filters.
Noise such as offensive or sensitive words, special symbols, emojis,
grammatically incorrect sentences, and incoherent conversations is filtered out.

License: MIT
Version: 1.0.0
Splits:

Split	Examples
`'test'`	10,000
`'train'`	6,820,506
`'validation'`	20,000

Features:

{
    "dialog": [
        {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    ]
}