youtube_caption_corrections

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds
ds = tfds.load('huggingface:youtube_caption_corrections')
  • Description:
Dataset built from pairs of YouTube captions where both 'auto-generated' and
'manually-corrected' captions are available for a single specified language.
This dataset labels two-way (i.e. ignoring single-sided insertions) same-length
token differences in the `diff_type` column. The `default_seq` is composed of
tokens from the 'auto-generated' captions. When the 'auto-generated' and
'manually-corrected' captions differ, the `correction_seq` contains the
corresponding tokens from the 'manually-corrected' captions.
  • License: MIT License
  • Version: 0.0.0
  • Splits:
Split    Examples
'train'  10769
  • Features:
{
    "video_ids": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "default_seq": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "correction_seq": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "diff_type": {
        "feature": {
            "num_classes": 9,
            "names": [
                "NO_DIFF",
                "CASE_DIFF",
                "PUNCUATION_DIFF",
                "CASE_AND_PUNCUATION_DIFF",
                "STEM_BASED_DIFF",
                "DIGIT_DIFF",
                "INTRAWORD_PUNC_DIFF",
                "UNKNOWN_TYPE_DIFF",
                "RESERVED_DIFF"
            ],
            "names_file": null,
            "id": null,
            "_type": "ClassLabel"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}
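
The feature schema above can be read as three parallel per-token sequences plus a video identifier. The following is a minimal sketch of how the integer `diff_type` labels might be mapped back to their names and paired with the tokens. It assumes that `default_seq`, `correction_seq`, and `diff_type` are position-aligned and of equal length, which the description suggests but does not state outright. The helper `list_corrections` and the sample record are hypothetical and invented for illustration; they are not taken from the dataset. Records loaded through TFDS may also deliver strings as bytes, which would need decoding first.

```python
# Label names copied from the `diff_type` ClassLabel feature, in index order.
# The spelling "PUNCUATION" matches the dataset's own identifiers.
DIFF_TYPE_NAMES = [
    "NO_DIFF",
    "CASE_DIFF",
    "PUNCUATION_DIFF",
    "CASE_AND_PUNCUATION_DIFF",
    "STEM_BASED_DIFF",
    "DIGIT_DIFF",
    "INTRAWORD_PUNC_DIFF",
    "UNKNOWN_TYPE_DIFF",
    "RESERVED_DIFF",
]


def list_corrections(example):
    """Return (position, auto_token, corrected_token, label_name) tuples
    for every token whose label is not NO_DIFF.

    Assumes the three sequences are aligned position by position.
    """
    rows = []
    aligned = zip(
        example["default_seq"],
        example["correction_seq"],
        example["diff_type"],
    )
    for position, (auto_token, corrected_token, label) in enumerate(aligned):
        name = DIFF_TYPE_NAMES[label]
        if name != "NO_DIFF":
            rows.append((position, auto_token, corrected_token, name))
    return rows


# Hypothetical record with made-up tokens and labels (assumption: unchanged
# positions carry an empty string in `correction_seq`).
example = {
    "video_ids": "hypothetical_id",
    "default_seq": ["hello", "world", "its", "2"],
    "correction_seq": ["", "World.", "it's", ""],
    "diff_type": [0, 3, 6, 0],
}

for row in list_corrections(example):
    print(row)
```

Keeping the label names in a plain list mirrors the `names` array in the schema, so an index lookup is enough to decode a label without depending on any particular loader.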