multi_para_crawl

cs-is

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds

ds = tfds.load('huggingface:multi_para_crawl/cs-is')
  • Description :
Parallel corpora from Web Crawls collected in the ParaCrawl project and further processed for making it a multi-parallel corpus by pivoting via English. Here we only provide the additional language pairs that came out of pivoting. The bitexts for English are available from the ParaCrawl release.
40 languages, 669 bitexts
total number of files: 40
total number of tokens: 10.14G
total number of sentence fragments: 505.48M

Please, acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release at their website adjusted for redistribution via the OPUS corpus collection. Please, acknowledge OPUS as well for this service.
  • License : No known license
  • Version : 7.1.0
  • Splits :
Split     Examples
'train'   691006
  • Features :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "cs",
            "is"
        ],
        "id": null,
        "_type": "Translation"
    }
}
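
The feature spec above can be read as: each example has a string `id` and a `translation` entry holding one text per language code. The following is a minimal sketch of pulling a sentence pair out of one such example. It assumes the example arrives as a plain dict keyed by language code (the usual shape for a `Translation` feature) and that text may be `bytes`, as NumPy-based pipelines tend to yield. The helper `to_pair` and the sample record are hypothetical and invented for illustration; they are not part of the dataset or of TFDS.

```python
from typing import Mapping, Tuple, Union

Text = Union[str, bytes]


def to_pair(example: Mapping, src: str = "cs", tgt: str = "is") -> Tuple[str, str]:
    """Return the (source, target) sentences from one example dict."""
    translation = example["translation"]

    def as_str(value: Text) -> str:
        # Decode bytes as UTF-8; pass plain strings through unchanged.
        return value.decode("utf-8") if isinstance(value, bytes) else str(value)

    return as_str(translation[src]), as_str(translation[tgt])


# Hypothetical record, invented for illustration (not taken from the corpus).
sample = {"id": "0", "translation": {"cs": b"Dobr\xc3\xbd den", "is": "Góðan daginn"}}
print(to_pair(sample))
```

The same helper should work for the other configs by passing their language codes as `src` and `tgt`.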

ga-sk

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds

ds = tfds.load('huggingface:multi_para_crawl/ga-sk')
  • Description :
Parallel corpora from Web Crawls collected in the ParaCrawl project and further processed for making it a multi-parallel corpus by pivoting via English. Here we only provide the additional language pairs that came out of pivoting. The bitexts for English are available from the ParaCrawl release.
40 languages, 669 bitexts
total number of files: 40
total number of tokens: 10.14G
total number of sentence fragments: 505.48M

Please, acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release at their website adjusted for redistribution via the OPUS corpus collection. Please, acknowledge OPUS as well for this service.
  • License : No known license
  • Version : 7.1.0
  • Splits :
Split     Examples
'train'   390327
  • Features :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "ga",
            "sk"
        ],
        "id": null,
        "_type": "Translation"
    }
}
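
Every config name on this page joins two language codes with a hyphen, and in each case the codes appear in alphabetical order (`cs-is`, `ga-sk`, `lv-mt`, `nb-ru`, `de-tl`). Assuming that ordering holds for the other bitexts as well, which this page does not confirm, the name passed to `tfds.load` could be built from a language pair as sketched below. The helper `tfds_name` is hypothetical and only formats a string; it does not check that the config exists.

```python
def tfds_name(lang_a: str, lang_b: str, dataset: str = "multi_para_crawl") -> str:
    """Build a TFDS name such as 'huggingface:multi_para_crawl/ga-sk'."""
    # Assumption: config names list the two codes in alphabetical order,
    # as all configs shown on this page do.
    first, second = sorted((lang_a.lower(), lang_b.lower()))
    if first == second:
        raise ValueError("a bitext needs two different languages")
    return f"huggingface:{dataset}/{first}-{second}"


print(tfds_name("sk", "ga"))
```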

lv-mt

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds

ds = tfds.load('huggingface:multi_para_crawl/lv-mt')
  • Description :
Parallel corpora from Web Crawls collected in the ParaCrawl project and further processed for making it a multi-parallel corpus by pivoting via English. Here we only provide the additional language pairs that came out of pivoting. The bitexts for English are available from the ParaCrawl release.
40 languages, 669 bitexts
total number of files: 40
total number of tokens: 10.14G
total number of sentence fragments: 505.48M

Please, acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release at their website adjusted for redistribution via the OPUS corpus collection. Please, acknowledge OPUS as well for this service.
  • License : No known license
  • Version : 7.1.0
  • Splits :
Split     Examples
'train'   464160
  • Features :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "lv",
            "mt"
        ],
        "id": null,
        "_type": "Translation"
    }
}

nb-ru

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds

ds = tfds.load('huggingface:multi_para_crawl/nb-ru')
  • Description :
Parallel corpora from Web Crawls collected in the ParaCrawl project and further processed for making it a multi-parallel corpus by pivoting via English. Here we only provide the additional language pairs that came out of pivoting. The bitexts for English are available from the ParaCrawl release.
40 languages, 669 bitexts
total number of files: 40
total number of tokens: 10.14G
total number of sentence fragments: 505.48M

Please, acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release at their website adjusted for redistribution via the OPUS corpus collection. Please, acknowledge OPUS as well for this service.
  • License : No known license
  • Version : 7.1.0
  • Splits :
Split     Examples
'train'   399050
  • Features :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "nb",
            "ru"
        ],
        "id": null,
        "_type": "Translation"
    }
}

de-tl

Use the following command to load this dataset in TFDS:

import tensorflow_datasets as tfds

ds = tfds.load('huggingface:multi_para_crawl/de-tl')
  • Description :
Parallel corpora from Web Crawls collected in the ParaCrawl project and further processed for making it a multi-parallel corpus by pivoting via English. Here we only provide the additional language pairs that came out of pivoting. The bitexts for English are available from the ParaCrawl release.
40 languages, 669 bitexts
total number of files: 40
total number of tokens: 10.14G
total number of sentence fragments: 505.48M

Please, acknowledge the ParaCrawl project at http://paracrawl.eu. This version is derived from the original release at their website adjusted for redistribution via the OPUS corpus collection. Please, acknowledge OPUS as well for this service.
  • License : No known license
  • Version : 7.1.0
  • Splits :
Split     Examples
'train'   98156
  • Features :
{
    "id": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "translation": {
        "languages": [
            "de",
            "tl"
        ],
        "id": null,
        "_type": "Translation"
    }
}
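
For a quick sense of relative size, the 'train' example counts of the five configs listed on this page can be tallied as in the sketch below. The numbers are copied from the split listings of each config; they cover only these five configs, not all 669 bitexts mentioned in the description.

```python
# Example counts of the 'train' split, copied from the listings on this page.
TRAIN_EXAMPLES = {
    "cs-is": 691006,
    "ga-sk": 390327,
    "lv-mt": 464160,
    "nb-ru": 399050,
    "de-tl": 98156,
}

total = sum(TRAIN_EXAMPLES.values())
largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)

# Print configs from largest to smallest with their share of the listed total.
for config, count in sorted(TRAIN_EXAMPLES.items(), key=lambda kv: -kv[1]):
    print(f"{config}: {count:>7} ({count / total:.1%})")
print(f"total: {total}, largest: {largest}")
```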