oscar

References:

unshuffled_deduplicated_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
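Every config on this page follows the same naming pattern, so the name passed to `tfds.load` can be built from the language code. The sketch below is illustrative only: the helper names `oscar_config_name` and `load_oscar_train` are hypothetical and not part of TFDS, and actually loading the data assumes `tensorflow_datasets` is installed and the Hugging Face dataset is reachable.

```python
def oscar_config_name(lang_code: str) -> str:
    """Build the TFDS name of an unshuffled, deduplicated OSCAR config.

    Hypothetical helper: it only formats the string shown on this page.
    """
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar_train(lang_code: str):
    """Load the 'train' split of one config (needs TFDS and network access)."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split="train")


print(oscar_config_name("af"))
```

Calling `load_oscar_train("af")` would then request only the `'train'` split, which is the single split listed for each config here.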

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	130640

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
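The feature specification above describes two scalar fields, an `int64` `id` and a `string` `text`. As a rough sketch of how such a spec could be used to sanity-check records, the snippet below parses the JSON and checks plain Python values against it. The `PYTHON_TYPES` mapping and the `matches_schema` helper are assumptions made for illustration; real TFDS examples arrive as tensors (with text as bytes) and would need to be converted first.

```python
import json

# The feature spec exactly as listed for each config on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the spec's dtype names to plain Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the spec's fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(sorted(features))
print(matches_schema({"id": 0, "text": "Hallo wêreld"}, features))
```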

unshuffled_deduplicated_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4518

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	79928

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2025

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5343

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	27050

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	43102

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9212

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9985

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	307405

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	15762

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	36

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	26145

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	626796

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	98225

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	37

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1114481

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	702

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

लाइसेंस : ये डेटा इस लाइसेंसिंग योजना के तहत जारी किया गया है, हमारे पास उस पाठ का कोई स्वामित्व नहीं है जिससे ये डेटा निकाला गया है। हम क्रिएटिव कॉमन्स CC0 लाइसेंस ("कोई अधिकार सुरक्षित नहीं") के तहत इन डेटा की वास्तविक पैकेजिंग का लाइसेंस देते हैं http://creativecommons.org/publicdomain/zero/1.0/ कानून के तहत संभव सीमा तक, इनरिया ने सभी कॉपीराइट और संबंधित को माफ कर दिया है या ऑस्कर के पड़ोसी अधिकार यह कार्य यहां से प्रकाशित हुआ है: फ़्रांस।
क्या आपको यह विचार करना चाहिए कि हमारे डेटा में ऐसी सामग्री शामिल है जो आपके स्वामित्व में है और इसलिए इसे यहां पुन: प्रस्तुत नहीं किया जाना चाहिए, कृपया:
- विस्तृत संपर्क डेटा जैसे पता, टेलीफ़ोन नंबर या ईमेल पता, जिस पर आपसे संपर्क किया जा सकता है, के साथ स्पष्ट रूप से अपनी पहचान बताएं।
- उल्लंघन का दावा किए गए कॉपीराइट कार्य की स्पष्ट रूप से पहचान करें।
- उस सामग्री की स्पष्ट रूप से पहचान करें जिसके उल्लंघन का दावा किया गया है और हमें सामग्री का पता लगाने की अनुमति देने के लिए पर्याप्त जानकारी है।
हम कॉर्पस की अगली रिलीज़ से प्रभावित स्रोतों को हटाकर वैध अनुरोधों का अनुपालन करेंगे।
संस्करण : 1.0.0
विभाजन :

विभाजित करना	उदाहरण
`'train'`	2984

विशेषताएँ :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
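
Every entry on this page follows the same naming pattern, so the load call can be wrapped in a small helper. The sketch below is one possible way to do that; `oscar_tfds_name` and `load_oscar` are hypothetical names, and `load_oscar` assumes `tensorflow_datasets` (plus whatever it needs to read Hugging Face datasets) is installed.

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR config, e.g. 'ce' or 'bs'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code, split="train"):
    """Load one config; each config listed on this page has a single 'train' split."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split=split)


print(oscar_tfds_name("ce"))
```

Calling `load_oscar("ce")` would download the data on first use, so it is left out of the snippet itself.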

unshuffled_deduplicated_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	10130

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	80

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1172041

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3398679

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1770

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2458067

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	68210

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9006977

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	360

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	82

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	14724

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4771098

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	17024

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	84752

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	8203495

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fy

इस डेटासेट को TFDS में लोड करने के लिए निम्नलिखित कमांड का उपयोग करें:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')

विवरण :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	20661

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
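
Every configuration on this page follows the same naming pattern, so the load command can be parameterized by language code. The sketch below assumes that pattern holds for the code you pass in; `oscar_config_name` and `load_oscar` are hypothetical helper names, not part of TFDS.

```python
def oscar_config_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR configuration, e.g. 'fy'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code, split="train"):
    """Load one OSCAR configuration; only 'train' is listed on this page."""
    # Imported lazily so oscar_config_name works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)
```

Calling `load_oscar("fy")` is then intended to be equivalent to the `tfds.load` command shown for this configuration, restricted to the `train` split.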

unshuffled_deduplicated_gn

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	68

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cs

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	12308039

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hi

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1909387

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hu

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	6582908

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ie

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	11

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fr

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	59448891

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
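
With roughly 59 million examples in the `train` split, it may be more practical to load only a slice of the French configuration. This is a minimal sketch, assuming the TFDS split slicing syntax (for example `train[:1%]`) is available for this Hugging Face-backed dataset; `percent_slice` and `load_french_sample` are hypothetical helpers introduced here.

```python
def percent_slice(split, percent):
    """Return a split string such as 'train[:1%]' for the first `percent` percent."""
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer in the range 1..100")
    return f"{split}[:{percent}%]"


def load_french_sample(percent=1):
    """Load the first `percent` percent of the French 'train' split."""
    # Imported lazily so percent_slice can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        "huggingface:oscar/unshuffled_deduplicated_fr",
        split=percent_slice("train", percent),
    )
```

Depending on how the dataset is prepared, TFDS may still need to download and build the full configuration before the slice can be read, so the saving is mainly in iteration time.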

unshuffled_deduplicated_gd

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3883

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gu

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	169834

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hsb

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3084

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ia

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	529

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_io

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	617

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jbo

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	617

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_km

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	108346

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ku

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	29054

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_la

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	18808

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lmo

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1374

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lv

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	843195

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_min

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	166

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mr

To load this dataset in TFDS, use the following command:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	212556

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
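
Every config on this page follows the same naming pattern, so the load command can be built from a language code. The following is a minimal sketch, not part of the TFDS API: `oscar_name` and `load_oscar` are hypothetical helpers, and the sketch assumes the `tensorflow-datasets` package is installed and that passing `split='train'` to `tfds.load` is appropriate, since `'train'` is the only split listed here.

```python
# Name prefix taken from the load commands shown on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_name(lang_code: str) -> str:
    """Build the TFDS dataset name for one OSCAR language config (hypothetical helper)."""
    code = lang_code.strip()
    if not code:
        raise ValueError("language code must be non-empty")
    return PREFIX + code


def load_oscar(lang_code: str, split: str = "train"):
    """Load one config; tensorflow_datasets is imported lazily so the
    name helper above can be used without it installed."""
    import tensorflow_datasets as tfds  # assumption: package is installed

    return tfds.load(oscar_name(lang_code), split=split)
```

For example, `load_oscar("mr")` would issue the same request as the command shown for `unshuffled_deduplicated_mr`, restricted to the `'train'` split.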

unshuffled_deduplicated_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
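
The feature specification above is shared by every config: an `int64` field `id` and a `string` field `text`. As a rough illustration of what that schema implies, the sketch below checks a plain Python record against it. The helper `matches_features` and the dtype-to-Python mapping are assumptions made for this example; records returned by TFDS itself are tensors (with text as bytes), so they would need decoding before a check like this applies.

```python
import json

# The feature specification exactly as listed on this page.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to plain Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = _PY_TYPES[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


FEATURES = json.loads(FEATURES_JSON)
```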

unshuffled_deduplicated_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	58

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2126

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	6485

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	67921

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	28522082

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
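
Config sizes on this page range from a single example to tens of millions, so for the larger ones it can be convenient to request only the first few examples. The sketch below builds a split string in the TFDS slicing syntax (`'train[:N]'`). `head_split` is a hypothetical helper, and whether slicing avoids preparing the full dataset for `huggingface:` configs is not stated on this page, so treat that as an open assumption.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Build a TFDS split string selecting the first n_examples of a split."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"
```

The result can be passed as the `split` argument, for example `tfds.load('huggingface:oscar/unshuffled_deduplicated_it', split=head_split(1000))`.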

unshuffled_deduplicated_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	372158

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5044757

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	17

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3675420

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	68

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1381

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	72

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	13343

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	453904

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	183443

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	8714

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	109118

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
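Every subset on this page follows the same naming pattern, with only the trailing language code changing. As a rough sketch, the name can be built programmatically; `oscar_name` and `load_many` are hypothetical helpers, and `load_many` assumes `tensorflow_datasets` is installed and that the listed configurations are available in your environment.

```python
# Common prefix of the dataset names shown on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_name(lang_code):
    """Build the TFDS dataset name for one language subset."""
    return PREFIX + lang_code


def load_many(lang_codes):
    """Load several subsets; returns a dict keyed by language code."""
    import tensorflow_datasets as tfds  # imported lazily so the helper above works without it

    return {code: tfds.load(oscar_name(code)) for code in lang_codes}


print([oscar_name(code) for code in ("nn", "os", "pms")])
```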

unshuffled_deduplicated_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2559

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2859

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	411

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
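When a dataset with these two features is iterated through `tfds.as_numpy`, string values typically arrive as raw bytes rather than Python strings. The sketch below shows one way to normalize a record; `to_python` and `iter_records` are hypothetical helpers, the sample record is made up for illustration, and `iter_records` assumes `tensorflow_datasets` is installed.

```python
def to_python(record):
    """Convert one raw record into plain Python values (int id, str text)."""
    text = record["text"]
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    return {"id": int(record["id"]), "text": text}


def iter_records(name="huggingface:oscar/unshuffled_deduplicated_qu", split="train"):
    """Yield normalized records from the given subset."""
    import tensorflow_datasets as tfds  # imported lazily; only needed for real loading

    ds = tfds.load(name, split=split)
    for record in tfds.as_numpy(ds):
        yield to_python(record)


# Made-up record standing in for one element of the dataset.
print(to_python({"id": 7, "text": "example text".encode("utf-8")}))
```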

unshuffled_deduplicated_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	7121

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2820821

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	17610

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	42

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	645747

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	833101

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	4694

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	24

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	15074

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	677

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2418

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	11014487

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	56259

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	62398034

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
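With more than 62 million examples in `'train'`, it can be useful to work on a small slice first. TFDS accepts percentage slices in the `split` argument (for example `'train[:1%]'`). The sketch below builds such a split string and estimates the slice size; `percent_split`, `approx_examples`, and `load_sample` are hypothetical helpers, the estimate ignores how TFDS rounds slice boundaries, and whether slicing is honored by the `huggingface:` community loader is an assumption that should be checked in your environment.

```python
# Size of the 'train' split listed above for the German subset.
TRAIN_EXAMPLES_DE = 62_398_034


def percent_split(percent, split="train"):
    """Return a TFDS split string selecting the first `percent` percent of `split`."""
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer in the range 1..100")
    return f"{split}[:{percent}%]"


def approx_examples(total, percent):
    """Rough size of the slice; TFDS may round the boundary differently."""
    return total * percent // 100


def load_sample(percent=1):
    """Load only the first `percent` percent of the German subset."""
    import tensorflow_datasets as tfds  # imported lazily; only needed for real loading

    return tfds.load(
        "huggingface:oscar/unshuffled_deduplicated_de",
        split=percent_split(percent),
    )


print(percent_split(1), approx_examples(TRAIN_EXAMPLES_DE, 1))
```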

unshuffled_deduplicated_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	11596446

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	6521169

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
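
The dataset names on this page all follow the pattern `huggingface:oscar/<variant>_<language code>`. The sketch below builds that string from its parts. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and `load_oscar` assumes `tensorflow_datasets` is installed and that the `huggingface:` loader is available in your TFDS version (it may also need the Hugging Face `datasets` package).

```python
# The two variants that appear in the configuration names on this page.
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant, lang):
    """Build the TFDS name for an OSCAR configuration (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(variant, lang, split="train"):
    """Load one configuration; the import is deferred so the name helper
    above can be used without TensorFlow Datasets installed."""
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(variant, lang), split=split)


print(oscar_tfds_name("unshuffled_deduplicated", "el"))
# huggingface:oscar/unshuffled_deduplicated_el
```

Calling `load_oscar("unshuffled_deduplicated", "el")` would then issue the same `tfds.load` call shown in the entry above, restricted to the `'train'` split.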

unshuffled_deduplicated_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7782375

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9897709

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	64

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	49

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7324

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	158113

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	912330

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1675515

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2143

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4042

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	20281

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	84

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2093621

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	41708901

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2449

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	6999

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	42551

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5869686

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
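
Every load command in this catalog uses the same `huggingface:oscar/<config>` name pattern. The following is a minimal sketch, assuming `tensorflow_datasets` is installed and that the configuration can be downloaded in your environment. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical and exist only for this illustration; `tfds.load` and its `split` argument are the only library calls used.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR configuration listed in this catalog."""
    return f"huggingface:oscar/{config}"


def load_oscar(config: str, split: str = "train"):
    """Load one OSCAR configuration through TFDS.

    The import is done lazily so that the name helper above can be used
    without TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split=split)
```

Calling `load_oscar("unshuffled_original_bg")` is then equivalent to the command shown for that configuration with `split="train"` added. Given the feature listing, each element is expected to carry an `id` and a `text` field.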

unshuffled_original_bpy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	6046

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4390754

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	103639

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	56326016

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7664010

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	21018

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	121168

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5326443

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	46493

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	484

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
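
The configuration names in this catalog follow a visible pattern: a shuffling prefix, a variant (`original` or `deduplicated`), and a language code, joined by underscores. A minimal sketch of splitting a name into those parts is shown below. The `OscarConfig` type and `parse_config` function are hypothetical names introduced for this illustration, and the pattern itself is inferred from the names listed here rather than from any specification.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    shuffling: str  # e.g. "unshuffled"
    variant: str    # "original" or "deduplicated"
    language: str   # language code, e.g. "gom"


def parse_config(name: str) -> OscarConfig:
    """Split a configuration name such as 'unshuffled_deduplicated_gom'."""
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"unexpected configuration name: {name!r}")
    shuffling, variant, language = parts
    if variant not in ("original", "deduplicated"):
        raise ValueError(f"unexpected variant in {name!r}: {variant!r}")
    return OscarConfig(shuffling, variant, language)
```

This makes it straightforward to group configurations by language or to select only the deduplicated variants before loading.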

unshuffled_deduplicated_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	321484

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	396093

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1578

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	13704702

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	33053

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	106

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3264660

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	11197780

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	101

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
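
The `'train'` sizes of the `unshuffled_original_*` configurations listed in this part of the catalog span several orders of magnitude, from about a hundred examples to more than ten million. As a small sketch, the counts can be collected in a dictionary to pick out configurations that are small enough for a quick experiment. The `configs_under` helper and the 10,000-example threshold are hypothetical choices made for this illustration; the counts are copied from the split tables above.

```python
# 'train' example counts copied from the split tables in this part of the catalog.
TRAIN_EXAMPLES = {
    "unshuffled_original_bg": 5869686,
    "unshuffled_original_bpy": 6046,
    "unshuffled_original_ca": 4390754,
    "unshuffled_original_ckb": 103639,
    "unshuffled_original_da": 7664010,
    "unshuffled_original_dv": 21018,
    "unshuffled_original_eo": 121168,
    "unshuffled_original_fa": 13704702,
    "unshuffled_original_fy": 33053,
    "unshuffled_original_gn": 106,
    "unshuffled_original_hi": 3264660,
    "unshuffled_original_hu": 11197780,
    "unshuffled_original_ie": 101,
}


def configs_under(limit, sizes=TRAIN_EXAMPLES):
    """Return the configuration names with fewer than `limit` train examples, sorted by name."""
    return sorted(name for name, count in sizes.items() if count < limit)


small = configs_under(10_000)
# small == ['unshuffled_original_bpy', 'unshuffled_original_gn', 'unshuffled_original_ie']
```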

unshuffled_deduplicated_ja

इस डेटासेट को TFDS में लोड करने के लिए निम्नलिखित कमांड का उपयोग करें:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	39496439
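
With tens of millions of examples in this config, it can be useful to look at a small sample first. TFDS accepts slice expressions in the `split` argument, such as `train[:1%]` or `train[:1000]`. The helper below is a hypothetical convenience that only builds such a string; the result would be passed as `split=` to `tfds.load`. Slicing probably does not avoid preparing the full dataset on first use, so treat this as a sketch rather than a download-saving technique.

```python
def train_slice(count=None, percent=None):
    """Build a TFDS split string for the first `count` examples or `percent` percent."""
    if (count is None) == (percent is None):
        raise ValueError("pass exactly one of count or percent")
    if count is not None:
        return f"train[:{count}]"
    return f"train[:{percent}%]"


print(train_slice(count=1000))  # train[:1000]
print(train_slice(percent=1))   # train[:1%]
# Possible use (assumes tensorflow_datasets is imported as tfds):
# ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja', split=train_slice(percent=1))
```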

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	338073

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1377

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	86561

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	118

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1737411

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	2515

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	197878

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	16383

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	917

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	219334

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3229940

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	87235

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3463

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	34

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	8555

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	120684

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	461598

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	24803

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3749826

विशेषताएँ :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
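
The feature listing above describes each example as an integer `id` plus a `text` string. As a minimal sketch of what that implies for a single record, the check below parses the same spec and tests plain Python dictionaries against it. The `PYTHON_TYPES` mapping and the `conforms` helper are illustrative assumptions, not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature spec copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the spec's dtype names to Python types.
PYTHON_TYPES = {"int64": int, "string": str}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def conforms(record, features):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PYTHON_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # An int64 field cannot hold values outside the signed 64-bit range.
        if spec["dtype"] == "int64" and not INT64_MIN <= value <= INT64_MAX:
            return False
    return True


features = json.loads(FEATURES_JSON)
print(conforms({"id": 0, "text": "example"}, features))
print(conforms({"id": "0", "text": "example"}, features))
```

The first call passes a well-formed record; the second passes `id` as a string, which the check rejects.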

unshuffled_deduplicated_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	82738

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
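
The load commands on this page all follow the pattern `huggingface:oscar/<config>`, and every config listed here has a single `'train'` split. The sketch below wraps that pattern in two helpers. `tfds_name` and `preview` are hypothetical names introduced for illustration; `tfds.load`, its `split` argument, `Dataset.take` and `tfds.as_numpy` are the standard TFDS and `tf.data` calls.

```python
def tfds_name(config):
    """Build the TFDS name for an OSCAR config, following the load commands on this page."""
    return f"huggingface:oscar/{config}"


def preview(config, n=3):
    """Load the 'train' split of `config` and return its first `n` examples as NumPy values."""
    # Imported lazily so tfds_name() works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(tfds_name(config), split="train")
    # Given the features listed above, each example is expected to be a dict
    # with the keys 'id' and 'text'.
    return list(tfds.as_numpy(ds.take(n)))


print(tfds_name("unshuffled_deduplicated_tt"))
```

Calling `preview("unshuffled_deduplicated_tt")` would download and prepare the config on first use, so it needs network access and the `tensorflow_datasets` package.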

unshuffled_deduplicated_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	428674

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3317

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	36

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_am

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	83663

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	14985

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	15446

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	586031

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	26795

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	42

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	56248

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	157698

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	65

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	96742378

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
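
With 96,742,378 training examples, this config is far larger than most others on this page, so reading only a prefix can be convenient while experimenting. The sketch below assumes that TFDS's split-slicing syntax (`'train[:N]'`) applies to this community dataset as it does to built-in ones. `head_split` and `load_head` are hypothetical helper names, not TFDS functions.

```python
def head_split(n, split="train"):
    """Return a TFDS split string selecting the first `n` examples of `split`."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return f"{split}[:{n}]"


def load_head(name, n):
    """Load only the first `n` examples of the 'train' split of `name`."""
    # Imported lazily so head_split() works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(n))


print(head_split(1000))
```

For example, `load_head('huggingface:oscar/unshuffled_original_fr', 1000)` would request the first 1000 examples. The slice limits what is read, but the full config may still need to be downloaded and prepared first.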

unshuffled_original_gd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5799

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	240691

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	7959

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1040

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	694

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
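
The configuration names on this page follow a visible pattern: `unshuffled_original_<language>` or `unshuffled_deduplicated_<language>`, prefixed with `huggingface:oscar/` when passed to `tfds.load`. The sketch below builds that name from its parts. The helper names are hypothetical, and the loader assumes `tensorflow_datasets` is installed and that the `'train'` split listed above is the one wanted.

```python
def oscar_config(language, deduplicated=False):
    """Build a config name such as 'unshuffled_original_io'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{language}"


def oscar_tfds_name(language, deduplicated=False):
    """Build the full name passed to tfds.load on this page."""
    return f"huggingface:oscar/{oscar_config(language, deduplicated)}"


def load_oscar_train(language, deduplicated=False):
    """Load the 'train' split of one OSCAR config (downloads data)."""
    # Imported lazily so the name helpers work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split="train")


print(oscar_tfds_name("io"))
print(oscar_tfds_name("af", deduplicated=True))
```

Keeping the name construction separate from the loading call makes it possible to list or validate config names without triggering a download.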

unshuffled_original_jbo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	832

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	159363

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	46535

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	94588

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1401

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1593820

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	220

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	326804

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	8

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	61

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4696

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	10709

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	3

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	98216

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9387265

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	21

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	5492194

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1013619

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1263280

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	6456

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
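
The features block above declares two fields per example: an `int64` `id` and a `string` `text`. As a minimal sketch, a record can be checked against that declaration in plain Python. This assumes the record has already been converted to plain Python values (TFDS itself yields tensors, and text typically arrives as bytes that need decoding); the `PY_TYPES` mapping and the `matches_features` helper are hypothetical names introduced for illustration.

```python
import json

# The features declaration as shown on this page.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype strings above to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict = FEATURES) -> bool:
    """Return True if the record has exactly the declared fields with fitting values."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["dtype"]]):
            return False
        # An int64 field must fit in a signed 64-bit range.
        if spec["dtype"] == "int64" and not -(2**63) <= value < 2**63:
            return False
    return True
```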

unshuffled_original_tyv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	34

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	27537

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1001

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	3783

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	46981781

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
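
With 46,981,781 examples in its `'train'` split, this config is large enough that inspecting a handful of records first is usually more practical than materializing everything. The sketch below is a generic preview helper, assuming only that the loaded dataset can be iterated and yields one dict-like example at a time with the `id` and `text` fields listed above. The name `preview` is hypothetical and not part of TFDS.

```python
from itertools import islice
from typing import Iterable, Iterator


def preview(examples: Iterable[dict], n: int = 3) -> Iterator[dict]:
    """Yield at most n examples, reading no more of the input than needed."""
    return islice(examples, n)
```

Because `islice` is lazy, the remaining examples are never read, so the same helper works on an arbitrarily long stream.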

unshuffled_original_ka

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	563916

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	7345075

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	203

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1485

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	88

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	17957

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	603937

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	534016

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	6

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	18174

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	185884

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	5213

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	3225

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	452

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	14291

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	36700

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	156

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	17395625

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	89002

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	18535253

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	12973467

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	14898250

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	214

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	214

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	60137667

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	304230423

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
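
The `train` split sizes on this page differ by several orders of magnitude, from a few hundred examples to 304230423 for this config. The sketch below collects a handful of the counts listed above into a mapping and picks the configs small enough for a quick experiment. The helper `configs_at_most` and the threshold of 1000 are illustrative assumptions, not part of TFDS.

```python
# Train-split sizes copied from the tables on this page.
TRAIN_EXAMPLES = {
    "unshuffled_original_qu": 452,
    "unshuffled_original_so": 156,
    "unshuffled_original_wuu": 214,
    "unshuffled_original_zh": 60137667,
    "unshuffled_deduplicated_en": 304230423,
}


def configs_at_most(counts: dict, max_examples: int) -> list:
    """Return config names with at most `max_examples` rows, smallest first."""
    small = [name for name, n in counts.items() if n <= max_examples]
    return sorted(small, key=lambda name: counts[name])


print(configs_at_most(TRAIN_EXAMPLES, 1000))
```

Starting with one of the small configs can be a cheap way to exercise a pipeline before committing to a download of one of the very large ones.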

unshuffled_deduplicated_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	256513

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	7

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	284320

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2375030

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	9

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	9948521

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	389515

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1163

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
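
The configurations on this page share one naming pattern: `huggingface:oscar/unshuffled_deduplicated_` followed by a language code. A minimal sketch of building those names programmatically, assuming that pattern holds for every language listed; the helpers `oscar_config_name` and `load_oscar_train` are hypothetical, and `load_oscar_train` additionally assumes `tensorflow_datasets` is installed and the data can be downloaded:

```python
def oscar_config_name(lang_code, variant="unshuffled_deduplicated"):
    """Build a name such as 'huggingface:oscar/unshuffled_deduplicated_jv'."""
    return f"huggingface:oscar/{variant}_{lang_code}"


def load_oscar_train(lang_code):
    """Load the 'train' split for one language (needs tensorflow_datasets and network access)."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(oscar_config_name(lang_code), split="train")


names = [oscar_config_name(code) for code in ("jv", "kn", "kv")]
print(names)
```

Small configurations such as `jv` (1163 examples) are a reasonable first target when trying this out, since the larger ones on this page run to tens of millions of examples.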

unshuffled_deduplicated_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	251064

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	924

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	21735

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	32652

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	25

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	299457

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	669

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	136639

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	55

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	20812149

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	44230

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	20682611

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	26920397

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	115954598

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	33925

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	886223

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	511

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	312644

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	294132

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
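
The dataset names on this page follow one visible pattern: the `huggingface:oscar/` prefix, then a variant (`unshuffled_deduplicated` or `unshuffled_original`), then a language code. A minimal sketch of building such a name is shown below. The helper `oscar_tfds_name` is hypothetical and not part of TFDS; it only formats the string that would be passed to `tfds.load`.

```python
# Variants that appear in the configuration names on this page.
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Return the TFDS name for an OSCAR configuration (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must not be empty")
    return f"huggingface:oscar/{variant}_{lang}"


name = oscar_tfds_name("unshuffled_deduplicated", "tl")
print(name)
```

The resulting string can then be passed to `tfds.load(name)`, as in the command shown for each configuration.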

unshuffled_deduplicated_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	15503

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
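
Every configuration on this page lists the same two features: an `int64` `id` and a `string` `text`. As a rough illustration, the sketch below checks a record against that specification. It assumes records have already been converted to plain Python dictionaries (TFDS itself yields tensors, and string features typically arrive as bytes); the function `matches_features` is hypothetical.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = _PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        if spec["dtype"] == "int64" and not -2**63 <= value < 2**63:
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))
```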

unshuffled_deduplicated_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	64

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9161

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	32919

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	201117

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
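
The `unshuffled_original_af` configuration lists 201117 training examples, while the `unshuffled_deduplicated_af` entry earlier on this page lists 130640. A small sketch of comparing the two counts follows; the function name `retained_fraction` is hypothetical, and the calculation only compares the example counts reported on this page, not the underlying text.

```python
def retained_fraction(original: int, deduplicated: int) -> float:
    """Fraction of examples listed for the deduplicated variant relative to the original."""
    if original <= 0:
        raise ValueError("original count must be positive")
    if not 0 <= deduplicated <= original:
        raise ValueError("deduplicated count must be between 0 and the original count")
    return deduplicated / original


# Train split sizes as listed on this page for the Afrikaans (af) configurations.
AF_ORIGINAL = 201117
AF_DEDUPLICATED = 130640

fraction = retained_fraction(AF_ORIGINAL, AF_DEDUPLICATED)
print(f"{fraction:.1%} of the listed af examples remain in the deduplicated variant")
```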

unshuffled_original_ar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	16365602

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	456

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	4

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	336

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	37085

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1
 
Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
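
The train split sizes listed for the `unshuffled_original_*` configurations vary widely, from a single example (`cbk`) to millions (`ar`). When trying out a loading pipeline, it can be convenient to start with the smallest configurations. The sketch below filters the counts listed so far by a threshold; the dictionary holds only the counts shown on this page, and `configs_at_most` is a hypothetical helper.

```python
# Train split sizes as listed above for several unshuffled_original configurations.
TRAIN_EXAMPLES = {
    "unshuffled_original_af": 201117,
    "unshuffled_original_ar": 16365602,
    "unshuffled_original_av": 456,
    "unshuffled_original_bar": 4,
    "unshuffled_original_bh": 336,
    "unshuffled_original_br": 37085,
    "unshuffled_original_cbk": 1,
}


def configs_at_most(counts: dict, max_examples: int) -> list:
    """Return configuration names with at most max_examples, smallest first."""
    small = [(n, name) for name, n in counts.items() if n <= max_examples]
    return [name for _, name in sorted(small)]


for config in configs_at_most(TRAIN_EXAMPLES, 1000):
    print(f"huggingface:oscar/{config}", TRAIN_EXAMPLES[config])
```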

unshuffled_original_cs

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	21001388

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	104913504

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	10425596

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	88199221

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	8557453

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	83223

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	640

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	582219

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	659430

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
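
As a rough illustration of how the load command for this configuration might be used, the sketch below builds the TFDS name from the configuration name and prints a few examples. The helper names `tfds_name` and `preview` are hypothetical. `preview` assumes `tensorflow_datasets` is installed, that the data can be downloaded, and that text values come back as UTF-8 bytes.

```python
def tfds_name(config):
    """Build the TFDS name for an OSCAR configuration, following the pattern on this page."""
    return f"huggingface:oscar/{config}"


def preview(config, n=3):
    """Print the first n examples of the train split.

    Needs tensorflow_datasets and network access, so it is not called here.
    """
    # Imported lazily so that tfds_name() works without TFDS installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(tfds_name(config), split="train")
    for example in tfds.as_numpy(ds.take(n)):
        # 'id' and 'text' are the two features listed in the schema above.
        print(example["id"], example["text"].decode("utf-8")[:80])
```

Calling `preview("unshuffled_original_hy")` would then show the `id` and the first 80 characters of `text` for three records, assuming the download succeeds.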

unshuffled_original_ilo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2638

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
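
The configuration names on this page appear to follow an `<ordering>_<variant>_<language>` pattern, for example `unshuffled_original_ilo`. The sketch below splits a name into those three parts. The pattern is inferred from the names listed here rather than from a documented rule, and `OscarConfig` and `parse_config` are hypothetical names.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    ordering: str  # e.g. "unshuffled"
    variant: str   # e.g. "original" or "deduplicated"
    language: str  # language code, e.g. "ilo"


def parse_config(name):
    """Split a configuration name into ordering, variant, and language code."""
    # maxsplit=2 keeps any later underscores inside the language code.
    ordering, variant, language = name.split("_", 2)
    return OscarConfig(ordering, variant, language)
```

With this, `parse_config("unshuffled_original_ilo").language` gives `"ilo"`, which can help when grouping configurations by language or by variant.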

unshuffled_original_ja

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	62721527

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
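
With tens of millions of examples in the `train` split, it can be convenient to read only the first few records while experimenting. The sketch below uses the TFDS split-slicing syntax (`train[:N]`). The helper names `head_split` and `load_head` are hypothetical, and `load_head` assumes `tensorflow_datasets` is installed. Slicing limits what is read, but TFDS may still need to download and prepare the full configuration first.

```python
def head_split(n, split="train"):
    """Return a TFDS split string selecting the first n examples of a split."""
    return f"{split}[:{n}]"


def load_head(config, n):
    """Load only the first n training examples of an OSCAR configuration.

    Needs tensorflow_datasets and network access, so it is not called here.
    """
    # Imported lazily so that head_split() works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(f"huggingface:oscar/{config}", split=head_split(n))
```

For instance, `load_head("unshuffled_original_ja", 1000)` would request the split `train[:1000]`.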

unshuffled_original_kk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	524591

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1581

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	146993

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	137

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	2977757

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	3212

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	395605

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	26598

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1055

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	299938

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	5546211

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	127467

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	4599

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	41

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	22301

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	203082

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	672077

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	41986

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
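
The config names on this page follow a visible pattern: `huggingface:oscar/` followed by `unshuffled_original_<lang>` or `unshuffled_deduplicated_<lang>`. The following is a minimal sketch of a helper that builds such a name and loads it. `oscar_tfds_name` and `load_oscar` are hypothetical helpers introduced here for illustration, not part of TFDS, and `load_oscar` assumes that `tensorflow_datasets` is installed and that the data can be downloaded.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name of an OSCAR config, following the names on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False):
    """Load the 'train' split, the only split listed for these configs."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split="train")


print(oscar_tfds_name("sw"))
```

Calling `load_oscar("sw")` should then be equivalent to the `tfds.load` command shown for `unshuffled_original_sw`, restricted to the `'train'` split.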

unshuffled_original_th

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	6064129

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
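
Every config on this page lists the same two features: an `int64` field `id` and a `string` field `text`. As a rough sketch, a record can be checked against that listing with the standard library alone. `matches_features` is a hypothetical helper, and accepting both `str` and `bytes` for `text` is an assumption made here, since TensorFlow pipelines often yield strings as bytes.

```python
import json

# The feature listing shown above, condensed onto fewer lines.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(example: dict) -> bool:
    """Return True if `example` has exactly the listed fields with compatible types."""
    if set(example) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = example[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, (str, bytes)):
                return False
    return True


print(matches_features({"id": 0, "text": "example text"}))
```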

unshuffled_original_tt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	135923

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	638596

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	3366

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	39

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	11

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	455994980

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
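
With roughly 456 million examples in `'train'`, it can be convenient to read only the first few records of this config while experimenting. The sketch below builds a split string in the TFDS slicing syntax (`'train[:N]'`). `head_split` and `load_head` are hypothetical helpers, and it is an assumption here that slicing behaves for `huggingface:` datasets as it does for native TFDS datasets. Slicing limits which examples are read; it may not reduce how much data has to be downloaded and prepared first.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a split spec selecting the first `n_examples` of `split`."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_head(name: str, n_examples: int):
    """Load only the first `n_examples` examples of the 'train' split."""
    # Imported lazily so head_split works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=head_split(n_examples))


print(head_split(1000))
```

For example, `load_head('huggingface:oscar/unshuffled_original_en', 1000)` would request the split `'train[:1000]'`.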

unshuffled_original_eu

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	506883

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	7

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	544388

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	3808397

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	13

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	16236463

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	625673

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1445

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	350363

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	1549

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	34807

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :

Split	Examples
`'train'`	52910

Features :

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')

Description :

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	123

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	437871

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	757

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	232329

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	73

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	34682142

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	59463

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	35440972

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	42114520

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	161836003
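With roughly 162 million examples, this `train` split is large enough that a small sample is often more practical for a first look. TFDS accepts slice expressions such as `train[:10000]` as the `split` argument of `tfds.load`. The sketch below only builds such an expression; the `head_split` helper and the sample size are hypothetical choices for illustration, and slicing may not avoid downloading and preparing the full dataset first.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Build a TFDS slice expression for the first n examples (hypothetical helper)."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


# Example count of the 'train' split, taken from the table above.
RU_TRAIN_EXAMPLES = 161_836_003

sample_size = 10_000  # arbitrary sample size chosen for illustration
spec = head_split(sample_size)
print(spec)
print(f"covers {sample_size / RU_TRAIN_EXAMPLES:.4%} of the split")

# Assumed usage, requiring the tensorflow-datasets package:
# ds = tfds.load('huggingface:oscar/unshuffled_original_ru', split=spec)
```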

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	44280

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	1746604

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	805

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	475703

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	458206

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	22255

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	73

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	9760

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')

Description:

The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.

License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:

Split	Examples
`'train'`	59364

Features:

{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}