


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  130640
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
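
The feature schema above can be used to sanity-check records before further processing. The following is a minimal sketch using only the Python standard library; the `DTYPE_TO_PYTHON` mapping and the `matches_schema` helper are illustrative assumptions and not part of TFDS or the Hugging Face tooling.

```python
import json

# Feature schema as listed in the entry above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from schema dtype names to plain Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_schema(record, features):
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "..."}` passes this check, while a record with a string `id` or a missing `text` field does not.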


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4518
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
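
Every entry on this page uses the same naming pattern for its load command, with only the trailing language code changing. A small sketch of how those names could be built or parsed; the helper names `config_name` and `lang_code` are hypothetical, and the prefix is taken from the load commands shown in the entries.

```python
# Prefix shared by the load commands on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def config_name(code: str) -> str:
    """Build the dataset name passed to tfds.load for a language code."""
    return PREFIX + code


def lang_code(name: str) -> str:
    """Extract the language code from a dataset name with the shared prefix."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected dataset name: {name!r}")
    return name[len(PREFIX):]


names = [config_name(code) for code in ("af", "als")]
```

The resulting strings are the same ones that appear in the `tfds.load(...)` calls of the first two entries.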


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  79928
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
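
The split sizes vary widely between language configurations, so it can help to compare them before choosing one. A brief sketch, assuming the 'train' example counts copied from the three entries above (af, als, arz); the dictionary and function names are illustrative only.

```python
# 'train' example counts copied from the af, als, and arz entries above.
TRAIN_EXAMPLES = {"af": 130640, "als": 4518, "arz": 79928}


def largest_first(counts):
    """Return (language code, example count) pairs sorted by count, descending."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


total = sum(TRAIN_EXAMPLES.values())
ranked = largest_first(TRAIN_EXAMPLES)
```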


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2025
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5343
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  27050
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  43102
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9212
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9985
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  307405
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  15762
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  36
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  26145
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  626796
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  98225
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  37
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1114481
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  702
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you believe that our data contains material that you own and that should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • संस्करण : 1.0.0

  • विभाजन :

विभाजित करना उदाहरण
'train' 2984
  • विशेषताएँ :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
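
Every entry on this page uses the same naming pattern in its load command, with only the language code changing. The sketch below builds that name from a language code. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the `variant` parameter is an assumption: only `unshuffled_deduplicated` appears in this section.

```python
def oscar_tfds_name(lang, variant="unshuffled_deduplicated"):
    """Build the TFDS name for an OSCAR config, following the pattern shown on this page."""
    # Language codes in this section are short lowercase ASCII strings such as 'ce' or 'ckb'.
    if not (lang.isascii() and lang.isalpha() and lang.islower()):
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(lang, split="train"):
    """Load one OSCAR config through TFDS.

    tensorflow_datasets is imported lazily so that oscar_tfds_name
    can be used without it being installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split=split)
```

For example, `oscar_tfds_name("ce")` gives the same string that is passed to `tfds.load` in the command above.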


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 10130
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 80
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1172041
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3398679
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1770
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 2458067
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68210
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 9006977
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
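
With roughly nine million training examples, a config of this size may be easier to explore through a slice of the split than through the whole thing. TFDS accepts slice expressions such as `train[:1000]` or `train[:1%]` as the `split` argument. The sketch below only builds those strings. The helper names `head_split` and `percent_split` are hypothetical, and it is an assumption that the `huggingface:` wrapper honors slicing the same way native TFDS datasets do.

```python
def head_split(n_examples, split="train"):
    """Build a split string selecting the first n_examples of a split."""
    if not isinstance(n_examples, int) or n_examples <= 0:
        raise ValueError("n_examples must be a positive integer")
    return f"{split}[:{n_examples}]"


def percent_split(percent, split="train"):
    """Build a split string selecting the first `percent` percent of a split."""
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer between 1 and 100")
    return f"{split}[:{percent}%]"


# Possible usage (requires tensorflow_datasets and downloads data):
# ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar',
#                split=head_split(1000))
```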


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 360
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 82
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 14724
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 4771098
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 17024
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 84752
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 8203495
  • Features:
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 20661
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
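
The load commands on this page differ only in the language suffix of the configuration name. As a minimal sketch, the helper `oscar_tfds_name` below builds that name from a language code. Both `oscar_tfds_name` and `load_oscar` are hypothetical names introduced for illustration, and `load_oscar` assumes `tensorflow_datasets` is installed with support for Hugging Face datasets.

```python
def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS name used on this page for a deduplicated OSCAR config.

    Hypothetical helper: it only formats the string and does not check
    that the configuration actually exists.
    """
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("lang_code must not be empty")
    return f"huggingface:oscar/unshuffled_deduplicated_{code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Hypothetical wrapper around tfds.load for one OSCAR language config."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang_code), split=split)


if __name__ == "__main__":
    # Only formats the name; nothing is downloaded here.
    print(oscar_tfds_name("fy"))
```

Calling `load_oscar("fy")` would then be equivalent to the command shown above with `split='train'`, which is the only split listed for these configurations.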


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 68
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
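
Every configuration listed on this page declares the same two features: an `int64` `id` and a `string` `text`. The sketch below checks a plain Python record against that schema. `FEATURES` and `matches_features` are illustrative names, not part of TFDS or the Hugging Face API, and the explicit int64 range check is an assumption about how one might validate ids.

```python
# Feature schema as listed on this page, written as a Python dict.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True
```

A check like this can be useful when writing small fixtures for code that will later consume the real dataset, since it catches a missing `text` field or a string-valued `id` early.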


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 12308039
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1909387
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 6582908
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 11
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 59448891
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3883
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 169834
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 3084
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 529
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 617
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 108346
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 29054
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 18808
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 1374
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 843195
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 166
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, INRIA has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split Examples
'train' 212556
  • Features:
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


इस डेटासेट को TFDS में लोड करने के लिए निम्नलिखित कमांड का उपयोग करें:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • विवरण :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
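
The load commands on this page differ only in the language suffix of the config name. As a minimal sketch, the helper `oscar_tfds_name` below builds that name string and `load_oscar` passes it to `tfds.load`. Both helper names are hypothetical and not part of TFDS. The sketch assumes `tensorflow_datasets` and the Hugging Face `datasets` package are installed; TFDS is imported lazily so the name helper can be used without it.

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name of a deduplicated OSCAR config (hypothetical helper)."""
    lang = lang.strip().lower()
    if not lang:
        raise ValueError("language code must be non-empty")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS; the first call needs network access."""
    # Imported lazily: assumed to be installed, not required for the name helper.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang), split=split)


print(oscar_tfds_name("mwl"))  # prints huggingface:oscar/unshuffled_deduplicated_mwl
```

Calling `load_oscar("mwl")` would then correspond to the command shown for the `mwl` config, restricted to its only split, `'train'`.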


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 58

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
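
Every config on this page shares the same two features: an `int64` field `id` and a `string` field `text`. The sketch below checks a record against that layout. It is illustrative only: `OSCAR_FEATURES` and `validate_record` are hypothetical names, and the check assumes records have already been converted to plain Python values (an `int` and a decoded `str`), which is not what TFDS yields by default.

```python
# Expected fields and Python types, mirroring the feature listing
# (int64 -> int, string -> str). Assumption: values are plain Python objects.
OSCAR_FEATURES = {"id": int, "text": str}


def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected in OSCAR_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    for name in record:
        if name not in OSCAR_FEATURES:
            problems.append(f"unexpected field: {name}")
    return problems


print(validate_record({"id": 0, "text": "example"}))  # prints []
```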


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2126

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6485

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
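
The `'train'` sizes listed above range from a single example to several thousand, so it can help to pick the smallest configs for a quick smoke test. The sketch below copies the counts for the `mwl`, `nah`, `new`, `oc` and `pam` entries from this page. `TRAIN_EXAMPLES` and `configs_up_to` are hypothetical names used only for illustration.

```python
# 'train' example counts copied from the entries listed on this page.
TRAIN_EXAMPLES = {
    "mwl": 7,
    "nah": 58,
    "new": 2126,
    "oc": 6485,
    "pam": 1,
}


def configs_up_to(max_examples: int) -> list:
    """Return language codes with at most max_examples rows, smallest first."""
    small = [(n, lang) for lang, n in TRAIN_EXAMPLES.items() if n <= max_examples]
    return [lang for n, lang in sorted(small)]


print(configs_up_to(100))  # prints ['pam', 'mwl', 'nah']
```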


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 67921

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 28522082

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 372158

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5044757

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3675420

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1381

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 72

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13343

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 453904

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 183443

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8714

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 109118

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2559
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
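The feature listing describes each example as a record with an integer `id` and a string `text`. As a rough sketch, a plain Python record can be checked against that listing as shown below; `matches_features` is a hypothetical helper written for this illustration, not part of TFDS or the Hugging Face libraries, and it only covers the two dtypes that appear on this page.

```python
import json

# Feature schema as listed on this page for each OSCAR subset.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def matches_features(record: dict, features: dict = FEATURES) -> bool:
    """Return True if `record` has exactly the listed fields with compatible types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False  # dtype not covered by this sketch
    return True
```

A check like this can be useful when converting examples to another storage format, where a mistyped `id` would otherwise go unnoticed.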


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2859
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7121
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2820821
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17610
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 645747
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 833101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 24
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15074
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 677
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2418
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11014487
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56259
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62398034
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11596446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6521169
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7782375
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
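
The dataset names in this catalog follow one pattern: `huggingface:oscar/unshuffled_<variant>_<lang>`, where the variant is `deduplicated` or `original` and the suffix is a language code such as `uk`. As a minimal sketch, the helper below builds such a name so that several subsets can be loaded in a loop. `oscar_tfds_name` is a hypothetical name, not part of TFDS, and it does not verify that a given language code exists.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = True) -> str:
    """Build a TFDS name like the ones listed in this catalog.

    Hypothetical helper: it only formats the string and does not check
    that the language code actually exists in OSCAR.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


if __name__ == "__main__":
    for lang in ("uk", "vi"):
        # Pass the result to tfds.load(...) to load that subset.
        print(oscar_tfds_name(lang))
```

The returned string can be passed directly as the first argument of `tfds.load`.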


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9897709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
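
The feature listing above says each example carries an `id` (`int64`) and a `text` (`string`). A minimal sketch of checking a plain Python record against that spec follows. `matches_oscar_features` is a hypothetical helper, and it assumes records have already been converted to Python `int` and `str` values (or `bytes`, which TensorFlow uses for string tensors).

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if `record` has exactly an int64 `id` and a string `text`."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    if not INT64_MIN <= rec_id <= INT64_MAX:
        return False
    # TensorFlow string tensors decode to bytes; accept both forms.
    return isinstance(text, (str, bytes))
```

A check like this may be handy for very small subsets, where inspecting every record by hand is still feasible.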


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
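
With about 41.7 million examples in `train`, it may be convenient to work with only part of this subset. TFDS accepts slice expressions such as `train[:1%]` or `train[:1000]` in the `split` argument of `tfds.load`. Whether slicing avoids preparing the full data for these `huggingface:` datasets is not stated here, so treat the sketch below as an assumption. `train_slice` is a hypothetical helper that only builds the split string.

```python
def train_slice(count: int, percent: bool = False) -> str:
    """Build a split string selecting the first `count` examples (or percent)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if percent and count > 100:
        raise ValueError("percent must be between 0 and 100")
    unit = "%" if percent else ""
    return f"train[:{count}{unit}]"


if __name__ == "__main__":
    # Intended use (not executed here):
    #   tfds.load('huggingface:oscar/unshuffled_deduplicated_zh',
    #             split=train_slice(1, percent=True))
    print(train_slice(1, percent=True))  # train[:1%]
    print(train_slice(1000))             # train[:1000]
```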


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
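
The config names on this page appear to follow a fixed pattern: a variant prefix (`unshuffled_original` or `unshuffled_deduplicated`) followed by a language code. The sketch below builds the TFDS name from those two parts and loads the `train` split. The helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical and not part of TFDS; the loader assumes `tensorflow_datasets` is installed and that the Hugging Face dataset is reachable over the network.

```python
def oscar_tfds_name(language: str, deduplicated: bool = False) -> str:
    """Build a TFDS name following the naming pattern used on this page."""
    variant = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"huggingface:oscar/{variant}_{language}"


def load_oscar_train(language: str, deduplicated: bool = False):
    """Load the 'train' split (the only split listed for these configs)."""
    # Imported lazily so that oscar_tfds_name works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split="train")


if __name__ == "__main__":
    # Only builds the name; nothing is downloaded here.
    print(oscar_tfds_name("bpy"))
```

Calling `load_oscar_train("bpy")` would then be equivalent to the `tfds.load` command shown for this entry, restricted to the `train` split.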


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
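
Every config on this page lists the same two features: an `int64` `id` and a `string` `text`. As a minimal sketch, the check below tests a plain Python record against that listing. It assumes records have already been converted to Python `int` and `str` values (TFDS itself typically yields tensors or bytes), and the names `FEATURES`, `PY_TYPES` and `matches_features` are hypothetical.

```python
# Feature listing as shown on this page, written as a Python dict.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from the listed dtypes to Python types, for illustration only.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict) -> bool:
    """Return True if the record has exactly the listed keys with matching types."""
    if set(record) != set(FEATURES):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )


if __name__ == "__main__":
    print(matches_features({"id": 0, "text": "example"}))
```

A stricter version could also verify that `id` fits in the signed 64-bit range.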


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

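The load commands on this page differ only in the language code at the end of the configuration name. As a rough sketch, assuming that naming pattern holds for every configuration, the name can be built from the code. The helper names `oscar_dataset_name` and `load_oscar` are hypothetical. Actually loading needs `tensorflow_datasets` installed and network access, and presumably the Hugging Face `datasets` package as well.

```python
# Language codes taken from configuration names listed on this page.
LANG_CODES = ["ja", "kk", "krc", "ky", "li", "lt"]


def oscar_dataset_name(lang_code):
    """Build the TFDS name for an unshuffled, deduplicated OSCAR config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code, split="train"):
    """Load one config; requires tensorflow_datasets and network access."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds
    return tfds.load(oscar_dataset_name(lang_code), split=split)


for code in LANG_CODES:
    print(oscar_dataset_name(code))
```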

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

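The 'train' split sizes on this page range from about a hundred examples to tens of millions, so a quick inspection may only need the first few records. The sketch below builds a TFDS split-slicing string of the form `train[:N]`, capped at the listed example count. The `head_split` helper is hypothetical, the counts are copied from the tables on this page, and it is an assumption that slicing behaves the same for these community datasets as for native TFDS ones.

```python
# Example counts for the 'train' split, copied from the tables on this page.
TRAIN_EXAMPLES = {"ja": 39496439, "lt": 1737411, "li": 118}


def head_split(lang_code, n, split="train"):
    """Return a TFDS split string selecting at most the first n examples."""
    if n <= 0:
        raise ValueError("n must be positive")
    total = TRAIN_EXAMPLES[lang_code]
    # Cap the request at the listed split size.
    return f"{split}[:{min(n, total)}]"


print(head_split("lt", 1000))  # train[:1000]
print(head_split("li", 1000))  # train[:118]
```

The resulting string would be passed as the `split` argument of `tfds.load`.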

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
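
Every entry on this page uses the same name pattern in its `tfds.load` call: `huggingface:oscar/unshuffled_<variant>_<language>`, where the variant is `deduplicated` or `original`. The following is a minimal sketch of building that string programmatically. The helper names (`oscar_config_name`, `oscar_tfds_name`, `load_oscar_train`) are hypothetical and not part of TFDS; only `tfds.load` and its `split` argument come from the library, and whether a given config actually loads depends on the installed TFDS version.

```python
def oscar_config_name(language: str, deduplicated: bool = True) -> str:
    """Return the OSCAR config name, e.g. 'unshuffled_deduplicated_tt'."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{language}"


def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Return the full name string passed to tfds.load on this page."""
    return f"huggingface:oscar/{oscar_config_name(language, deduplicated)}"


def load_oscar_train(language: str, deduplicated: bool = True):
    """Load the 'train' split (hypothetical wrapper; needs tensorflow_datasets)."""
    # Imported lazily so the name helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split="train")


if __name__ == "__main__":
    print(oscar_tfds_name("tt"))                      # deduplicated Tatar config
    print(oscar_tfds_name("am", deduplicated=False))  # original Amharic config
```

Small configs such as `xal` (36 examples) or `yue` (7 examples) are convenient for a quick smoke test before loading a large one such as `unshuffled_original_fr`.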


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
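
All configs on this page share the same two features: an `int64` field `id` and a `string` field `text`. As a rough illustration, a record can be checked against that schema as sketched below. The function `check_example` is hypothetical, and the sketch assumes records have already been converted to plain Python or NumPy scalar values; accepting `bytes` for `text` is an assumption about how string features may be surfaced.

```python
import numbers


def check_example(example: dict) -> bool:
    """Return True if `example` has exactly an integer "id" and a textual "text"."""
    if set(example) != {"id", "text"}:
        return False
    id_value = example["id"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    id_ok = isinstance(id_value, numbers.Integral) and not isinstance(id_value, bool)
    # Assumption: string features may arrive as str or as raw bytes.
    text_ok = isinstance(example["text"], (str, bytes))
    return id_ok and text_ok


if __name__ == "__main__":
    print(check_example({"id": 0, "text": "sample line"}))
    print(check_example({"id": "0", "text": "sample line"}))
```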


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14985
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 586031
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26795
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56248
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 157698
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 65
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 96742378
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5799
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 240691
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7959
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1040
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 832
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 159363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46535
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 94588
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1401
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1593820
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 220
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 326804
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 61
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4696
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 98216
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9387265
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5492194
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1013619
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1263280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
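
The load commands in this listing all follow the pattern `huggingface:oscar/unshuffled_<variant>_<lang>`, where the variant is `original` or `deduplicated`. A small sketch of building that name programmatically follows. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and `load_oscar` assumes `tensorflow_datasets` (plus whatever backend it needs for `huggingface:` datasets) is installed.

```python
# The two variants that appear in this listing's config names.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang, variant="original"):
    """Build the TFDS name for one OSCAR language subset (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang, variant="original", split="train"):
    """Load one subset; imports TFDS lazily so the name helper works without it."""
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, variant), split=split)
```

For example, `oscar_tfds_name("tyv")` yields the same string used in the load command for this entry, and `load_oscar("tyv")` would request its 'train' split.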


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 27537
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1001
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3783
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46981781
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 563916
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7345075
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1485
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17957
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 603937
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 534016
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18174
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 185884
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5213
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3225
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 452
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14291
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36700
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
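
Every entry on this page passes `tfds.load` a name of the form `huggingface:oscar/<config>`. The sketch below builds that name from a configuration string and, optionally, loads the `'train'` split listed in the tables. The helper names `oscar_tfds_name` and `load_train_split` are assumptions made for illustration. The loading step also assumes that `tensorflow_datasets` is installed and that the installed version still resolves the `huggingface:` namespace.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS dataset name for an OSCAR configuration,
    following the naming pattern used by the commands on this page."""
    return f"huggingface:oscar/{config}"


def load_train_split(config: str):
    """Load the 'train' split of the given configuration.

    tensorflow_datasets is imported lazily so that the name helper
    above can be used without the library installed.
    """
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")


print(oscar_tfds_name("unshuffled_original_sh"))
# huggingface:oscar/unshuffled_original_sh
```

The example counts on this page range from a handful to hundreds of millions, so it may be worth trying `load_train_split` on one of the small configurations first.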


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 156
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17395625
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 89002
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18535253
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 12973467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14898250
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    924

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

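The load commands on this page all follow one naming pattern, with only the language code changing. A minimal sketch of that pattern, assuming `tensorflow_datasets` is installed and that the `huggingface:oscar/...` names resolve in your TFDS version; `oscar_tfds_name` and `load_oscar` are hypothetical helpers for illustration, not part of TFDS:

```python
def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS dataset name for a deduplicated OSCAR subset.

    The pattern is copied from the load commands on this page.
    """
    if not lang_code or not lang_code.replace("_", "").isalnum():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one subset (hypothetical wrapper around tfds.load)."""
    # Imported lazily so oscar_tfds_name works without TFDS installed.
    import tensorflow_datasets as tfds  # assumption: installed separately
    return tfds.load(oscar_tfds_name(lang_code), split=split)


# Usage (needs TFDS and network access, so it is left commented out):
# for example in load_oscar("kv").take(1):
#     print(example["id"], example["text"])
```

Keeping the name construction separate from the download makes the pattern easy to check without fetching any data.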

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    21735

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

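Every subset on this page declares the same two features: an `int64` `id` and a `string` `text`. A minimal sketch of checking a record against that schema, assuming the example has already been converted to a plain Python dict; `matches_oscar_features` is a hypothetical helper, not a TFDS or Hugging Face API:

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(record: dict) -> bool:
    """Return True if record has exactly an int64-range 'id' and a str 'text'."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        return False
    return INT64_MIN <= record_id <= INT64_MAX and isinstance(text, str)
```

A check like this is mainly useful when records pass through custom preprocessing before training.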

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    32652

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    25

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    299457

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    669

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    136639

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    55

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    20812149

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    44230

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    20682611

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    26920397

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    115954598

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }

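Subset sizes on this page range from a few dozen examples to more than a hundred million (the `ru` subset lists 115954598). A minimal sketch of capping how much of a large subset is requested, assuming the standard TFDS split-slicing syntax (`'train[:N]'`) also applies to these `huggingface:` datasets; `capped_split` and the default cap are hypothetical choices for illustration:

```python
def capped_split(num_examples: int, cap: int = 100_000, split: str = "train") -> str:
    """Return a TFDS split string that requests at most `cap` examples."""
    if num_examples < 0 or cap <= 0:
        raise ValueError("num_examples must be >= 0 and cap must be > 0")
    if num_examples <= cap:
        return split  # small subset: take everything
    return f"{split}[:{cap}]"  # large subset: take only the first `cap` examples


# Usage (needs TFDS and network access, so it is left commented out):
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru',
#                split=capped_split(115954598))
```

Slicing limits how many examples are read, but depending on the TFDS version the full subset may still be downloaded and prepared first.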

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    33925

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    886223

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    511

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    312644

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    294132

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    15503

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

    Split      Examples
    'train'    64

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9161
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
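
Every configuration on this page follows the same naming pattern and exposes the same two features, so loading and previewing one can be sketched generically. The helper names `oscar_config_name` and `preview` below are hypothetical; the sketch assumes `tensorflow_datasets` is installed and that the community dataset can be downloaded in the current environment.

```python
def oscar_config_name(variant: str, language: str) -> str:
    """Build a name such as 'huggingface:oscar/unshuffled_deduplicated_war'."""
    return f"huggingface:oscar/{variant}_{language}"


def preview(variant: str, language: str, n: int = 3) -> list:
    """Return the first `n` (id, text) pairs of the 'train' split."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_config_name(variant, language), split="train")
    rows = []
    for example in tfds.as_numpy(ds.take(n)):
        # String features arrive as bytes; decode them for display.
        rows.append((int(example["id"]), example["text"].decode("utf-8")))
    return rows


print(oscar_config_name("unshuffled_deduplicated", "war"))
```

Calling `preview("unshuffled_deduplicated", "war")` would trigger the actual download and preparation, so it is left out of the snippet.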


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32919
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 201117
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
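
The `unshuffled_original_af` count above can be compared with the `unshuffled_deduplicated_af` count of 130640 listed earlier on this page to estimate how much of the Afrikaans data survives deduplication. The following is a small arithmetic sketch; `retained_fraction` is a hypothetical helper, and the comparison assumes both counts refer to the same kind of example.

```python
def retained_fraction(original: int, deduplicated: int) -> float:
    """Fraction of examples that remain after deduplication."""
    if original <= 0:
        raise ValueError("original count must be positive")
    return deduplicated / original


ORIGINAL_AF = 201117      # 'train' size of unshuffled_original_af
DEDUPLICATED_AF = 130640  # 'train' size of unshuffled_deduplicated_af

fraction = retained_fraction(ORIGINAL_AF, DEDUPLICATED_AF)
print(f"{fraction:.1%} of the Afrikaans examples remain after deduplication")
```

By these counts roughly 65% of the examples are kept, that is, about a third are removed as duplicates.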


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16365602
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
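
With more than 16 million examples, reading only part of the 'train' split is often enough for a first look. TFDS accepts slice expressions such as `train[:1%]` in the `split` argument; the sketch below builds such an expression. The helper names `train_slice` and `load_sample` are hypothetical, and TFDS will most likely still need to download and prepare the whole configuration before the slice can be read.

```python
def train_slice(percent: int) -> str:
    """Build a TFDS split expression covering the first `percent` of 'train'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    return f"train[:{percent}%]"


def load_sample(percent: int = 1):
    """Load the leading slice of unshuffled_original_ar as a tf.data.Dataset."""
    # Imported lazily so the split helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(
        "huggingface:oscar/unshuffled_original_ar",
        split=train_slice(percent),
    )


print(train_slice(1))
```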


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 456
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10425596
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88199221
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8557453
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83223
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 640
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 582219
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 659430
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2638
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62721527
  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    524591

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    1581

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    146993

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    137

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    2977757

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    3212

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    395605

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    26598

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    1055

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    299938

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    5546211

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    127467

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    4599

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    41

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    22301

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    203082

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    672077

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    41986

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    6064129

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

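With roughly six million examples in the 'train' split above, it may be convenient to read only a prefix while experimenting. TFDS accepts slice expressions such as `train[:1000]` in the `split` argument of `tfds.load`; the sketch below only builds that string. The helper name `head_split` is hypothetical, and whether slicing reduces download or preparation time for this community dataset is an assumption to verify.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Return a TFDS split expression selecting the first n_examples of a split."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


# Possible usage (requires tensorflow_datasets; not executed here):
# ds = tfds.load('huggingface:oscar/unshuffled_original_th', split=head_split(1000))
```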

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split      Examples
'train'    135923

  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   638596

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
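
The load commands in this listing all follow one naming pattern, `huggingface:oscar/unshuffled_<variant>_<lang>`. As a minimal sketch, assuming that pattern holds for every configuration, a small helper can build the name from a language code. `oscar_tfds_name` and `load_oscar` are hypothetical names introduced here for illustration; they are not TFDS functions.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the dataset name used in this listing (assumed naming pattern)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False):
    """Hypothetical wrapper around tfds.load for one OSCAR configuration."""
    # Imported lazily so the name helper can be used without TensorFlow.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated))


print(oscar_tfds_name("ur"))
print(oscar_tfds_name("af", deduplicated=True))
```

Calling `load_oscar("ur")` would then be equivalent to the `tfds.load` command shown for that configuration, provided `tensorflow_datasets` is installed.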


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   3366

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
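
Every configuration in this listing declares the same two features, an `int64` `id` and a `string` `text`. The following is a minimal sketch of checking a record against that schema. It assumes the record has already been converted to plain Python values (for example, `text` decoded from bytes to `str`); `EXPECTED_FIELDS` and `check_record` are illustrative names, not part of TFDS.

```python
from typing import Any, Dict, List

# Assumed Python-side types for the two declared features.
EXPECTED_FIELDS = {"id": int, "text": str}


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


print(check_record({"id": 0, "text": "example"}))
print(check_record({"id": "0"}))
```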


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   39

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   11

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   455994980

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }
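
With 455,994,980 examples in its only split, this configuration is far larger than the others listed here. One way to inspect just the first few records is TFDS split slicing (for example `split='train[:1000]'`). The sketch below builds such a split string; `head_split` is a hypothetical helper. Whether slicing behaves the same for `huggingface:` community datasets is an assumption, and slicing limits what is read rather than necessarily what is downloaded or prepared.

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Build a TFDS split-slicing string selecting the first n_examples."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


print(head_split(1000))

# Usage sketch (requires tensorflow_datasets; not executed here):
# import tensorflow_datasets as tfds
# ds = tfds.load('huggingface:oscar/unshuffled_original_en',
#                split=head_split(1000))
```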


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   506883

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   7

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   544388

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   3808397

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   13

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   16236463

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   625673

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   1445

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   350363

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   1549

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   34807

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   52910

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   123

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   437871

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split     Examples
'train'   757

  • Features :
    {
        "id": {
            "dtype": "int64",
            "id": null,
            "_type": "Value"
        },
        "text": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        }
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

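As a minimal sketch of how the load command and the `id`/`text` features listed above fit together, the snippet below builds the TFDS name for an OSCAR config and reads a few examples. It assumes `tensorflow-datasets` is installed and that the `huggingface:` namespace additionally needs the Hugging Face `datasets` package; the helper names `oscar_tfds_name` and `preview` are hypothetical and not part of TFDS.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS dataset name for an OSCAR config such as 'unshuffled_original_my'."""
    return f"huggingface:oscar/{config}"


def preview(config: str, n: int = 3):
    """Load the 'train' split and return the first n examples as plain Python dicts.

    Hypothetical helper: tensorflow_datasets is imported lazily so that the
    name builder above can be used without it.
    """
    import tensorflow_datasets as tfds

    ds = tfds.load(oscar_tfds_name(config), split="train")
    examples = []
    for ex in tfds.as_numpy(ds.take(n)):
        # 'id' arrives as a NumPy int64 and 'text' as UTF-8 encoded bytes.
        examples.append({"id": int(ex["id"]), "text": ex["text"].decode("utf-8")})
    return examples
```

Calling `preview("unshuffled_original_my")` would download and prepare the config on first use, so it is best tried on one of the smaller configs first.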

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 35440972
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42114520
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 161836003
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

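With 161,836,003 examples in the 'train' split, it may be convenient to work with only a slice of it. The sketch below builds a split string in the TFDS split-slicing syntax (for example `train[:1000]` or `train[:1%]`) and passes it to `tfds.load`. The helper names `train_slice` and `load_subset` are hypothetical, and whether slicing avoids preparing the full corpus for this community dataset is an assumption that is not confirmed here.

```python
def train_slice(n_examples=None, percent=None) -> str:
    """Return a TFDS split string for the first n examples or first percent of 'train'."""
    if n_examples is not None and percent is not None:
        raise ValueError("pass either n_examples or percent, not both")
    if n_examples is not None:
        return f"train[:{n_examples}]"
    if percent is not None:
        return f"train[:{percent}%]"
    return "train"


def load_subset(config: str, n_examples=None, percent=None):
    """Hypothetical helper: load a slice of an OSCAR config's 'train' split."""
    import tensorflow_datasets as tfds  # imported lazily; requires tensorflow-datasets

    split = train_slice(n_examples=n_examples, percent=percent)
    return tfds.load(f"huggingface:oscar/{config}", split=split)
```

For instance, `load_subset("unshuffled_original_ru", n_examples=1000)` would request only the first 1000 examples of the split.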

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1746604
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 805
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 475703
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 458206
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22255
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9760
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59364
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}