


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  130640
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

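Every entry in this catalog uses the same command with a different language suffix, so the dataset name can be built from the language code. The following is a minimal sketch, not part of TFDS: `oscar_dataset_name` and `load_oscar` are hypothetical helper names introduced here for illustration. It assumes `tensorflow_datasets` is installed with Hugging Face support and that network access is available when `load_oscar` is called.

```python
# Prefix shared by every load command listed in this catalog.
OSCAR_PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_dataset_name(lang_code: str) -> str:
    """Build the TFDS dataset name for one OSCAR language subset.

    Hypothetical helper: it only concatenates the prefix used in this
    catalog with a language code such as "af" or "als".
    """
    return OSCAR_PREFIX + lang_code


def load_oscar(lang_code: str):
    """Load one OSCAR subset with the same call the catalog shows.

    The import is done lazily so that oscar_dataset_name() can be used
    without TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_dataset_name(lang_code))


if __name__ == "__main__":
    # Print the name only; calling load_oscar() would start a download.
    print(oscar_dataset_name("af"))
```

Called without a `split` argument, `tfds.load` returns a dictionary keyed by split name, so `load_oscar("af")["train"]` would be the single `'train'` split listed above.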

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  4518
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  79928
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  2025
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  5343
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  27050
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  43102
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9212
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  9985
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  307405
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  15762
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  36
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  26145
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  626796
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  98225
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  37
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1114481
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  702
  • Características :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
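
The load commands in this catalog all follow one naming pattern, with only the language code changing. A minimal sketch of that pattern, assuming a hypothetical helper `oscar_tfds_name` (it is not part of TFDS). The actual `tfds.load` call is left commented out because it needs `tensorflow_datasets` and network access:

```python
# Hypothetical helper (not part of TFDS): builds the dataset name used in
# this catalog's load commands from an OSCAR language code.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code: str) -> str:
    """Return the TFDS name of the deduplicated OSCAR subset for a language."""
    code = lang_code.strip().lower()
    if not code:
        raise ValueError("language code must not be empty")
    return PREFIX + code


name = oscar_tfds_name("bs")
print(name)  # huggingface:oscar/unshuffled_deduplicated_bs

# With tensorflow_datasets installed, the name would be passed on exactly as
# in the command above:
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```

Whether a given language code exists as a subset still has to be checked against the catalog; the helper only formats the string.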


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2984
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
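
The feature spec above describes each example as an integer `id` plus a `text` string. A minimal sketch of checking a record against that spec in plain Python. The `matches_spec` helper, the dtype-to-type mapping, and the sample record are illustrative assumptions, not part of TFDS or the dataset:

```python
import json

# Feature spec as listed in this catalog entry.
FEATURES = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the spec's dtype names to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_spec(record: dict) -> bool:
    """Check that a record has exactly the spec's fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in FEATURES.items()
    )


sample = {"id": 0, "text": "placeholder text"}  # made-up record
print(matches_spec(sample))       # True
print(matches_spec({"id": "0"}))  # False: the "text" field is missing
```

Examples loaded through TFDS arrive as tensors rather than Python values, so a real check would first convert them (an additional step not shown here).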


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10130
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 80
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1172041
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3398679
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1770
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2458067
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68210
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9006977
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 360
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14724
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4771098
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17024
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84752
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use o seguinte comando para carregar este conjunto de dados no TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Descrição :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  8203495
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
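
The feature specification shared by these entries lists two fields, an `int64` `id` and a `string` `text`. As a rough sketch of how such a specification can be used, the Python snippet below checks a plain record against it. The function `matches_features` and the dtype-to-Python mapping are assumptions made for this illustration; they are not part of TFDS or of the Hugging Face `datasets` library.

```python
import json

# The feature specification for these configs, written as complete JSON.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings in the spec to Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with compatible types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
ok = matches_features({"id": 0, "text": "example document"}, features)
bad = matches_features({"id": "0", "text": "example document"}, features)
```

Here `ok` is true because both fields are present with the expected types, while `bad` is false because `id` is given as a string.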


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  20661
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  68
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  12308039
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1909387
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  6582908
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  11
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  59448891
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3883
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  169834
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  3084
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  529
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  617
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  108346
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  29054
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  18808
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version: 1.0.0

  • Splits:

Split    Examples
'train'  1374
  • Features:
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 843195
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
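
Every load command on this page uses the same name pattern, 'huggingface:oscar/unshuffled_deduplicated_' followed by a language code. The sketch below builds that name from a code and wraps the call shown above. The helper names `oscar_config_name` and `load_oscar` are hypothetical, and whether the `huggingface:` namespace resolves depends on the installed TFDS version, so treat this as an illustration rather than a guaranteed recipe.

```python
# Prefix shared by every load command on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_config_name(lang):
    """Build the TFDS name for one language code (hypothetical helper)."""
    lang = lang.strip().lower()
    if not lang:
        raise ValueError("language code must not be empty")
    return PREFIX + lang


def load_oscar(lang, split="train"):
    """Load one config; 'train' is the only split listed on this page."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang), split=split)


print(oscar_config_name("lv"))
```

Calling `load_oscar("lv")` would then be equivalent to the command above with `split='train'` added.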


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 166
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 212556
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
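
The feature listing above says each example carries an int64 "id" and a string "text". As a rough illustration, the sketch below checks a record against that schema. It assumes the record has already been decoded into plain Python values (TFDS itself yields tensors or bytes), and the `PY_TYPES` mapping and `matches_schema` helper are hypothetical names introduced here.

```python
# Feature schema as listed on this page.
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

# Assumed mapping from schema dtype names to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_schema(record, features=FEATURES):
    """Return True if record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # An int64 field must fit in 64 signed bits.
        if expected is int and not (-2**63 <= value < 2**63):
            return False
    return True


print(matches_schema({"id": 0, "text": "example"}))
```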


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 58
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2126
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6485
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 67921
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 28522082
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
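
The train split sizes listed so far range from a single example (pam) to 28522082 (it), so it can help to filter configs by size before loading anything. The sketch below copies the counts from the listings above into a dictionary. The `configs_with_at_least` helper and its threshold are hypothetical choices for illustration.

```python
# Train split sizes copied from the listings above (language code -> examples).
TRAIN_EXAMPLES = {
    "lv": 843195,
    "min": 166,
    "mr": 212556,
    "mwl": 7,
    "nah": 58,
    "new": 2126,
    "oc": 6485,
    "pam": 1,
    "ps": 67921,
    "it": 28522082,
}


def configs_with_at_least(min_examples, sizes=TRAIN_EXAMPLES):
    """Return language codes with at least min_examples, largest first."""
    kept = [(lang, n) for lang, n in sizes.items() if n >= min_examples]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [lang for lang, _ in kept]


print(configs_with_at_least(10_000))
```

With a threshold of 10,000 examples, only it, lv, mr and ps remain from the ten configs in the dictionary.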


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 372158
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5044757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3675420
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 68
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1381
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 72
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13343
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 453904
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 183443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  5
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
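
The load command above follows one naming pattern across the configs on this page: only the trailing language code changes. A minimal sketch of wrapping that pattern, where the helper names `oscar_config_name` and `load_oscar` are hypothetical (not part of TFDS) and `split="train"` assumes the single split listed above:

```python
def oscar_config_name(lang_code: str) -> str:
    """Build the TFDS name for an OSCAR 'unshuffled_deduplicated' config."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one OSCAR config through TFDS.

    TFDS is imported lazily so that oscar_config_name() can be used
    without the tensorflow-datasets package installed.
    """
    import tensorflow_datasets as tfds  # assumption: package is installed

    return tfds.load(oscar_config_name(lang_code), split=split)
```

Calling `load_oscar("myv")` should then correspond to the command shown for that config, restricted to the 'train' split. It requires the tensorflow-datasets package and will presumably download data on first use.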


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  8714
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
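
The feature listing above describes each example as a 64-bit integer `id` plus a string `text`. A rough sketch of checking a record against that shape, assuming the record has already been converted to plain Python values (`int` and `str`); `EXPECTED_FEATURES` and `check_record` are illustrative names, not part of TFDS:

```python
# Python types assumed to correspond to the two listed features
# ("int64" -> int, "string" -> str) after decoding; this mapping is illustrative.
EXPECTED_FEATURES = {"id": int, "text": str}


def check_record(record: dict) -> list:
    """Return a list of problems found in one decoded example (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems
```

An empty result means the record carries both features with the expected types; anything else lists what is missing or mistyped.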


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  109118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2559
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2859
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  7121
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2820821
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  17610
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  645747
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  833101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  4694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  24
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  15074
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  677
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  2418
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  11014487
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split    Examples
'train'  56259
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Versão : 1.0.0

  • Divisões :

Dividir Exemplos
'train' 62398034
  • Características :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use o seguinte comando para carregar este conjunto de dados no TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11596446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6521169
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

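Every entry lists the same two features: an `int64` field `id` and a string field `text`. The following is a minimal sketch of a check for that shape, assuming records have already been converted to plain Python dicts. `is_valid_record` and the sample records are hypothetical, and TFDS itself may yield tensors or bytes rather than these types.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def is_valid_record(record: dict) -> bool:
    """Return True if record has exactly an int64-range 'id' and a str 'text'."""
    if set(record) != {"id", "text"}:
        return False
    record_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(record_id, bool) or not isinstance(record_id, int):
        return False
    return INT64_MIN <= record_id <= INT64_MAX and isinstance(text, str)


# Hypothetical records, for illustration only.
print(is_valid_record({"id": 0, "text": "Καλημέρα"}))
print(is_valid_record({"id": "0", "text": "Καλημέρα"}))
```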

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7782375
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9897709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 49
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

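The `train` split sizes of the seven deduplicated configs listed just above span several orders of magnitude, from tens of examples to tens of millions. The following is a minimal sketch that ranks them and flags the very small ones. The counts are copied from the entries above; the 1,000-example threshold is an arbitrary assumption chosen for illustration.

```python
# 'train' example counts for the unshuffled_deduplicated_* configs listed above.
train_examples = {
    "de": 62_398_034,
    "tr": 11_596_446,
    "el": 6_521_169,
    "uk": 7_782_375,
    "vi": 9_897_709,
    "wuu": 64,
    "yo": 49,
}

# Language codes ordered from largest to smallest split.
by_size = sorted(train_examples, key=train_examples.get, reverse=True)

# Configs that may be too small for most training uses (assumed threshold).
MIN_EXAMPLES = 1_000
tiny = [lang for lang in by_size if train_examples[lang] < MIN_EXAMPLES]

print(by_size)
print(tiny)
```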

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7324
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

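The entries come in two variants per language, `unshuffled_deduplicated_*` and `unshuffled_original_*`. The following is a minimal sketch that splits a dataset name into its variant and language code, assuming every name follows the `huggingface:oscar/unshuffled_<variant>_<language code>` pattern seen in these entries. `OscarConfig` and `parse_oscar_name` are hypothetical names, not part of TFDS.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    variant: str   # "deduplicated" or "original"
    language: str  # language code, e.g. "als"


def parse_oscar_name(name: str) -> OscarConfig:
    """Split a catalog name into variant and language code."""
    prefix = "huggingface:oscar/"
    if not name.startswith(prefix):
        raise ValueError(f"not an OSCAR catalog name: {name!r}")
    # Split at most twice: "unshuffled", the variant, then the language code.
    parts = name[len(prefix):].split("_", 2)
    if (
        len(parts) != 3
        or parts[0] != "unshuffled"
        or parts[1] not in {"deduplicated", "original"}
    ):
        raise ValueError(f"unexpected config name: {name!r}")
    return OscarConfig(variant=parts[1], language=parts[2])


print(parse_oscar_name("huggingface:oscar/unshuffled_original_als"))
```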

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 158113
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 912330
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1675515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2143
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4042
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20281
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 84
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2093621
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41708901
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2449
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6999
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
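
The load commands in this catalog all follow one naming pattern, so the dataset name can be assembled from the variant and the language code. The sketch below is a minimal illustration rather than part of the catalog: `oscar_tfds_name` and `load_oscar` are hypothetical helper names, the check that only `original` and `deduplicated` variants exist is an assumption based on the entries listed here, and `load_oscar` assumes `tensorflow_datasets` is installed (the `huggingface:` prefix may also require the Hugging Face `datasets` package).

```python
def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_ast'."""
    # Assumption: only these two variants appear in the catalog entries.
    if variant not in ("original", "deduplicated"):
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS.

    The import is deferred so the name helper works without TFDS installed.
    """
    import tensorflow_datasets as tfds  # assumption: available in the environment

    return tfds.load(oscar_tfds_name(variant, lang), split=split)
```

For instance, `oscar_tfds_name("original", "ast")` produces the same string that appears in the Asturian load command.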


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42551
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5869686
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6046
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4390754
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 103639
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56326016
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
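
Every config in this catalog declares the same two features: an `int64` `id` and a string `text`. As a rough illustration, the sketch below checks a plain Python record against that schema. `FEATURES` mirrors the feature listing, while `matches_schema` is a hypothetical helper and not a TFDS or Hugging Face API; records loaded through TFDS arrive as tensors, so this applies only to values already decoded into Python types.

```python
FEATURES = {
    "id": {"dtype": "int64", "id": None, "_type": "Value"},
    "text": {"dtype": "string", "id": None, "_type": "Value"},
}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_schema(record: dict) -> bool:
    """Return True if record has exactly the declared keys with compatible types."""
    if set(record) != set(FEATURES):
        return False
    for key, spec in FEATURES.items():
        value = record[key]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
    return True
```

A check like this can catch records whose `id` was read as a string or whose `text` field is missing before they reach a training pipeline.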


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7664010
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21018
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 121168
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5326443
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46493
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 321484
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 396093
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1578
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13704702
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
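
Split sizes in this catalog range from about a hundred examples to tens of millions, so it can help to cap how much of a large split is requested. The sketch below builds a split string using the TFDS slicing syntax (`'train[:N]'`). `capped_split` is a hypothetical helper, and whether slicing avoids downloading the full data for these Hugging Face-backed configs is an assumption worth verifying.

```python
def capped_split(num_examples: int, cap: int, split: str = "train") -> str:
    """Return the split name, or a 'split[:cap]' slice when the split exceeds cap."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    # Small splits are requested whole; large ones are truncated to the first `cap` examples.
    return split if num_examples <= cap else f"{split}[:{cap}]"
```

The resulting string would be passed as the `split` argument of `tfds.load`, for example `capped_split(13704702, 100000)` for a split with 13,704,702 examples.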


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33053
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 106
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3264660
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11197780
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 101
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
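
The feature specification lists two fields per example: `id` with dtype `int64` and `text` with dtype `string`. A minimal sketch of checking a plain Python record against that specification is shown here, using only the standard library. The function name `matches_features`, the dtype mapping, and the sample records are assumptions made for illustration.

```python
import json

# Feature specification as listed for each OSCAR config in this catalog.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            return False
        if not isinstance(value, DTYPE_TO_PYTHON[spec["dtype"]]):
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example document"}, features))
```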


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39496439
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
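
The 'train' split sizes in this catalog range from around a hundred examples to tens of millions, so it can help to cap how many examples are requested when experimenting. A minimal sketch is shown here that builds a TFDS split expression using the `train[:N]` slicing syntax. The helper name `choose_split` and the default cap are hypothetical, and whether the `huggingface:` loader avoids fetching the full config when a slice is requested is an assumption that the source does not confirm.

```python
def choose_split(total_examples: int, max_examples: int = 1000, split: str = "train") -> str:
    """Return a split expression selecting at most `max_examples` examples."""
    if max_examples <= 0:
        raise ValueError("max_examples must be positive")
    if total_examples <= max_examples:
        # Small configs can be requested whole.
        return split
    # TFDS slicing syntax: the first `max_examples` examples of the split.
    return f"{split}[:{max_examples}]"


# Example counts copied from two entries in this catalog.
print(choose_split(101))
print(choose_split(39496439))
```

The resulting string could be passed as the `split` argument of `tfds.load`.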


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 338073
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1377
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 86561
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 118
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1737411
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2515
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 197878
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16383
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 917
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 219334
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3229940
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 87235
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8555
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 120684
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 461598
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

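The load commands in this catalog all follow one naming pattern: `huggingface:oscar/`, then a variant (`unshuffled_deduplicated` or `unshuffled_original`), an underscore, and a language code. As a minimal sketch, the names can be built programmatically. The helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical and not part of TFDS; actually loading a dataset assumes `tensorflow-datasets` is installed and the data can be downloaded.

```python
# Hypothetical helpers; only the 'huggingface:oscar/<config>' naming
# pattern itself comes from this catalog.
VARIANTS = ("unshuffled_deduplicated", "unshuffled_original")


def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build the TFDS name for one OSCAR configuration."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must be non-empty")
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar_train(variant: str, lang: str):
    """Load the 'train' split of one configuration (imports TFDS lazily)."""
    import tensorflow_datasets as tfds  # assumption: package is installed

    return tfds.load(oscar_tfds_name(variant, lang), split="train")


print(oscar_tfds_name("unshuffled_deduplicated", "sq"))
# prints: huggingface:oscar/unshuffled_deduplicated_sq
```

Keeping the name construction separate from the load call makes it possible to check a configuration name without downloading anything.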

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 24803
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3749826
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 82738
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 428674
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3317
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83663
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

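Every configuration in this catalog lists the same two features: an `int64` field `id` and a `string` field `text`. A minimal sketch of checking one record against that shape follows. It assumes records have already been converted to plain Python values (an `int` and a `str`), which is not how TFDS yields them by default; `check_record` is a hypothetical helper, not a TFDS or Hugging Face function.

```python
from typing import Any, Mapping


def check_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has an int64-range 'id' and a str 'text'."""
    rec_id = record.get("id")
    text = record.get("text")
    # bool is a subclass of int in Python, so exclude it explicitly.
    id_ok = isinstance(rec_id, int) and not isinstance(rec_id, bool)
    # Values outside the signed 64-bit range cannot be stored as int64.
    id_ok = id_ok and -(2**63) <= rec_id < 2**63
    return id_ok and isinstance(text, str)


print(check_record({"id": 0, "text": "example"}))
# prints: True
```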

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14985
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15446
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 586031
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26795
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 56248
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 157698
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 65
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 96742378
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5799
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 240691
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7959
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1040
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 694
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 832
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 159363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46535
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 94588
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1401
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1593820
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 220
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 326804
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 61
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4696
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10709
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 98216
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9387265
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5492194
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1013619
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
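
The load commands in this listing all follow one naming pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`. The sketch below builds that name and passes it to `tfds.load`. It assumes `tensorflow_datasets` is installed with support for Hugging Face community datasets; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical and exist only for illustration.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name of an unshuffled OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False):
    """Load the 'train' split, the only split these entries list."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split="train")


print(oscar_tfds_name("sr"))
```

Calling `load_oscar("sr")` downloads the data on first use, so for large configs it may be worth checking the example count in the table first.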


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1263280
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6456
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 27537
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1001
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3783
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 46981781
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 563916
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7345075
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1485
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17957
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 603937
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 534016
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18174
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 185884
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5213
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3225
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
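
The configuration names on this page follow a visible pattern: `huggingface:oscar/`, then `unshuffled_original` or `unshuffled_deduplicated`, then a language code. As a rough sketch, the hypothetical helpers below (`oscar_tfds_name` and `load_oscar` are illustrative names, not TFDS functions) build that name and pass it to `tfds.load`. Loading assumes `tensorflow_datasets` is installed and that the requested configuration can be downloaded; some configurations listed here contain hundreds of millions of examples.

```python
def oscar_tfds_name(language: str, deduplicated: bool = False) -> str:
    """Build the dataset name in the form used by the load commands on this page."""
    variant = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"huggingface:oscar/{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR configuration through TFDS (downloads data on first use)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)
```

For instance, `load_oscar("so")` would request the `unshuffled_original_so` configuration, which is among the smallest listed here.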


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 452
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14291
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 36700
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 156
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 17395625
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 89002
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 18535253
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 12973467
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 14898250
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 214
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 60137667
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 304230423
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 256513
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 284320
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2375030
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9948521
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 389515
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
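
The feature listing above describes each record as an `int64` `id` plus a `string` `text`. The following sketch shows one way to turn a loaded example into a typed Python object; the `OscarRecord` class and `to_record` function are illustrative names, not part of TFDS or OSCAR, and the sketch assumes string features may arrive as UTF-8 bytes, as they usually do when iterating with `tfds.as_numpy`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OscarRecord:
    """One OSCAR example with the two documented features."""
    id: int
    text: str


def to_record(example) -> OscarRecord:
    """Convert a mapping with 'id' and 'text' keys into an OscarRecord."""
    text = example["text"]
    if isinstance(text, bytes):
        # String features are commonly delivered as raw UTF-8 bytes.
        text = text.decode("utf-8")
    return OscarRecord(id=int(example["id"]), text=text)
```

Keeping the conversion in one place means downstream code can rely on `text` always being a Python `str`.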


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1163
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 251064
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 924
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21735
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32652
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 25
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299457
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 669
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 136639
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 55
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20812149
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44230
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 20682611
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26920397
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 115954598
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
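
The subsets listed here differ in size by several orders of magnitude (25 examples for `mai`, 115954598 for `ru`), so it can help to cap how much of a split is read. TFDS split strings accept absolute slices such as `train[:1000]`; the sketch below builds such a string from the example counts shown in the split tables. The `capped_split` helper and its `max_examples` parameter are hypothetical names, and slicing limits what is read, not necessarily what is downloaded and prepared.

```python
def capped_split(num_examples: int, max_examples: int, split: str = "train") -> str:
    """Return a TFDS split string that reads at most max_examples records."""
    if num_examples <= max_examples:
        # Small subsets fit within the budget, so use the whole split.
        return split
    # Larger subsets are truncated with TFDS's absolute slicing syntax.
    return f"{split}[:{max_examples}]"
```

The returned string would be passed as the `split` argument, for example `tfds.load('huggingface:oscar/unshuffled_deduplicated_ru', split=capped_split(115954598, 100000))`.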


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 33925
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 886223
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 511
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 312644
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
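
The load commands on this page all follow the same naming pattern, so the name can be built from the configuration string. The following is a minimal sketch, not part of the catalog itself: the helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical, and the two accepted prefixes are an assumption taken from the configuration names listed here.

```python
# Hypothetical helpers; only tfds.load(...) and the 'huggingface:oscar/...'
# names come from this page.
OSCAR_PREFIXES = ("unshuffled_deduplicated_", "unshuffled_original_")


def oscar_tfds_name(config: str) -> str:
    """Return the TFDS name for an OSCAR configuration string."""
    # Assumption: every configuration starts with one of the two prefixes above.
    if not config.startswith(OSCAR_PREFIXES):
        raise ValueError(f"unexpected OSCAR configuration: {config!r}")
    return f"huggingface:oscar/{config}"


def load_oscar_train(config: str):
    """Load the 'train' split of one OSCAR configuration through TFDS."""
    # Imported lazily so that oscar_tfds_name works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")
```

Calling `load_oscar_train('unshuffled_deduplicated_te')` should be equivalent to the command above with `split='train'` added. Each example is expected to expose the two features listed for the dataset, an int64 `id` and a string `text`. Large configurations such as `unshuffled_original_de` contain over a hundred million examples, so a download may take considerable time and disk space.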


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 294132
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 15503
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 64
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9161
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 32919
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 201117
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16365602
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 456
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 336
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 37085
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 21001388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 104913504
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 10425596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 88199221
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 8557453
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 83223
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 640
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 582219
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
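
The load commands in this catalog all follow one naming pattern, so a script can build the name from a language code. The following is a minimal sketch, not part of TFDS: the helper names `oscar_tfds_name` and `load_oscar_train` are hypothetical, and it assumes that `tensorflow_datasets` is installed and that the `huggingface:` namespace is available in your installation.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for one OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar_train(lang: str, deduplicated: bool = False):
    """Load the 'train' split, the only split listed for these configs."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split="train")
```

For example, `load_oscar_train("hr")` would issue the same `tfds.load` call as the command shown for the `hr` config, restricted to the `'train'` split.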


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 659430
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
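
Every config in this listing declares the same two features: an `int64` `id` and a string `text`. As a rough sketch of what that schema means for a single record, the check below validates one example. It assumes the example has already been converted to plain Python values (an `int` and a decoded `str`), which is not what TFDS yields by default; the function name `matches_oscar_features` is hypothetical.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(example: dict) -> bool:
    """Return True if `example` has exactly an int64 `id` and a string `text`."""
    if set(example) != {"id", "text"}:
        return False
    id_value, text_value = example["id"], example["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(id_value, bool) or not isinstance(id_value, int):
        return False
    return INT64_MIN <= id_value <= INT64_MAX and isinstance(text_value, str)
```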


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2638
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 62721527
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
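
With 62,721,527 training examples, this config is far larger than most others listed here, so it may be convenient to read only a slice of `'train'`. The sketch below builds a split specification in the slicing syntax that TFDS documents for `tfds.load`. The helper name `train_slice` is hypothetical, and whether the `huggingface:` community wrapper honors split slicing is an assumption that should be checked against your TFDS version.

```python
def train_slice(percent: int) -> str:
    """Return a TFDS split spec for the first `percent` percent of 'train'."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"train[:{percent}%]"
```

Under those assumptions, `tfds.load('huggingface:oscar/unshuffled_original_ja', split=train_slice(1))` would request roughly the first 1% of the split.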


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 524591
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1581
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 146993
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 137
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 2977757
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3212
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 395605
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 26598
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1055
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 299938
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 5546211
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 127467
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 4599
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22301
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 203082
  • Features :
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 672077
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
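
The load commands on this page all follow one naming pattern, so a small helper can build the name from a language code. The sketch below is illustrative only: the helper names `oscar_config_name` and `load_oscar` are hypothetical, and the pattern is inferred from the config names listed here rather than from any official API. Actually loading data assumes `tensorflow_datasets` is installed and the `huggingface:` namespace is reachable.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS dataset name for one OSCAR language config.

    The naming pattern is an assumption inferred from the names on this page.
    """
    variant = "unshuffled_deduplicated" if deduplicated else "unshuffled_original"
    return f"huggingface:oscar/{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR config (hypothetical convenience wrapper around tfds.load)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)


if __name__ == "__main__":
    print(oscar_config_name("sq"))  # huggingface:oscar/unshuffled_original_sq
```

Each example in the returned dataset is expected to carry the two features listed above, an `int64` `id` and a string `text`.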


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 41986
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 6064129
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 135923
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 638596
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3366
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 39
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 11
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 455994980
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
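
With roughly 456 million examples in its 'train' split, this config is far larger than most others on the page, so it can help to inspect a small slice first. The sketch below relies on the TFDS split-slicing syntax (`'train[:N]'`). The helper names `subset_split` and `load_oscar_sample` are hypothetical. Slicing limits how many examples are read, but it may not avoid downloading and preparing the full config.

```python
def subset_split(split: str = "train", n_examples: int = 1000) -> str:
    """Build a TFDS split-slicing string such as 'train[:1000]'."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_oscar_sample(config: str = "unshuffled_original_en", n_examples: int = 1000):
    """Load the first n_examples of one OSCAR config (hypothetical helper)."""
    # Imported lazily; requires tensorflow_datasets and network access.
    import tensorflow_datasets as tfds

    return tfds.load(
        f"huggingface:oscar/{config}",
        split=subset_split("train", n_examples),
    )


if __name__ == "__main__":
    print(subset_split("train", 1000))  # train[:1000]
```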


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 506883
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 7
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 544388
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 3808397
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 13
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 16236463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 625673
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1445
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 350363
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1549
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34807
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 52910
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 123
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 437871
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 757
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 232329
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 34682142
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59463
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 35440972
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 42114520
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 161836003
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 44280
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 1746604
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 805
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 475703
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 458206
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 22255
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 73
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 9760
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}


Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Split Examples
'train' 59364
  • Features :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}