References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
2984 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
8203495 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
20661 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
12308039 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1909387 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
6582908 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
59448891 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
3883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
169834 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
3084 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
529 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
108346 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
29054 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
18808 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1374 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
843195 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
166 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
212556 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
58 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
2126 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
6485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
67921 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
28522082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
372158 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
5044757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
17 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
3675420 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
1381 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
72 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
13343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
453904 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
183443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
5 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
8714 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' |
109118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of