References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
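The feature specification above can be read as a small schema: each record has an integer `id` and a string `text`. The following is a minimal sketch, using only the Python standard library, of how a record might be checked against that specification. The dtype-to-Python mapping, the `matches_spec` helper, and the sample record are assumptions made for illustration; they are not part of TFDS or of the dataset itself.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names in the spec to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_spec(record, spec):
    """Return True if record has exactly the spec's fields with matching types."""
    if set(record) != set(spec):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[field["dtype"]])
        for name, field in spec.items()
    )


spec = json.loads(FEATURES_JSON)

# Hypothetical record, made up for this example (not taken from the corpus).
sample = {"id": 0, "text": "hypothetical example text"}
print(matches_spec(sample, spec))
```

Every configuration in this list reports the same two fields, so the same check would apply to each of them under these assumptions.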
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
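The configuration names in this list all follow the pattern `unshuffled_deduplicated_<language code>`, and the TFDS name adds the `huggingface:oscar/` prefix. The sketch below builds that name from a language code. The helper `oscar_tfds_name` is hypothetical, and it does not verify that a configuration for the given code actually exists.

```python
# Prefix taken from the load commands shown in this listing.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS dataset name for a language code (hypothetical helper)."""
    code = lang_code.strip().lower()
    if not code.isalpha():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return PREFIX + code


# Language codes that appear in this listing.
for code in ["af", "als", "arz"]:
    print(oscar_tfds_name(code))
```

The returned string is what the load commands in this listing pass to `tfds.load`.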
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
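Every entry on this page follows the same naming pattern, `huggingface:oscar/unshuffled_deduplicated_<code>`. A small sketch that builds the name from a language code; `oscar_tfds_name` is a hypothetical helper, it assumes purely alphabetic codes like those in the entries shown here, and the load call itself is left commented out because it assumes `tensorflow_datasets` is installed and the download succeeds:

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name for a deduplicated OSCAR subset (hypothetical helper)."""
    code = lang_code.strip().lower()
    # Assumption: language codes on this page are purely alphabetic (e.g. "bs", "ckb").
    if not code.isalpha():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{code}"


name = oscar_tfds_name("bs")
print(name)  # huggingface:oscar/unshuffled_deduplicated_bs

# With tensorflow_datasets installed, the name can be passed to tfds.load:
# import tensorflow_datasets as tfds
# ds = tfds.load(name, split="train")
```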
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contain material that you own and that should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, and provide information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8203495 |
- Caratteristiche :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
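Every configuration on this page is loaded with a name of the form `huggingface:oscar/unshuffled_deduplicated_<code>`. The sketch below builds that name from a language code; the helper names `oscar_config_name` and `load_oscar` are hypothetical, and the loader assumes that `tensorflow_datasets` is installed and that the configuration is reachable from your environment.

```python
def oscar_config_name(lang_code, prefix="unshuffled_deduplicated"):
    """Build the TFDS name used on this page for a given language code."""
    code = lang_code.strip().lower()
    if not code or not code.isalpha():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"huggingface:oscar/{prefix}_{code}"


def load_oscar(lang_code, split="train"):
    """Load one configuration; the import is lazy so name building works without TFDS."""
    import tensorflow_datasets as tfds  # assumption: installed separately

    return tfds.load(oscar_config_name(lang_code), split=split)
```

For example, `oscar_config_name("fa")` produces the same string that appears in the load command for this configuration.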
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20661 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 68 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 12308039 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1909387 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6582908 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59448891 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
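With 59448891 examples in 'train', this configuration is large enough that loading only a prefix of the split may be more practical for a first look. TFDS accepts slice expressions such as `train[:1000]` as the `split` argument. The sketch below only builds that string; the helper name `head_split` is hypothetical, and whether slicing behaves identically for `huggingface:` datasets is an assumption worth verifying.

```python
def head_split(n_examples, split="train"):
    """Return a TFDS slice expression selecting the first n_examples of a split."""
    if not isinstance(n_examples, int) or n_examples <= 0:
        raise ValueError("n_examples must be a positive integer")
    return f"{split}[:{n_examples}]"


# Intended use (not executed here, assumes tensorflow_datasets is installed):
#   ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr',
#                  split=head_split(1000))
```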
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3883 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 169834 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3084 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 529 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 617 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 617 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 108346 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 29054 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18808 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1374 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 843195 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, along with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 166 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
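The feature listing above describes each record as an integer `id` and a string `text`. As a minimal sketch of how that schema could be checked against a record, the snippet below parses the listing and compares types. The `DTYPE_TO_PYTHON` mapping, the `matches_schema` helper, and the sample record are assumptions made for illustration; they are not part of TFDS or of the OSCAR release.

```python
import json

# Feature schema copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from schema dtypes to Python types (illustrative only).
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_schema(record, features):
    """Return True if the record has exactly the schema's keys with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
sample = {"id": 0, "text": "example document"}  # hypothetical record, not real data
print(matches_schema(sample, features))
```

Because every configuration on this page lists the same two features, the same check would apply to any of them.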
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 212556 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 58 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2126 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6485 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 67921 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 28522082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
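Configurations on this page range from a single example to tens of millions, so it can be convenient to build the configuration name from a language code and to read only a slice of the 'train' split. The sketch below assumes the naming pattern visible in the load commands on this page; `oscar_config_name` and `load_slice` are hypothetical helpers, not TFDS functions. The percent-slice string passed as `split` follows the TFDS split slicing syntax. Selecting a slice limits what is read, but the full configuration may still be downloaded and prepared first.

```python
def oscar_config_name(lang, prefix="unshuffled_deduplicated"):
    """Build the TFDS name used on this page for a given language code."""
    return f"huggingface:oscar/{prefix}_{lang}"


def train_slice(percent):
    """Build a split string selecting the first `percent` percent of 'train'."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"train[:{percent}%]"


def load_slice(lang, percent=1):
    """Load part of one configuration (requires tensorflow_datasets and network access)."""
    import tensorflow_datasets as tfds  # imported lazily so the helpers above work without it

    return tfds.load(oscar_config_name(lang), split=train_slice(percent))


print(oscar_config_name("it"))
print(train_slice(1))
```

Calling `load_slice("it")` would then request roughly the first one percent of the Italian configuration, under the assumptions stated above.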
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 372158 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5044757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3675420 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 68 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1381 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 72 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13343 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 453904 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 183443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8714 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 109118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
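The load commands in this catalog differ only in the language suffix of the config name. A minimal sketch of building that name programmatically follows; the helpers `oscar_tfds_name` and `load_oscar` are hypothetical, and the `tensorflow_datasets` import is deferred so that the name helper works without the library installed:

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name for an OSCAR deduplicated config, e.g. 'nn' or 'nds'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one config; assumes tensorflow_datasets is installed and can reach the data."""
    import tensorflow_datasets as tfds  # deferred import, only needed when loading

    return tfds.load(oscar_tfds_name(lang), split=split)


print(oscar_tfds_name("nn"))
```

Calling `load_oscar("nn")` would then be equivalent to the command shown for `unshuffled_deduplicated_nn`, with the `train` split selected explicitly.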
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2559 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2859 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7121 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2820821 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17610 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 645747 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 833101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15074 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 677 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2418 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11014487 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56259 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62398034 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
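The `train` split sizes listed in this catalog span several orders of magnitude, from 42 examples for `so` to 62398034 for `de`. As a rough sketch, assuming the TFDS split slicing syntax (`train[:N]`) is acceptable for a quick experiment, a capped split string could be derived from the listed example count; the helper `capped_split` and the cap value are hypothetical:

```python
def capped_split(num_examples: int, cap: int = 10_000) -> str:
    """Return a TFDS split string that reads at most `cap` examples from 'train'."""
    if cap <= 0:
        raise ValueError("cap must be positive")
    # Small configs are read whole; large ones are sliced to the first `cap` examples.
    return "train" if num_examples <= cap else f"train[:{cap}]"


# Example counts taken from the split tables in this catalog.
print(capped_split(42))          # 'so' config
print(capped_split(62_398_034))  # 'de' config
```

The resulting string would be passed as the `split` argument of `tfds.load`. Whether slicing avoids downloading the full config depends on how the dataset is prepared, so this only limits what is read, not necessarily what is fetched.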
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11596446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6521169 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
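The configuration names in this listing follow a visible pattern: `unshuffled_deduplicated_<lang>` or `unshuffled_original_<lang>`, prefixed with `huggingface:oscar/` when passed to `tfds.load`. A minimal sketch of building that name is given here. The helper `oscar_tfds_name` is hypothetical (it is not part of TFDS), and the commented-out load call assumes `tensorflow_datasets` is installed.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = True) -> str:
    """Build the TFDS name for an OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_tfds_name("el")
print(name)

# With tensorflow_datasets installed, the name can be passed on unchanged:
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```

The helper only formats a string; it does not check that the language code actually exists in the catalog.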
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7782375 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
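Every configuration in this listing declares the same two features, an `int64` `id` and a `string` `text`. As a rough illustration, a record can be checked against that specification in plain Python. The function `matches_features` and the sample records are assumptions made for this sketch, not part of TFDS or the OSCAR tooling.

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def matches_features(record: dict) -> bool:
    """Return True if `record` has exactly the declared fields and types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False  # dtype not covered by this sketch
    return True


# Made-up records for illustration only.
print(matches_features({"id": 0, "text": "example text"}))
print(matches_features({"id": "0", "text": "example text"}))
```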
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9897709 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
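The split tables in this listing are simple pipe-delimited rows, so the example counts can be pulled out with a few lines of Python. The sketch below is one possible way to do it; `parse_split_table` is a hypothetical helper, and the rows are copied from the `unshuffled_deduplicated_vi` entry.

```python
def parse_split_table(lines):
    """Parse rows like "'train' | 9897709 |" into {split_name: example_count}."""
    splits = {}
    for line in lines:
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, count = cells
        if not count.isdigit():
            continue  # header or separator row
        splits[name.strip("'")] = int(count)
    return splits


table = [
    "Split | Examples |",
    "---|---|",
    "'train' | 9897709 |",
]
print(parse_split_table(table))
```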
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 64 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 49 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7324 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 158113 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 912330 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1675515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2143 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4042 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20281 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 84 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2093621 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41708901 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2449 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6999 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42551 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5869686 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6046 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4390754 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 103639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56326016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
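With 56326016 examples in 'train', it can be convenient to read only part of the split. The sketch below uses the TFDS split-slicing syntax (`'train[:1%]'`); the helper names `train_slice` and `load_slice` are hypothetical, and `load_slice` assumes `tensorflow_datasets` is installed.

```python
def train_slice(percent: int) -> str:
    """Build a TFDS split string covering the first `percent` of 'train'."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"train[:{percent}%]"


def load_slice(name: str, percent: int):
    """Load the first `percent` of the 'train' split of dataset `name`."""
    # Imported lazily so train_slice can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(name, split=train_slice(percent))
```

A call such as `load_slice('huggingface:oscar/unshuffled_deduplicated_es', 1)` would then request the first 1% of the split.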
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7664010 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21018 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 121168 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5326443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46493 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 321484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 396093 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1578 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13704702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33053 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 106 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3264660 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11197780 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 39496439 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 338073 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
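The features block above describes each example as an int64 `id` plus a string `text`. As a rough illustration, a decoded example could be checked against that schema in plain Python as follows. The function name `matches_oscar_features` is hypothetical, and the check assumes the example has already been converted to native Python types.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_oscar_features(example: dict) -> bool:
    """Check a decoded example against the documented features: int64 'id', string 'text'."""
    if set(example) != {"id", "text"}:
        return False
    id_value, text = example["id"], example["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(id_value, bool) or not isinstance(id_value, int):
        return False
    # The id must fit in a signed 64-bit integer and the text must be a str.
    return INT64_MIN <= id_value <= INT64_MAX and isinstance(text, str)
```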
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1377 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 86561 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1737411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 197878 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 16383 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 917 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 219334 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3229940 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 87235 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8555 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 120684 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 461598 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24803 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3749826 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82738 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 428674 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
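Every entry in this catalog follows the same pattern: a config name, a `huggingface:oscar/<config>` load string, and a two-field feature layout. The sketch below captures that pattern in plain Python. The helper names (`oscar_tfds_name`, `matches_features`, `load_config`) are hypothetical and not part of TFDS, and `matches_features` assumes the example has already been converted to plain Python values rather than tensors.

```python
from typing import Any, Dict

# Feature layout copied from the feature blocks in this catalog.
OSCAR_FEATURES = {"id": "int64", "text": "string"}


def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config such as 'unshuffled_deduplicated_ur'."""
    return f"huggingface:oscar/{config}"


def matches_features(example: Dict[str, Any]) -> bool:
    """Return True if the example has exactly an integer 'id' and a string 'text'."""
    return (
        set(example) == set(OSCAR_FEATURES)
        and isinstance(example["id"], int)
        and not isinstance(example["id"], bool)  # bool is a subclass of int
        and isinstance(example["text"], str)
    )


def load_config(config: str):
    """Load one config; requires tensorflow_datasets and network access."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config))
```

For example, `oscar_tfds_name("unshuffled_deduplicated_ur")` yields the same string that appears in the load command for that entry.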
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3317 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83663 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14985 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15446 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 586031 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26795 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56248 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 157698 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 65 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 96742378 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
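The 'train' split sizes in this catalog range from a handful of examples to 96742378 for unshuffled_original_fr, so it can be convenient to read only a prefix of the larger configs. The sketch below builds a TFDS split string using the slicing syntax (for example `train[:10000]`). The helper name `capped_split` is hypothetical, and slicing limits what is read; TFDS may still download and prepare the full config first.

```python
def capped_split(total_examples: int, max_examples: int, split: str = "train") -> str:
    """Return a TFDS split string that reads at most max_examples examples.

    total_examples is the count listed in the catalog table for the config.
    """
    if total_examples < 0 or max_examples <= 0:
        raise ValueError("total_examples must be >= 0 and max_examples must be > 0")
    if total_examples <= max_examples:
        # The whole split already fits under the cap, so no slice is needed.
        return split
    return f"{split}[:{max_examples}]"
```

The result can be passed as the `split` argument, as in `tfds.load('huggingface:oscar/unshuffled_original_fr', split=capped_split(96742378, 10000))`.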
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5799 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 240691 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7959 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1040 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 832 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 159363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
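The load commands in this listing all follow the pattern `huggingface:oscar/<config>`. The sketch below builds that name from a config string and shows one way to request only part of the `train` split. The helpers `oscar_tfds_name` and `load_sample` are hypothetical, and the sketch assumes that the TFDS split-slicing syntax (`train[:1%]`) also works for datasets loaded through the `huggingface:` prefix.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name used by the load commands in this listing."""
    return f"huggingface:oscar/{config}"


def load_sample(config: str, percent: int = 1):
    """Load the first `percent` percent of the 'train' split (needs network access)."""
    # Imported lazily so the name helper works without TensorFlow Datasets installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split=f"train[:{percent}%]")


print(oscar_tfds_name("unshuffled_original_km"))
```

Calling `load_sample("unshuffled_original_km")` downloads data, so it is left out of the runnable part of the sketch.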
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46535 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 94588 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1401 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1593820 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 220 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 326804 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 61 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4696 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98216 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9387265 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5492194 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1013619 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1263280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27537 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
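The load commands on this page follow a visible naming pattern: the prefix `huggingface:oscar/`, then the variant (`unshuffled_original` here), then a language code. A minimal sketch of that pattern, assuming a hypothetical helper `oscar_tfds_name` that is not part of TFDS; the actual `tfds.load` call is left commented out because it needs `tensorflow_datasets` installed and downloads data.

```python
def oscar_tfds_name(lang: str, variant: str = "unshuffled_original") -> str:
    """Build the TFDS name string for an OSCAR config.

    Hypothetical helper for illustration only; it is not a TFDS API.
    """
    if not lang or not lang.isalpha():
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"huggingface:oscar/{variant}_{lang.lower()}"


name = oscar_tfds_name("uz")
print(name)

# With tensorflow_datasets installed, the name can be passed to tfds.load:
# import tensorflow_datasets as tfds
# ds = tfds.load(name, split="train")
```

Building the name in one place can help when iterating over several language codes from this listing, since only the suffix changes between entries.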
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1001 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
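Every entry on this page lists the same two features: an `int64` field `id` and a `string` field `text`. A rough sketch of checking a record against that listing, assuming a hand-written mapping `PY_TYPES` from the dtype strings to Python types and a hypothetical helper `matches_features`; neither comes from TFDS or the Hugging Face library.

```python
import json

# The feature listing as it appears in the entries on this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from dtype strings to Python types (illustrative only).
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[key], PY_TYPES[spec["dtype"]])
        for key, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))
```

The check is intentionally strict about field names, so a record with a missing or extra key is rejected rather than silently accepted.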
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3783 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46981781 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 563916 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7345075 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 203 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1485 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 88 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17957 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 603937 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 534016 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18174 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 185884 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5213 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3225 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 452 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14291 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36700 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 156 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17395625 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 89002 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18535253 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 12973467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14898250 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 60137667 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 304230423 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
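The configuration names in this listing follow the pattern `unshuffled_<variant>_<language code>`, where the variant is `original` or `deduplicated`. A minimal sketch of building the name string used in the load commands; `VARIANTS` and `oscar_tfds_name` are hypothetical names introduced here and are not part of TFDS:

```python
# The two variants that appear in the configuration names of this listing.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(language, variant="deduplicated"):
    """Build the name string passed to tfds.load for one OSCAR configuration."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


print(oscar_tfds_name("en"))
print(oscar_tfds_name("zh", variant="original"))
```

The resulting string can then be passed to `tfds.load` in the same way as the literal names shown in each entry.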
unshuffled_deduplicated_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 256513 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 284320 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2375030 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9948521 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 389515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1163 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 251064 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 924 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21735 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32652 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
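The feature specification above describes each record as an integer `id` plus a `text` string. As a minimal sketch (the `OscarRecord` class and its `from_dict` helper are hypothetical names for illustration, not part of TFDS or the OSCAR tooling), a record with that shape could be modeled and checked in plain Python like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OscarRecord:
    """Hypothetical container mirroring the two listed features."""
    id: int    # "dtype": "int64"
    text: str  # "dtype": "string"

    @classmethod
    def from_dict(cls, raw: dict) -> "OscarRecord":
        # Reject records that do not match the documented schema.
        if set(raw) != {"id", "text"}:
            raise ValueError(f"unexpected keys: {sorted(raw)}")
        if isinstance(raw["id"], bool) or not isinstance(raw["id"], int):
            raise TypeError("id must be an integer")
        if not isinstance(raw["text"], str):
            raise TypeError("text must be a string")
        return cls(id=raw["id"], text=raw["text"])


# Made-up example values, not taken from the corpus.
record = OscarRecord.from_dict({"id": 0, "text": "example document"})
```

A check like this may be useful when records are exported to another format and the two-field layout needs to be preserved.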
unshuffled_deduplicated_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 25 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
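The load commands on this page all follow one naming pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`, where the variant is `deduplicated` or `original`. The helper below is a small sketch of that pattern. `oscar_tfds_name` is a hypothetical function that only builds the string and does not check that the configuration exists. The commented-out call assumes `tensorflow_datasets` is installed and imported as `tfds`.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = True) -> str:
    """Build the dataset name string used in the load commands on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


name = oscar_tfds_name("mai")
# import tensorflow_datasets as tfds
# ds = tfds.load(name)
```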
unshuffled_deduplicated_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299457 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 669 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 136639 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 55 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20812149 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 44230 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20682611 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26920397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 115954598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
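Split counts on this page range from a few dozen examples to over a hundred million, and the raw digits are easy to misread. As a small illustration, the sketch below parses one row of a split table and prints the count with thousands separators. `parse_split_row` is a hypothetical helper written for this page's `'name' | count |` row layout, not a TFDS function.

```python
from typing import Tuple


def parse_split_row(row: str) -> Tuple[str, int]:
    """Parse a row such as "'train' | 115954598 |" into (name, count)."""
    # Drop the trailing pipe, then split the remaining cells.
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    name, count = cells[0].strip("'"), int(cells[1])
    return name, count


split_name, num_examples = parse_split_row("'train' | 115954598 |")
print(f"{split_name}: {num_examples:,} examples")
```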
unshuffled_deduplicated_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 33925 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 886223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 511 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 312644 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 294132 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15503 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 64 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9161 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32919 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 201117 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
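The load commands on this page all follow one naming pattern: `huggingface:oscar/unshuffled_<variant>_<language code>`. As a minimal sketch, the name string can be built programmatically. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and only the two variants that appear on this page (`original` and `deduplicated`) are assumed to exist.

```python
# Variants that appear in configuration names on this page; no others are assumed.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(lang, variant="original"):
    """Build the TFDS name string following the pattern used on this page."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    if not lang:
        raise ValueError("language code must be a non-empty string")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang, variant="original", **kwargs):
    """Load one configuration via tfds.load (requires tensorflow_datasets)."""
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    return tfds.load(oscar_tfds_name(lang, variant), **kwargs)
```

Calling `oscar_tfds_name("af")` yields the same string as the command shown for `unshuffled_original_af`; `load_oscar` simply forwards that string and any extra keyword arguments to `tfds.load`.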
unshuffled_original_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16365602 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 456 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 336 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 37085 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21001388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 104913504 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
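With 104913504 examples in `train`, this configuration may be larger than needed for a quick experiment. The TFDS split API accepts slice expressions such as `train[:1000]`; whether the `huggingface:` community configurations handle such slices efficiently is an assumption here. The sketch below only builds the split string for a given example budget, and the helper name `split_for_budget` is hypothetical.

```python
def split_for_budget(total_examples, budget, split="train"):
    """Return a TFDS split string covering at most `budget` examples."""
    if total_examples < 0 or budget < 0:
        raise ValueError("counts must be non-negative")
    if budget >= total_examples:
        return split  # the whole split already fits within the budget
    return f"{split}[:{budget}]"


# Example counts copied from the split tables on this page.
print(split_for_budget(104913504, 10000))  # unshuffled_original_de
print(split_for_budget(4, 10000))          # unshuffled_original_bar
```

The resulting string could then be passed as the `split=` argument of `tfds.load`.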
unshuffled_original_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 10425596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 88199221 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8557453 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 640 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 582219 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 659430 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2638 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 62721527 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 524591 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1581 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 146993 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
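Every load command on this page has the same shape: the prefix `huggingface:oscar/` followed by the config name. The sketch below builds that name from a config string and wraps `tfds.load` in a small function. The helper names `tfds_name` and `load_config` are hypothetical, and actually loading a config assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
def tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config such as 'unshuffled_original_ky'."""
    return f"huggingface:oscar/{config}"


def load_config(config: str):
    """Load one OSCAR config through TFDS (requires tensorflow_datasets and network access)."""
    # Imported lazily so that tfds_name() stays usable without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(config))
```

When `tfds.load` is called without a `split` argument it returns a dictionary keyed by split name, so the training data would be available as `load_config("unshuffled_original_ky")["train"]`.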
unshuffled_original_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 137 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2977757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
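The split tables on this page share one layout: a header row, a separator row, and one row per split giving its quoted name and example count. As a rough sketch, assuming that layout holds, the hypothetical helper `parse_split_table` below turns such a table into a dictionary, using the row from this section as sample input.

```python
def parse_split_table(table: str) -> dict[str, int]:
    """Map each split name in a pipe-delimited split table to its example count."""
    counts = {}
    # The first two rows are the header and the '---|---|' separator.
    for line in table.strip().splitlines()[2:]:
        cells = [cell.strip() for cell in line.split("|")]
        name = cells[0].strip("'")  # split names are wrapped in single quotes
        counts[name] = int(cells[1])
    return counts


# Sample input: the table from this section.
TABLE = """\
Split | Examples |
---|---|
'train' | 2977757 |
"""

print(parse_split_table(TABLE))
```

Because the header row is skipped by position rather than by its text, the helper does not depend on how the column titles are worded.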
unshuffled_original_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3212 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 395605 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1055 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299938 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5546211 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 127467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4599 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 22301 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 203082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 672077 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41986 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6064129 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 135923 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 638596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3366 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
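When examples are iterated as NumPy values, TFDS typically returns `string` features as UTF-8 encoded `bytes` rather than `str`. The sketch below normalises the `text` field under that assumption. The names `decode_text` and `iter_texts` are hypothetical helpers, and `iter_texts` assumes `tensorflow_datasets` is installed.

```python
def decode_text(raw):
    """Normalise a `text` value to `str`, accepting bytes or str input."""
    if isinstance(raw, bytes):
        # errors="replace" keeps iteration going if a record is not valid UTF-8.
        return raw.decode("utf-8", errors="replace")
    return str(raw)


def iter_texts(config="huggingface:oscar/unshuffled_original_yue"):
    """Yield decoded texts from the 'train' split of the given configuration."""
    # Imported lazily so that decode_text works without the dependency.
    import tensorflow_datasets as tfds

    ds = tfds.load(config, split="train")
    for example in tfds.as_numpy(ds):
        yield decode_text(example["text"])
```

Small configurations such as this one (11 examples) are a convenient place to try the loop before moving to larger ones.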
unshuffled_original_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 455994980 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
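With roughly 456 million examples in `train`, loading the whole English configuration at once may be impractical. The sketch below uses the TFDS split-slicing syntax (`train[:N]`) to request only the first examples. The helper names are hypothetical, and whether slicing avoids preparing the full dataset for `huggingface:` community datasets is an assumption worth verifying before relying on it.

```python
def oscar_config(lang, deduplicated=False):
    """Build a config name following the pattern used in this listing."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def head_split(n, split="train"):
    """Return a TFDS split string selecting the first `n` examples."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return f"{split}[:{n}]"


def load_head(lang, n=1000, deduplicated=False):
    """Load the first `n` examples; assumes tensorflow_datasets is installed."""
    # Imported lazily so the string helpers work without the dependency.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config(lang, deduplicated), split=head_split(n))
```

For example, `load_head("en", n=1000)` would request `train[:1000]` from `huggingface:oscar/unshuffled_original_en`.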
unshuffled_original_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 506883 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 544388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3808397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16236463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 625673 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1445 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 350363 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1549 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34807 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 52910 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 123 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 437871 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 232329 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 73 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34682142 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
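The load commands on this page appear to follow a single naming pattern, `huggingface:oscar/unshuffled_<variant>_<language code>`. The sketch below builds such a name from its parts, assuming the pattern holds for every entry. The helper `oscar_tfds_name` is hypothetical and not part of TFDS.

```python
def oscar_tfds_name(lang: str, variant: str = "original") -> str:
    """Build a TFDS dataset name for an OSCAR config (hypothetical helper).

    Assumes the naming pattern seen on this page:
    huggingface:oscar/unshuffled_<variant>_<lang>
    """
    if variant not in ("original", "deduplicated"):
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


# The resulting string is what would be passed to tfds.load(...).
name = oscar_tfds_name("or")
```

Restricting `variant` to the two values that occur on this page is a design choice for catching typos early; it does not verify that a given language code actually exists.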
unshuffled_original_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 35440972 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
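Every config on this page lists the same two features, an `int64` field `id` and a `string` field `text`. As a rough illustration, the sketch below checks a plain Python record against that feature description. The `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are assumptions made for this example. Values loaded through TFDS typically arrive as tensors rather than plain Python objects, so they would need converting first.

```python
import json

# Feature description copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype names above to plain Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
ok = matches_features({"id": 0, "text": "example"}, features)
bad = matches_features({"id": "0", "text": "example"}, features)
```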
unshuffled_original_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42114520 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 161836003 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 44280 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1746604 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 805 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 475703 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 458206 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 22255 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 73 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9760 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59364 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}