oscar

Referanslar:

unshuffled_deduplicated_af

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişim kurulabilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 130640
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_als

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4518
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_arz

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişim kurulabilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 79928
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_an

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2025
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ast

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5343
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ba

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 27050
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_am

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 43102
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_as

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 9212
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_azb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 9985
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_be

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 307405
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 15762
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bxr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 36
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ceb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişim kurulabilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 26145
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_az

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 626796
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bcl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 98225
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_dsb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 37
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1114481
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bs

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 702
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ce

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2984
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_cv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 10130
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_diq

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eml

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 80
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_et

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1172041
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3398679
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_bpy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1770
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ca

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2458067
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ckb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 68210
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ar

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • İhlalde bulunduğu iddia edilen materyali ve materyali bulmamıza olanak sağlayacak makul ölçüde yeterli bilgiyi açıkça tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 9006977
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_av

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler bu lisans şeması kapsamında yayınlanmaktadır. Bu verilerin alındığı hiçbir metnin sahibi değiliz. Bu verilerin fiili paketlenmesine Creative Commons CC0 lisansı ("hiçbir hak saklı değildir") kapsamında lisans veriyoruz http://creativecommons.org/publicdomain/zero/1.0/ Inria, kanunlar kapsamında mümkün olduğu ölçüde, tüm telif haklarından ve ilgili veya OSCAR'a komşu haklar Bu çalışma yayınlandığı kaynak: Fransa.

    Verilerimizin size ait materyal içerdiğini ve bu nedenle burada çoğaltılmaması gerektiğini düşünüyorsanız lütfen:

    • Sizinle iletişime geçilebilecek adres, telefon numarası veya e-posta adresi gibi ayrıntılı iletişim bilgileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça tanımlayın.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 360
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_ddeduplicated_bar

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_bh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 82
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_br

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 14724
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_cbk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_da

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4771098
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_dv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 17024
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfffled_deduplicated_eo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 84752
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_fa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 8203495
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_fy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 20661
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_gn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 68
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_cs

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 12308039
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_hi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1909387
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_hu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 6582908
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_ie

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 11
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_fr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 59448891
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_gd

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3883
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_gu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 169834
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_hsb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3084
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_ia

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 529
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_io

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 617
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unffuffled_deduplicated_jbo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 617
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unffuffled_deduplicated_km

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 108346
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_ku

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 29054
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Shuffled_deduplicated_la

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 18808
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_lmo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1374
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_lv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 843195
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_min

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 166
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_mr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 212556
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_mwl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Adres, telefon numarası veya sizinle iletişime geçilebilecek e -posta adresi gibi ayrıntılı iletişim verileriyle kendinizi açıkça tanımlayın.
    • İhlal edildiği iddia edilen telif hakkıyla korunan çalışmayı açıkça belirleyin.
    • Materyali bulmamıza izin vermek için ihlal ettiği iddia edilen materyali ve bilgileri makul bir şekilde yeterli olarak tanımlayın.

    Etkilenen kaynakları korpusun bir sonraki sürümünden kaldırarak meşru taleplere uyacağız.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 7
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

Unwfflled_deduplicated_nah

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • Lisans : Bu veriler, bu verilerin çıkarıldığı metnin hiçbirine sahip değiliz. Bu verilerin gerçek ambalajını Creative Commons CC0 lisansı ("Hakları Saklıdır") altında lisanslıyoruz http://creativecommons.org/publicdomain/zero/1.0/ yasalar altında mümkün olan ölçüde, Inria tüm telif hakkı ve ilgili veya ilgili feragat etti. Oscar'a Komşu Haklar Bu çalışma: Fransa'dan yayınlandı.

    Verilerimizin size ait olduğu ve bu nedenle burada çoğaltılmaması gerektiğini düşünürseniz, lütfen:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 58
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_new

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2126
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_oc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 6485
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pam

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ps

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 67921
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_it

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 28522082
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ka

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 372158
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ro

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5044757
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_scn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 17
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ko

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3675420
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kw

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 68
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lez

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1381
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lrc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 72
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 13343
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ml

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 453904
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ms

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 183443
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_myv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nds

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 8714
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 109118
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_os

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2559
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pms

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2859
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_qu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 411
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 7121
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2820821
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 17610
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_so

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 42
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 645747
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ta

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 833101
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4694
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tyv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 24
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uz

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 15074
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 677
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xmf

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2418
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 11014487
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 56259
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_de

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 62398034
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 11596446
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_el

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 6521169
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_uk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 7782375
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 9897709
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_wuu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 64
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 49
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_als

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_als')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 7324
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_arz

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 158113
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_az

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_az')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 912330
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bcl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1675515
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bs

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2143
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ce

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4042
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 20281
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_diq

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eml

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 84
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_et

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_et')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2093621
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_zh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 41708901
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_an

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_an')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2449
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ast

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 6999
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ba

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 42551
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5869686
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bpy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 6046
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ca

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4390754
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ckb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 103639
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_es

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 56326016
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_da

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_da')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 7664010
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 21018
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 121168
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_fi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5326443
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ga

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 46493
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gom

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 484
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 321484
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_hy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 396093
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ilo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1578
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 13704702
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 33053
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 106
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3264660
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 11197780
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ie

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 101
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ja

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 39496439
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 338073
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_krc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1377
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ky

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 86561
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_li

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 118
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1737411
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mhr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 2515
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 197878
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 16383
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mzn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 917
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ne

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 219334
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_no

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3229940
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 87235
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pnb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3463
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_rm

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 34
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sah

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 8555
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_si

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 120684
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sq

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 461598
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sw

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 24803
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_th

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3749826
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 82738
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ur

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 428674
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3317
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_xal

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 36
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yue

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 7
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_am

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_am')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 83663
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_as

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_as')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 14985
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_azb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 15446
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_be

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_be')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 586031
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 26795
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bxr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 42
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ceb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 56248
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 157698
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_dsb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 65
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 96742378
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gd

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 5799
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 240691
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hsb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 7959
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ia

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1040
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_io

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_io')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 694
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jbo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 832
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_km

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_km')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 159363
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ku

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 46535
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_la

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_la')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 94588
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lmo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1401
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1593820
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_min

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_min')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 220
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 326804
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mwl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 8
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nah

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 61
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_new

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_new')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4696
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_oc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 10709
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pam

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ps

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 98216
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ro

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 9387265
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_scn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 21
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 5492194
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1013619
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ta

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 1263280
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 6456
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tyv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 34
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uz

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 27537
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1001
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xmf

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3783
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_it

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_it')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 46981781
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ka

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 563916
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ko

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 7345075
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kw

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 203
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lez

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1485
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lrc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 88
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 17957
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ml

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 603937
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ms

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 534016
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_myv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 6
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nds

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 18174
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 185884
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_os

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_os')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 5213
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pms

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3225
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_qu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 452
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 14291
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 36700
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_so

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_so')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 156
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 17395625
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tg

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 89002
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 18535253
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_uk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 12973467
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 14898250
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_wuu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 214
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 214
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_zh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 60137667
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_en

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 304230423
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_eu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 256513
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_frr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 7
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_gl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 284320
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_he

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2375030
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ht

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 9
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_id

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 9948521
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_is

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 389515
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_jv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1163
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 251064
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_kv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 924
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 21735
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_lo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 32652
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mai

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 25
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 299457
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_mrj

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 669
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_my

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 136639
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nap

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 55
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_nl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 20812149
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_or

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 44230
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 20682611
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_pt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 26920397
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ru

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 115954598
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sd

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 33925
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_sl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 886223
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_su

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 511
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_te

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 312644
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_tl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 294132
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_ug

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 15503
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_vec

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 64
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_war

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 9161
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_deduplicated_yi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 32919
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_af

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_af')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 201117
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ar

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 16365602
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_av

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_av')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 456
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bar

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 4
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_bh

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 336
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_br

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_br')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 37085
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cbk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_cs

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 21001388
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_de

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_de')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 104913504
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_el

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_el')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 10425596
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_es

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_es')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 88199221
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_fi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 8557453
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ga

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 83223
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gom

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 640
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 582219
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_hy

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 659430
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ilo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 2638
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ja

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 62721527
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 524591
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_krc

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1581
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ky

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 146993
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_li

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_li')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 137
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 2977757
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mhr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 3212
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 395605
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 26598
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mzn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1055
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ne

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 299938
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_no

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_no')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 5546211
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pa

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 127467
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pnb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 4599
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_rm

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 41
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sah

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 22301
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_si

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_si')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 203082
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sq

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 672077
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sw

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 41986
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_th

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_th')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 6064129
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 135923
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ur

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 638596
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3366
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_xal

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 39
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yue

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 11
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_en

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_en')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Bölünmeler :

Bölmek Örnekler
'train' 455994980
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_eu

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 506883
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_frr

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 7
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_gl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 544388
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_he

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_he')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 3808397
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ht

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 13
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_id

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_id')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 16236463
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_is

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_is')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 625673
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_jv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1445
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kn

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 350363
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_kv

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1549
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lb

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 34807
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_lo

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 52910
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mai

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 123
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mk

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 437871
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_mrj

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 757
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_my

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_my')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 232329
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nap

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 73
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_nl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 34682142
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_or

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_or')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 59463
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 35440972
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_pt

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 42114520
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ru

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 161836003
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sd

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 44280
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_sl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 1746604
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_su

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_su')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Sürüm : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 805
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_te

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_te')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 475703
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_tl

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 458206
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_ug

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 22255
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_vec

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 73
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_war

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_war')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 9760
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

unshuffled_original_yi

Bu veri kümesini TFDS'ye yüklemek için aşağıdaki komutu kullanın:

ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
  • Tanım :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
  • License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.

    Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

    • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
    • Clearly identify the copyrighted work claimed to be infringed.
    • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

    We will comply to legitimate requests by removing the affected sources from the next release of the corpus.

  • Version : 1.0.0

  • Splits :

Bölmek Örnekler
'train' 59364
  • Özellikler :
{
    "id": {
        "dtype": "int64",
        "id": null,
        "_type": "Value"
    },
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}