c4

  • Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on the Common Crawl dataset: https://commoncrawl.org

To generate this dataset, please follow the instructions from t5.

Because of the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service such as Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
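
The distributed preparation described above might be set up roughly as follows. This is a minimal sketch, assuming `tensorflow_datasets` and `apache_beam` are installed along with any extra dependencies the t5 instructions call for. The helper names `dataflow_flags` and `prepare_c4`, and the project, bucket, region, and job name values, are hypothetical placeholders rather than anything this page prescribes.

```python
def dataflow_flags(project, bucket, region, job_name="c4-en-gen"):
    """Build Beam pipeline flags for a Cloud Dataflow run.

    Every value here is a caller-supplied placeholder (assumption).
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]


def prepare_c4(config, data_dir, flags):
    """Generate one C4 config through a Beam pipeline.

    Imports are deferred so the flag helper above works without
    TensorFlow Datasets or Apache Beam installed.
    """
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_datasets as tfds

    download_config = tfds.download.DownloadConfig(
        beam_options=PipelineOptions(flags=flags)
    )
    builder = tfds.builder(f"c4/{config}", data_dir=data_dir)
    builder.download_and_prepare(download_config=download_config)
    return builder
```

A call might look like `prepare_c4("en", "gs://my-bucket/tfds", dataflow_flags("my-project", "my-bucket", "us-central1"))`, where every argument is a made-up value.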

FeaturesDict({
    'content-length': Text(shape=(), dtype=string),
    'content-type': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
    'timestamp': Text(shape=(), dtype=string),
    'url': Text(shape=(), dtype=string),
})
  • Feature documentation:

Feature         Class         Shape  Dtype   Description
                FeaturesDict
content-length  Text                 string
content-type    Text                 string
text            Text                 string
timestamp       Text                 string
url             Text                 string
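
Since every feature above is a scalar string, records read as NumPy values typically arrive as `bytes`. The sketch below shows one way to turn such a record into plain Python types. The function name `decode_example` and the sample record are illustrative assumptions, and converting `content-length` to an integer is a convenience, not something the dataset does for you.

```python
def _to_str(value):
    """Decode bytes as UTF-8; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def decode_example(raw):
    """Convert one raw C4 record into a dict of Python values."""
    decoded = {
        key: _to_str(raw[key])
        for key in ("content-type", "text", "timestamp", "url")
    }
    # 'content-length' is stored as text; parse it as an integer here.
    decoded["content-length"] = int(_to_str(raw["content-length"]))
    return decoded


# Made-up record with the same five keys as the FeaturesDict.
sample = {
    "content-length": b"1234",
    "content-type": b"text/plain",
    "text": b"Hello, world.",
    "timestamp": b"2019-04-25T12:57:54Z",
    "url": b"https://example.com/page",
}
print(decode_example(sample))
```
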
@article{2019t5,
  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  journal = {arXiv e-prints},
  year = {2019},
  archivePrefix = {arXiv},
  eprint = {1910.10683},
}

c4/en (default config)

  • Config description: English C4 dataset.

  • Download size: 201.98 KiB

  • Dataset size: 806.87 GiB

  • Splits:

Split Examples
'train' 364,613,570
'validation' 364,724
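
With a dataset of this size, it is often practical to read only a slice of a split. The sketch below builds a TFDS percent-slicing string and passes it to `tfds.load`. It assumes the `c4/en` config has already been prepared under `data_dir` (hence `download=False`). The helper names `slice_spec` and `load_c4_en` are hypothetical.

```python
def slice_spec(split, percent):
    """Return a TFDS slicing string such as 'validation[:1%]'."""
    if not isinstance(percent, int) or not 0 < percent <= 100:
        raise ValueError("percent must be an integer in the range 1..100")
    return f"{split}[:{percent}%]"


def load_c4_en(data_dir, split="validation", percent=1):
    """Load the first `percent` percent of a prepared c4/en split.

    The import is deferred so slice_spec can be used on its own.
    """
    import tensorflow_datasets as tfds

    return tfds.load(
        "c4/en",
        split=slice_spec(split, percent),
        data_dir=data_dir,
        download=False,  # assumes the data was generated beforehand
    )
```
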

c4/en.noclean

  • Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)

  • Download size: 177.11 KiB

  • Dataset size: 6.21 TiB

  • Splits:

Split Examples
'train' 1,063,805,169
'validation' 1,065,028

c4/realnewslike

  • Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).

  • Download size: 340.29 KiB

  • Dataset size: 36.91 GiB

  • Splits:

Split Examples
'train' 13,804,817
'validation' 13,855
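
The idea of keeping only records from an allowed set of domains can be illustrated with a small filter over the `url` feature. This is a sketch of the general technique, not the filtering code used to build this config. The allowlist below is invented for illustration and does not reproduce the RealNews domain list, and the `www.` stripping is a simplification.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the real RealNews domains are not listed here.
ALLOWED_DOMAINS = {"example-news.com", "news.example.org"}


def domain_of(url):
    """Return the host of a URL, with a leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def keep_record(record, allowed=ALLOWED_DOMAINS):
    """True if the record's URL belongs to an allowed domain."""
    return domain_of(record["url"]) in allowed


records = [
    {"url": "https://www.example-news.com/story/1"},
    {"url": "https://blog.example.net/post"},
]
kept = [r for r in records if keep_record(r)]
```
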

c4/webtextlike

  • Config description: Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).

  • Download size: 2.04 MiB

  • Dataset size: 17.93 GiB

  • Splits:

Split Examples
'train' 4,488,694
'validation' 4,486

c4/multilingual

  • Config description: Multilingual C4 (mC4) has 101 languages and is generated from 86 Common Crawl dumps.

  • Download size: 13.60 MiB

  • Dataset size: 38.49 TiB

  • Splits:

Split Examples
'af' 1,770,414
'af-validation' 1,757
'am' 291,570
'am-validation' 289
'ar' 92,455,378
'ar-validation' 92,374
'az' 7,179,300
'az-validation' 7,206
'be' 2,156,584
'be-validation' 2,103
'bg' 32,511,350
'bg-Latn' 44,290
'bg-Latn-validation' 41
'bg-validation' 32,690
'bn' 15,183,514
'bn-validation' 15,130
'ca' 19,438,615
'ca-validation' 19,562
'ceb' 415,208
'ceb-validation' 430
'co' 217,257
'co-validation' 211
'cs' 82,262,078
'cs-validation' 82,594
'cy' 1,066,595
'cy-validation' 1,016
'da' 36,884,558
'da-validation' 37,071
'de' 545,956,997
'de-validation' 547,566
'el' 68,577,376
'el-Latn' 162,004
'el-Latn-validation' 171
'el-validation' 69,435
'en' 3,928,733,379
'en-validation' 3,933,379
'eo' 560,151
'eo-validation' 546
'es' 591,272,119
'es-validation' 592,258
'et' 10,401,882
'et-validation' 10,276
'eu' 2,077,113
'eu-validation' 2,077
'fa' 81,252,911
'fa-validation' 81,034
'fi' 36,807,562
'fi-validation' 36,512
'fil' 2,331,209
'fil-validation' 2,381
'fr' 454,229,019
'fr-validation' 453,124
'fy' 502,656
'fy-validation' 478
'ga' 611,457
'ga-validation' 631
'gd' 201,237
'gd-validation' 196
'gl' 3,762,255
'gl-validation' 3,811
'gu' 1,292,191
'gu-validation' 1,323
'ha' 363,002
'ha-validation' 368
'haw' 103,043
'haw-validation' 99
'hi' 26,695,748
'hi-Latn' 251,231
'hi-Latn-validation' 261
'hi-validation' 26,721
'hmn' 157,016
'hmn-validation' 175
'ht' 232,354
'ht-validation' 246
'hu' 56,645,732
'hu-validation' 56,905
'hy' 3,873,029
'hy-validation' 3,804
'id' 19,423,746
'id-validation' 19,601
'ig' 110,582
'ig-validation' 103
'is' 3,139,312
'is-validation' 3,210
'it' 267,686,115
'it-validation' 267,322
'iw' 17,607,812
'iw-validation' 17,570
'ja' 85,226,039
'ja-Latn' 235,885
'ja-Latn-validation' 221
'ja-validation' 85,618
'jv' 218,969
'jv-validation' 253
'ka' 3,726,808
'ka-validation' 3,752
'kk' 3,421,165
'kk-validation' 3,443
'km' 1,384,128
'km-validation' 1,359
'kn' 1,916,445
'kn-validation' 1,895
'ko' 24,035,493
'ko-validation' 24,240
'ku' 399,027
'ku-validation' 417
'ky' 1,198,504
'ky-validation' 1,188
'la' 1,632,557
'la-validation' 1,630
'lb' 850,921
'lb-validation' 856
'lo' 302,612
'lo-validation' 290
'lt' 18,234,466
'lt-validation' 18,428
'lv' 9,882,376
'lv-validation' 10,034
'mg' 263,321
'mg-validation' 254
'mi' 148,146
'mi-validation' 156
'mk' 3,599,707
'mk-validation' 3,713
'ml' 3,604,562
'ml-validation' 3,514
'mn' 2,947,312
'mn-validation' 3,021
'mr' 4,555,599
'mr-validation' 4,602
'ms' 4,688,036
'ms-validation' 4,719
'mt' 1,109,191
'mt-validation' 1,207
'my' 1,248,242
'my-validation' 1,314
'ne' 4,679,412
'ne-validation' 4,738
'nl' 136,379,427
'nl-validation' 137,142
'no' 30,644,684
'no-validation' 31,134
'ny' 114,952
'ny-validation' 121
'pa' 729,394
'pa-validation' 719
'pl' 178,690,573
'pl-validation' 178,481
'ps' 497,321
'ps-validation' 468
'pt' 246,401,954
'pt-validation' 246,120
'ro' 66,499,899
'ro-validation' 66,384
'ru' 1,014,064,014
'ru-Latn' 582,022
'ru-Latn-validation' 616
'ru-validation' 1,014,169
'sd' 210,835
'sd-validation' 206
'si' 846,125
'si-validation' 846
'sk' 26,721,250
'sk-validation' 26,882
'sl' 12,381,886
'sl-validation' 12,381
'sm' 102,125
'sm-validation' 108
'sn' 124,984
'sn-validation' 116
'so' 1,168,106
'so-validation' 1,212
'sq' 7,023,573
'sq-validation' 7,057
'sr' 4,775,217
'sr-validation' 4,804
'st' 99,970
'st-validation' 103
'su' 153,302
'su-validation' 151
'sv' 63,308,307
'sv-validation' 63,488
'sw' 1,279,408
'sw-validation' 1,296
'ta' 5,769,533
'ta-validation' 5,770
'te' 2,034,828
'te-validation' 2,010
'tg' 1,563,304
'tg-validation' 1,526
'th' 28,021,205
'th-validation' 28,062
'tr' 132,662,955
'tr-validation' 133,062
'uk' 56,159,593
'uk-validation' 56,321
'und' 3,650,492,732
'und-validation' 3,656,588
'ur' 3,432,478
'ur-validation' 3,443
'uz' 1,183,603
'uz-validation' 1,259
'vi' 132,667,573
'vi-validation' 132,915
'xh' 122,232
'xh-validation' 117
'yi' 173,510
'yi-validation' 166
'yo' 86,686
'yo-validation' 82
'zh' 214,856,503
'zh-Latn' 471,314
'zh-Latn-validation' 492
'zh-validation' 214,733
'zu' 261,239
'zu-validation' 253
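
In the table above, each language code names its training split and `<code>-validation` names the matching validation split. The helper below builds that pair of names for a given code. It is a small sketch: the function names are hypothetical, and `load_mc4_validation` assumes the multilingual config has already been prepared under `data_dir`.

```python
def mc4_split_names(lang):
    """Return the (train, validation) split names for one mC4 language."""
    if not lang:
        raise ValueError("lang must be a non-empty language code")
    return lang, f"{lang}-validation"


def load_mc4_validation(lang, data_dir):
    """Load the validation split for one language from prepared data.

    The import is deferred so mc4_split_names works without
    TensorFlow Datasets installed.
    """
    import tensorflow_datasets as tfds

    _, validation = mc4_split_names(lang)
    return tfds.load(
        "c4/multilingual",
        split=validation,
        data_dir=data_dir,
        download=False,  # assumes the data was generated beforehand
    )
```

The same naming applies to the romanized variants, so `mc4_split_names("hi-Latn")` gives the `'hi-Latn'` and `'hi-Latn-validation'` names listed above.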