- Description:
A colossal, cleaned version of Common Crawl's web crawl corpus.
Based on the Common Crawl dataset: https://commoncrawl.org
To generate this dataset, please follow the instructions from t5.
Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
- Additional documentation: Explore on Papers With Code
- Homepage: https://github.com/google-research/text-to-text-transfer-transformer#datasets
- Source code: tfds.text.C4
- Versions:
  - 2.2.0: No release notes.
  - 2.2.1: No release notes.
  - 2.3.0: No release notes.
  - 2.3.1: No release notes.
  - 3.1.0 (default): No release notes.
- Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
You are using a C4 config that requires some files to be manually downloaded. For c4/webtextlike, download OpenWebText.zip from https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ
- Auto-cached (documentation): No
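The manual download step above can be verified before starting a long preparation job. The sketch below is an illustration, not part of TFDS: the helper `missing_manual_files` and the `REQUIRED_MANUAL_FILES` mapping are hypothetical names, and the mapping only contains what this page states, namely that `c4/webtextlike` needs `OpenWebText.zip` in the manual directory.

```python
from pathlib import Path

# Default manual directory named in the instructions above.
DEFAULT_MANUAL_DIR = Path("~/tensorflow_datasets/downloads/manual").expanduser()

# Assumption: only the file listed on this page is tracked here.
REQUIRED_MANUAL_FILES = {"c4/webtextlike": ["OpenWebText.zip"]}


def missing_manual_files(config, manual_dir=DEFAULT_MANUAL_DIR):
    """Return the manually downloaded files that are absent for `config`.

    Hypothetical helper for illustration; configs without an entry in
    REQUIRED_MANUAL_FILES are treated as needing no manual files.
    """
    manual_dir = Path(manual_dir)
    required = REQUIRED_MANUAL_FILES.get(config, [])
    return [name for name in required if not (manual_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_manual_files("c4/webtextlike")
    if missing:
        print(f"Missing in {DEFAULT_MANUAL_DIR}: {', '.join(missing)}")
    else:
        print("All manual files are in place.")
```

Passing a different `manual_dir` mirrors overriding `download_config.manual_dir` when the files live somewhere other than the default location.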
- Feature structure:
FeaturesDict({
'content-length': Text(shape=(), dtype=string),
'content-type': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
'timestamp': Text(shape=(), dtype=string),
'url': Text(shape=(), dtype=string),
})
- Feature documentation:
| Feature | Class | Shape | Dtype | Description |
|---|---|---|---|---|
| | FeaturesDict | | | |
| content-length | Text | | string | |
| content-type | Text | | string | |
| text | Text | | string | |
| timestamp | Text | | string | |
| url | Text | | string | |
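All five features listed above are scalar strings. As a hedged sketch of how one record could be handled in plain Python, the block below decodes a record whose values arrive as UTF-8 `bytes`, which is how string features typically come back when a dataset is iterated as NumPy. The `C4Example` class and the `decode_example` helper are hypothetical names introduced for illustration, and the sample record is made up.

```python
from dataclasses import dataclass

# Feature keys exactly as listed in the feature structure above.
FEATURE_KEYS = ("content-length", "content-type", "text", "timestamp", "url")


@dataclass(frozen=True)
class C4Example:
    """Hypothetical container mirroring the five string features."""
    content_length: str
    content_type: str
    text: str
    timestamp: str
    url: str


def decode_example(raw):
    """Convert a dict of bytes (or str) values into a C4Example."""
    missing = [key for key in FEATURE_KEYS if key not in raw]
    if missing:
        raise KeyError(f"missing features: {missing}")
    values = {}
    for key in FEATURE_KEYS:
        value = raw[key]
        # Dataclass fields cannot contain '-', so map it to '_'.
        values[key.replace("-", "_")] = (
            value.decode("utf-8") if isinstance(value, bytes) else str(value)
        )
    return C4Example(**values)


# Made-up record for illustration only.
sample = {
    "content-length": b"11",
    "content-type": b"text/plain",
    "text": b"Hello world",
    "timestamp": b"2019-04-25T12:57:54Z",
    "url": b"https://example.com/",
}
example = decode_example(sample)
print(example.url, int(example.content_length))
```

Note that `content-length` is stored as text, so any numeric use requires an explicit conversion such as `int(...)`.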
- Supervised keys (See as_supervised doc): None
- Figure (tfds.show_examples): Not supported.
- Citation:
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
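The preparation notes above (manual directory, distributed processing with Beam) can be tied together in a short sketch. This is an illustration under stated assumptions, not the procedure prescribed by t5: `c4_name` and `prepare_and_load` are hypothetical helpers, the Beam flags are whatever your runner requires, and the imports are kept inside the function so the file can be read without TensorFlow Datasets or Apache Beam installed.

```python
def c4_name(config="en"):
    """Build the TFDS dataset name, e.g. 'c4/en' or 'c4/realnewslike'."""
    return f"c4/{config}"


def prepare_and_load(config="en", split="validation", data_dir=None,
                     manual_dir=None, beam_flags=None):
    """Prepare a C4 config (if needed) and return one split as a dataset.

    Hypothetical helper: it forwards `manual_dir` and optional Beam
    pipeline flags to tfds.download.DownloadConfig.
    """
    import tensorflow_datasets as tfds

    config_kwargs = {}
    if manual_dir is not None:
        config_kwargs["manual_dir"] = manual_dir
    if beam_flags is not None:
        from apache_beam.options.pipeline_options import PipelineOptions
        config_kwargs["beam_options"] = PipelineOptions(flags=beam_flags)

    builder = tfds.builder(c4_name(config), data_dir=data_dir)
    builder.download_and_prepare(
        download_config=tfds.download.DownloadConfig(**config_kwargs))
    return builder.as_dataset(split=split)
```

Without Beam flags the pipeline would run locally, which is unlikely to be practical for the larger configs given the dataset sizes listed on this page.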
c4/en (default config)
- Config description: English C4 dataset.
- Download size: 201.98 KiB
- Dataset size: 806.87 GiB
- Splits:
Split | Examples |
---|---|
'train' | 364.613.570 |
'validation' | 364.724 |
c4/en.noclean
- Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)
- Download size: 177.11 KiB
- Dataset size: 6.21 TiB
- Splits:
Split | Examples |
---|---|
'train' | 1.063.805.169 |
'validation' | 1.065.028 |
c4/realnewslike
- Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).
- Download size: 340.29 KiB
- Dataset size: 36.91 GiB
- Splits:
Split | Examples |
---|---|
'train' | 13.804.817 |
'validation' | 13.855 |
c4/webtextlike
- Config description: Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).
- Download size: 2.04 MiB
- Dataset size: 17.93 GiB
- Splits:
Split | Examples |
---|---|
'train' | 4.488.694 |
'validation' | 4.486 |
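The split tables for the four English configs print example counts with "." as the thousands separator, so `364.724` means 364,724 examples. As a worked check of how the splits relate, the sketch below recomputes the validation share and a rough average stored size per example from the figures listed above. The names are illustrative, and the per-example size is only a coarse estimate of on-disk bytes, not of text length.

```python
# (train examples, validation examples, dataset size, unit) copied from the
# tables above; counts are written as plain integers.
CONFIGS = {
    "c4/en": (364_613_570, 364_724, 806.87, "GiB"),
    "c4/en.noclean": (1_063_805_169, 1_065_028, 6.21, "TiB"),
    "c4/realnewslike": (13_804_817, 13_855, 36.91, "GiB"),
    "c4/webtextlike": (4_488_694, 4_486, 17.93, "GiB"),
}
UNIT_BYTES = {"GiB": 2**30, "TiB": 2**40}


def summarize(train, validation, size, unit):
    """Return (validation share, average bytes per example)."""
    total = train + validation
    return validation / total, size * UNIT_BYTES[unit] / total


for name, row in CONFIGS.items():
    share, avg_bytes = summarize(*row)
    print(f"{name}: validation {share:.3%}, ~{avg_bytes / 1024:.1f} KiB per example")
```

With these figures, every config holds out roughly 0.1% of its examples for validation.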
c4/multilingual
- Config description: Multilingual C4 (mC4) has 101 languages and is generated from 86 Common Crawl dumps.
- Download size: 13.60 MiB
- Dataset size: 38.49 TiB
- Splits:
Split | Examples |
---|---|
'af' | 1.770.414 |
'af-validation' | 1.757 |
'am' | 291.570 |
'am-validation' | 289 |
'ar' | 92.455.378 |
'ar-validation' | 92.374 |
'az' | 7.179.300 |
'az-validation' | 7.206 |
'be' | 2.156.584 |
'be-validation' | 2.103 |
'bg' | 32.511.350 |
'bg-Latn' | 44.290 |
'bg-Latn-validation' | 41 |
'bg-validation' | 32.690 |
'bn' | 15.183.514 |
'bn-validation' | 15.130 |
'ca' | 19.438.615 |
'ca-validation' | 19.562 |
'ceb' | 415.208 |
'ceb-validation' | 430 |
'co' | 217.257 |
'co-validation' | 211 |
'cs' | 82.262.078 |
'cs-validation' | 82.594 |
'cy' | 1.066.595 |
'cy-validation' | 1.016 |
'da' | 36.884.558 |
'da-validation' | 37.071 |
'de' | 545.956.997 |
'de-validation' | 547.566 |
'el' | 68.577.376 |
'el-Latn' | 162.004 |
'el-Latn-validation' | 171 |
'el-validation' | 69.435 |
'en' | 3.928.733.379 |
'en-validation' | 3.933.379 |
'eo' | 560.151 |
'eo-validation' | 546 |
'es' | 591.272.119 |
'es-validation' | 592.258 |
'et' | 10.401.882 |
'et-validation' | 10.276 |
'eu' | 2.077.113 |
'eu-validation' | 2.077 |
'fa' | 81.252.911 |
'fa-validation' | 81.034 |
'fi' | 36.807.562 |
'fi-validation' | 36.512 |
'fil' | 2.331.209 |
'fil-validation' | 2.381 |
'fr' | 454.229.019 |
'fr-validation' | 453.124 |
'fy' | 502.656 |
'fy-validation' | 478 |
'ga' | 611.457 |
'ga-validation' | 631 |
'gd' | 201.237 |
'gd-validation' | 196 |
'gl' | 3.762.255 |
'gl-validation' | 3.811 |
'gu' | 1.292.191 |
'gu-validation' | 1.323 |
'ha' | 363.002 |
'ha-validation' | 368 |
'haw' | 103.043 |
'haw-validation' | 99 |
'hi' | 26.695.748 |
'hi-Latn' | 251.231 |
'hi-Latn-validation' | 261 |
'hi-validation' | 26.721 |
'hmn' | 157.016 |
'hmn-validation' | 175 |
'ht' | 232.354 |
'ht-validation' | 246 |
'hu' | 56.645.732 |
'hu-validation' | 56.905 |
'hy' | 3.873.029 |
'hy-validation' | 3.804 |
'id' | 19.423.746 |
'id-validation' | 19.601 |
'ig' | 110.582 |
'ig-validation' | 103 |
'is' | 3.139.312 |
'is-validation' | 3.210 |
'it' | 267.686.115 |
'it-validation' | 267.322 |
'iw' | 17.607.812 |
'iw-validation' | 17.570 |
'ja' | 85.226.039 |
'ja-Latn' | 235.885 |
'ja-Latn-validation' | 221 |
'ja-validation' | 85.618 |
'jv' | 218.969 |
'jv-validation' | 253 |
'ka' | 3.726.808 |
'ka-validation' | 3.752 |
'kk' | 3.421.165 |
'kk-validation' | 3.443 |
'km' | 1.384.128 |
'km-validation' | 1.359 |
'kn' | 1.916.445 |
'kn-validation' | 1.895 |
'ko' | 24.035.493 |
'ko-validation' | 24.240 |
'ku' | 399.027 |
'ku-validation' | 417 |
'ky' | 1.198.504 |
'ky-validation' | 1.188 |
'la' | 1.632.557 |
'la-validation' | 1.630 |
'lb' | 850.921 |
'lb-validation' | 856 |
'lo' | 302.612 |
'lo-validation' | 290 |
'lt' | 18.234.466 |
'lt-validation' | 18.428 |
'lv' | 9.882.376 |
'lv-validation' | 10.034 |
'mg' | 263.321 |
'mg-validation' | 254 |
'mi' | 148.146 |
'mi-validation' | 156 |
'mk' | 3.599.707 |
'mk-validation' | 3.713 |
'ml' | 3.604.562 |
'ml-validation' | 3.514 |
'mn' | 2.947.312 |
'mn-validation' | 3.021 |
'mr' | 4.555.599 |
'mr-validation' | 4.602 |
'ms' | 4.688.036 |
'ms-validation' | 4.719 |
'mt' | 1.109.191 |
'mt-validation' | 1.207 |
'my' | 1.248.242 |
'my-validation' | 1.314 |
'ne' | 4.679.412 |
'ne-validation' | 4.738 |
'nl' | 136.379.427 |
'nl-validation' | 137.142 |
'no' | 30.644.684 |
'no-validation' | 31.134 |
'ny' | 114.952 |
'ny-validation' | 121 |
'pa' | 729.394 |
'pa-validation' | 719 |
'pl' | 178.690.573 |
'pl-validation' | 178.481 |
'ps' | 497.321 |
'ps-validation' | 468 |
'pt' | 246.401.954 |
'pt-validation' | 246.120 |
'ro' | 66.499.899 |
'ro-validation' | 66.384 |
'ru' | 1.014.064.014 |
'ru-Latn' | 582.022 |
'ru-Latn-validation' | 616 |
'ru-validation' | 1.014.169 |
'sd' | 210.835 |
'sd-validation' | 206 |
'si' | 846.125 |
'si-validation' | 846 |
'sk' | 26.721.250 |
'sk-validation' | 26.882 |
'sl' | 12.381.886 |
'sl-validation' | 12.381 |
'sm' | 102.125 |
'sm-validation' | 108 |
'sn' | 124.984 |
'sn-validation' | 116 |
'so' | 1.168.106 |
'so-validation' | 1.212 |
'sq' | 7.023.573 |
'sq-validation' | 7.057 |
'sr' | 4.775.217 |
'sr-validation' | 4.804 |
'st' | 99.970 |
'st-validation' | 103 |
'su' | 153.302 |
'su-validation' | 151 |
'sv' | 63.308.307 |
'sv-validation' | 63.488 |
'sw' | 1.279.408 |
'sw-validation' | 1.296 |
'ta' | 5.769.533 |
'ta-validation' | 5.770 |
'te' | 2.034.828 |
'te-validation' | 2.010 |
'tg' | 1.563.304 |
'tg-validation' | 1.526 |
'th' | 28.021.205 |
'th-validation' | 28.062 |
'tr' | 132.662.955 |
'tr-validation' | 133.062 |
'uk' | 56.159.593 |
'uk-validation' | 56.321 |
'und' | 3.650.492.732 |
'und-validation' | 3.656.588 |
'ur' | 3.432.478 |
'ur-validation' | 3.443 |
'uz' | 1.183.603 |
'uz-validation' | 1.259 |
'vi' | 132.667.573 |
'vi-validation' | 132.915 |
'xh' | 122.232 |
'xh-validation' | 117 |
'yi' | 173.510 |
'yi-validation' | 166 |
'yo' | 86.686 |
'yo-validation' | 82 |
'zh' | 214.856.503 |
'zh-Latn' | 471.314 |
'zh-Latn-validation' | 492 |
'zh-validation' | 214.733 |
'zu' | 261.239 |
'zu-validation' | 253 |
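In the multilingual split table above, each language code names its training split, and the same code with a `-validation` suffix names the validation split. Counts are printed with "." as the thousands separator. The sketch below is an illustration with hypothetical helper names: it builds the split pair for a language code and parses a few rows copied from the table.

```python
def mc4_splits(lang):
    """Return the (train, validation) split names for a language code."""
    return lang, f"{lang}-validation"


def parse_count(text):
    """Parse a count printed with '.' as thousands separator, e.g. '1.757'."""
    return int(text.replace(".", ""))


# A few rows copied verbatim from the table above.
ROWS = {
    "af": "1.770.414",
    "af-validation": "1.757",
    "en": "3.928.733.379",
    "en-validation": "3.933.379",
    "bg-Latn": "44.290",
    "bg-Latn-validation": "41",
}

for lang in ("af", "en", "bg-Latn"):
    train_split, val_split = mc4_splits(lang)
    train = parse_count(ROWS[train_split])
    val = parse_count(ROWS[val_split])
    print(f"{lang}: {train:,} train / {val:,} validation")
```

Assuming the config has already been prepared, a split name built this way could be passed as the `split` argument when loading, for example `split="af-validation"`.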