- Description:
A colossal, cleaned version of Common Crawl's web crawl corpus.
Based on Common Crawl dataset: https://commoncrawl.org
To generate this dataset, please follow the instructions from t5.
Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service such as Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
Additional Documentation: Explore on Papers With Code
Homepage: https://github.com/google-research/text-to-text-transfer-transformer#datasets
Source code: tfds.text.C4
Versions:
2.2.0: No release notes.
2.2.1: No release notes.
2.3.0: No release notes.
2.3.1: No release notes.
3.1.0 (default): No release notes.
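A specific config and version can be pinned through the name string passed to `tfds.load`. The sketch below assumes the usual TFDS `dataset/config:version` naming. The helpers `c4_name` and `load_c4` are hypothetical, and they only know the configs and versions listed on this page. `tensorflow_datasets` is imported lazily so that the name helper works without it.

```python
# Versions and configs as listed on this page.
C4_VERSIONS = ("2.2.0", "2.2.1", "2.3.0", "2.3.1", "3.1.0")
C4_CONFIGS = ("en", "en.noclean", "realnewslike", "webtextlike", "multilingual")
DEFAULT_VERSION = "3.1.0"


def c4_name(config="en", version=DEFAULT_VERSION):
    """Build a TFDS name string such as 'c4/en:3.1.0' (hypothetical helper).

    Note: this only checks the two lists above. It does not know whether a
    given config is actually available at a given version.
    """
    if config not in C4_CONFIGS:
        raise ValueError(f"unknown C4 config: {config!r}")
    if version not in C4_VERSIONS:
        raise ValueError(f"unknown C4 version: {version!r}")
    return f"c4/{config}:{version}"


def load_c4(config="en", version=DEFAULT_VERSION, split="train", data_dir=None):
    """Load an already prepared C4 config. Requires tensorflow_datasets."""
    import tensorflow_datasets as tfds  # imported lazily on purpose

    return tfds.load(c4_name(config, version), split=split, data_dir=data_dir)
```

Calling `c4_name("realnewslike")` yields a name that pins the default version explicitly, which keeps a training script from silently switching versions after a library upgrade.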
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
You are using a C4 config that requires some files to be manually downloaded. For c4/webtextlike, download OpenWebText.zip from https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ
Auto-cached (documentation): No
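Before building `c4/webtextlike`, it can help to confirm that the manually downloaded archive is where TFDS will look for it. The sketch below assumes the default manual directory quoted in the instructions. `find_openwebtext` and `prepare_webtextlike` are hypothetical helpers, and the second one only shows how a directory could be passed through `tfds.download.DownloadConfig(manual_dir=...)`.

```python
from pathlib import Path

# Default location quoted in the manual download instructions.
DEFAULT_MANUAL_DIR = Path("~/tensorflow_datasets/downloads/manual").expanduser()


def find_openwebtext(manual_dir=DEFAULT_MANUAL_DIR, filename="OpenWebText.zip"):
    """Return the archive path if it exists in manual_dir, otherwise None."""
    candidate = Path(manual_dir).expanduser() / filename
    return candidate if candidate.is_file() else None


def prepare_webtextlike(manual_dir=DEFAULT_MANUAL_DIR):
    """Build c4/webtextlike from a manually downloaded archive.

    Requires tensorflow_datasets. Without extra Beam options this runs
    locally, which is unlikely to be practical for a dataset of this size.
    """
    import tensorflow_datasets as tfds  # imported lazily on purpose

    if find_openwebtext(manual_dir) is None:
        raise FileNotFoundError(f"OpenWebText.zip not found in {manual_dir}")
    builder = tfds.builder("c4/webtextlike")
    config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    builder.download_and_prepare(download_config=config)
    return builder
```

Checking for the file first turns a late failure inside the preparation pipeline into an immediate, readable error.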
Feature structure:
FeaturesDict({
'content-length': Text(shape=(), dtype=string),
'content-type': Text(shape=(), dtype=string),
'text': Text(shape=(), dtype=string),
'timestamp': Text(shape=(), dtype=string),
'url': Text(shape=(), dtype=string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
FeaturesDict | | | | |
content-length | Text | | string | |
content-type | Text | | string | |
text | Text | | string | |
timestamp | Text | | string | |
url | Text | | string | |
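All five features are strings, so any numeric field has to be parsed by the consumer. A minimal sketch, assuming examples arrive as a dict of `bytes` values (as is typical when iterating with `tfds.as_numpy`) and that `content-length` holds a decimal integer. `decode_example` is a hypothetical helper, not part of TFDS.

```python
# Feature names as listed in the feature structure above.
C4_FEATURES = ("content-length", "content-type", "text", "timestamp", "url")


def decode_example(example):
    """Convert one raw C4 example (bytes or str values) into Python types."""
    decoded = {}
    for key in C4_FEATURES:
        value = example[key]
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        decoded[key] = value
    # Assumption: content-length is a decimal integer stored as text.
    decoded["content-length"] = int(decoded["content-length"])
    return decoded
```

The `timestamp` field is left as a string here because its exact format is not stated on this page.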
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
c4/en (default config)
Config description: English C4 dataset.
Download size: 201.98 KiB
Dataset size: 806.87 GiB
Splits:
Split | Examples |
---|---|
'train' | 364,613,570 |
'validation' | 364,724 |
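The size and split figures above give a rough average record size, which is useful when budgeting disk space for a subset. This is a back-of-the-envelope sketch. It assumes the reported dataset size covers both splits and uses binary units (1 GiB = 1024^3 bytes). The function names are hypothetical.

```python
GIB = 1024 ** 3

# Figures for c4/en as reported above.
DATASET_SIZE_BYTES = 806.87 * GIB
NUM_EXAMPLES = 364_613_570 + 364_724  # train + validation


def average_example_bytes(total_bytes=DATASET_SIZE_BYTES, examples=NUM_EXAMPLES):
    """Mean stored bytes per example, including any serialization overhead."""
    return total_bytes / examples


def estimated_subset_gib(num_examples):
    """Rough storage estimate in GiB for a subset of c4/en."""
    return num_examples * average_example_bytes() / GIB


if __name__ == "__main__":
    print(f"average example size: {average_example_bytes():.0f} bytes")
    print(f"1M examples: about {estimated_subset_gib(1_000_000):.2f} GiB")
```

The estimate is only an average. Individual documents vary widely in length, so a small sample can deviate noticeably from it.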
c4/en.noclean
Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)
Download size: 177.11 KiB
Dataset size: 6.21 TiB
Splits:
Split | Examples |
---|---|
'train' | 1,063,805,169 |
'validation' | 1,065,028 |
c4/realnewslike
Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).
Download size: 340.29 KiB
Dataset size: 36.91 GiB
Splits:
Split | Examples |
---|---|
'train' | 13,804,817 |
'validation' | 13,855 |
c4/webtextlike
Config description: Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).
Download size: 2.04 MiB
Dataset size: 17.93 GiB
Splits:
Split | Examples |
---|---|
'train' | 4,488,694 |
'validation' | 4,486 |
c4/multilingual
Config description: Multilingual C4 (mC4) has 101 languages and is generated from 86 Common Crawl dumps.
Download size: 13.60 MiB
Dataset size: 38.49 TiB
Splits:
Split | Examples |
---|---|
'af' | 1,770,414 |
'af-validation' | 1,757 |
'am' | 291,570 |
'am-validation' | 289 |
'ar' | 92,455,378 |
'ar-validation' | 92,374 |
'az' | 7,179,300 |
'az-validation' | 7,206 |
'be' | 2,156,584 |
'be-validation' | 2,103 |
'bg' | 32,511,350 |
'bg-Latn' | 44,290 |
'bg-Latn-validation' | 41 |
'bg-validation' | 32,690 |
'bn' | 15,183,514 |
'bn-validation' | 15,130 |
'ca' | 19,438,615 |
'ca-validation' | 19,562 |
'ceb' | 415,208 |
'ceb-validation' | 430 |
'co' | 217,257 |
'co-validation' | 211 |
'cs' | 82,262,078 |
'cs-validation' | 82,594 |
'cy' | 1,066,595 |
'cy-validation' | 1,016 |
'da' | 36,884,558 |
'da-validation' | 37,071 |
'de' | 545,956,997 |
'de-validation' | 547,566 |
'el' | 68,577,376 |
'el-Latn' | 162,004 |
'el-Latn-validation' | 171 |
'el-validation' | 69,435 |
'en' | 3,928,733,379 |
'en-validation' | 3,933,379 |
'eo' | 560,151 |
'eo-validation' | 546 |
'es' | 591,272,119 |
'es-validation' | 592,258 |
'et' | 10,401,882 |
'et-validation' | 10,276 |
'eu' | 2,077,113 |
'eu-validation' | 2,077 |
'fa' | 81,252,911 |
'fa-validation' | 81,034 |
'fi' | 36,807,562 |
'fi-validation' | 36,512 |
'fil' | 2,331,209 |
'fil-validation' | 2,381 |
'fr' | 454,229,019 |
'fr-validation' | 453,124 |
'fy' | 502,656 |
'fy-validation' | 478 |
'ga' | 611,457 |
'ga-validation' | 631 |
'gd' | 201,237 |
'gd-validation' | 196 |
'gl' | 3,762,255 |
'gl-validation' | 3,811 |
'gu' | 1,292,191 |
'gu-validation' | 1,323 |
'ha' | 363,002 |
'ha-validation' | 368 |
'haw' | 103,043 |
'haw-validation' | 99 |
'hi' | 26,695,748 |
'hi-Latn' | 251,231 |
'hi-Latn-validation' | 261 |
'hi-validation' | 26,721 |
'hmn' | 157,016 |
'hmn-validation' | 175 |
'ht' | 232,354 |
'ht-validation' | 246 |
'hu' | 56,645,732 |
'hu-validation' | 56,905 |
'hy' | 3,873,029 |
'hy-validation' | 3,804 |
'id' | 19,423,746 |
'id-validation' | 19,601 |
'ig' | 110,582 |
'ig-validation' | 103 |
'is' | 3,139,312 |
'is-validation' | 3,210 |
'it' | 267,686,115 |
'it-validation' | 267,322 |
'iw' | 17,607,812 |
'iw-validation' | 17,570 |
'ja' | 85,226,039 |
'ja-Latn' | 235,885 |
'ja-Latn-validation' | 221 |
'ja-validation' | 85,618 |
'jv' | 218,969 |
'jv-validation' | 253 |
'ka' | 3,726,808 |
'ka-validation' | 3,752 |
'kk' | 3,421,165 |
'kk-validation' | 3,443 |
'km' | 1,384,128 |
'km-validation' | 1,359 |
'kn' | 1,916,445 |
'kn-validation' | 1,895 |
'ko' | 24,035,493 |
'ko-validation' | 24,240 |
'ku' | 399,027 |
'ku-validation' | 417 |
'ky' | 1,198,504 |
'ky-validation' | 1,188 |
'la' | 1,632,557 |
'la-validation' | 1,630 |
'lb' | 850,921 |
'lb-validation' | 856 |
'lo' | 302,612 |
'lo-validation' | 290 |
'lt' | 18,234,466 |
'lt-validation' | 18,428 |
'lv' | 9,882,376 |
'lv-validation' | 10,034 |
'mg' | 263,321 |
'mg-validation' | 254 |
'mi' | 148,146 |
'mi-validation' | 156 |
'mk' | 3,599,707 |
'mk-validation' | 3,713 |
'ml' | 3,604,562 |
'ml-validation' | 3,514 |
'mn' | 2,947,312 |
'mn-validation' | 3,021 |
'mr' | 4,555,599 |
'mr-validation' | 4,602 |
'ms' | 4,688,036 |
'ms-validation' | 4,719 |
'mt' | 1,109,191 |
'mt-validation' | 1,207 |
'my' | 1,248,242 |
'my-validation' | 1,314 |
'ne' | 4,679,412 |
'ne-validation' | 4,738 |
'nl' | 136,379,427 |
'nl-validation' | 137,142 |
'no' | 30,644,684 |
'no-validation' | 31,134 |
'ny' | 114,952 |
'ny-validation' | 121 |
'pa' | 729,394 |
'pa-validation' | 719 |
'pl' | 178,690,573 |
'pl-validation' | 178,481 |
'ps' | 497,321 |
'ps-validation' | 468 |
'pt' | 246,401,954 |
'pt-validation' | 246,120 |
'ro' | 66,499,899 |
'ro-validation' | 66,384 |
'ru' | 1,014,064,014 |
'ru-Latn' | 582,022 |
'ru-Latn-validation' | 616 |
'ru-validation' | 1,014,169 |
'sd' | 210,835 |
'sd-validation' | 206 |
'si' | 846,125 |
'si-validation' | 846 |
'sk' | 26,721,250 |
'sk-validation' | 26,882 |
'sl' | 12,381,886 |
'sl-validation' | 12,381 |
'sm' | 102,125 |
'sm-validation' | 108 |
'sn' | 124,984 |
'sn-validation' | 116 |
'so' | 1,168,106 |
'so-validation' | 1,212 |
'sq' | 7,023,573 |
'sq-validation' | 7,057 |
'sr' | 4,775,217 |
'sr-validation' | 4,804 |
'st' | 99,970 |
'st-validation' | 103 |
'su' | 153,302 |
'su-validation' | 151 |
'sv' | 63,308,307 |
'sv-validation' | 63,488 |
'sw' | 1,279,408 |
'sw-validation' | 1,296 |
'ta' | 5,769,533 |
'ta-validation' | 5,770 |
'te' | 2,034,828 |
'te-validation' | 2,010 |
'tg' | 1,563,304 |
'tg-validation' | 1,526 |
'th' | 28,021,205 |
'th-validation' | 28,062 |
'tr' | 132,662,955 |
'tr-validation' | 133,062 |
'uk' | 56,159,593 |
'uk-validation' | 56,321 |
'und' | 3,650,492,732 |
'und-validation' | 3,656,588 |
'ur' | 3,432,478 |
'ur-validation' | 3,443 |
'uz' | 1,183,603 |
'uz-validation' | 1,259 |
'vi' | 132,667,573 |
'vi-validation' | 132,915 |
'xh' | 122,232 |
'xh-validation' | 117 |
'yi' | 173,510 |
'yi-validation' | 166 |
'yo' | 86,686 |
'yo-validation' | 82 |
'zh' | 214,856,503 |
'zh-Latn' | 471,314 |
'zh-Latn-validation' | 492 |
'zh-validation' | 214,733 |
'zu' | 261,239 |
'zu-validation' | 253 |
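In the multilingual config, each language has a training split named by its code and a validation split with a `-validation` suffix. The sketch below builds that pair of names and, assuming the names in the table can be passed directly as `split` values to `tfds.load`, requests both splits for one language. `mc4_split_names` and `load_mc4_language` are hypothetical helpers.

```python
def mc4_split_names(language):
    """Return (train_split, validation_split) for an mC4 language code.

    Follows the naming shown in the splits table, e.g. 'af' and
    'af-validation', or 'bg-Latn' and 'bg-Latn-validation'.
    """
    return language, f"{language}-validation"


def load_mc4_language(language, data_dir=None):
    """Load both splits of one mC4 language. Requires tensorflow_datasets."""
    import tensorflow_datasets as tfds  # imported lazily on purpose

    train_split, validation_split = mc4_split_names(language)
    # Passing a list of split names returns one dataset per name.
    return tfds.load(
        "c4/multilingual",
        split=[train_split, validation_split],
        data_dir=data_dir,
    )
```

Loading one language at a time avoids touching the other splits, which matters for a config whose total size is tens of terabytes.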