- Description:
A colossal, cleaned version of Common Crawl's web crawl corpus.
Based on Common Crawl dataset: https://commoncrawl.org
To generate this dataset, please follow the instructions from t5.
Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service such as Cloud Dataflow. More info at https://www.tensorflow.org/datasets/beam_datasets
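As a rough illustration of the recommendation above, the sketch below builds Beam pipeline flags for a Dataflow run and passes them to the builder through `tfds.download.DownloadConfig`. The project, bucket, region, and job name are placeholders, and the helper names `dataflow_flags` and `prepare_c4` are hypothetical. The t5 instructions remain the reference for the full setup, including any worker dependencies.

```python
def dataflow_flags(project, bucket, region="us-central1", job_name="c4-gen"):
    """Build Beam pipeline flags for a Dataflow run.

    All argument values are placeholders chosen for illustration.
    """
    return [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--job_name={job_name}",
        f"--staging_location=gs://{bucket}/binaries",
        f"--temp_location=gs://{bucket}/temp",
    ]


def prepare_c4(config, data_dir, flags):
    """Generate one C4 config with Beam (assumes tfds and apache_beam are installed)."""
    # Imported lazily so dataflow_flags() can be used without these packages.
    import apache_beam as beam
    import tensorflow_datasets as tfds

    download_config = tfds.download.DownloadConfig(
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
    builder = tfds.builder(f"c4/{config}", data_dir=data_dir)
    builder.download_and_prepare(download_config=download_config)
```

A call might look like `prepare_c4("en", "gs://my-bucket/tensorflow_datasets", dataflow_flags("my-project", "my-bucket"))`, where every name is hypothetical.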
Homepage: https://github.com/google-research/text-to-text-transfer-transformer#datasets
Source code: tfds.text.C4
Versions:
- 2.2.0: No release notes.
- 2.2.1: No release notes.
- 2.3.0: No release notes.
- 2.3.1: No release notes.
- 3.0.1 (default): No release notes.
Manual download instructions: This dataset requires you to download the source data manually into download_config.manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/):
You are using a C4 config that requires some files to be manually downloaded. For c4/webtextlike, download OpenWebText.zip from https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ
Auto-cached (documentation): No
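The manual download step for c4/webtextlike can be checked before starting a long preparation job. The following is a minimal sketch: it assumes the builder looks for a file named OpenWebText.zip directly inside the manual directory, and the helper names `missing_manual_files` and `prepare_webtextlike` are hypothetical.

```python
from pathlib import Path

# Default manual directory named in the instructions above.
DEFAULT_MANUAL_DIR = Path("~/tensorflow_datasets/downloads/manual").expanduser()


def missing_manual_files(manual_dir=DEFAULT_MANUAL_DIR, required=("OpenWebText.zip",)):
    """Return the required file names that are not present in manual_dir."""
    manual_dir = Path(manual_dir)
    return [name for name in required if not (manual_dir / name).exists()]


def prepare_webtextlike(manual_dir=DEFAULT_MANUAL_DIR, data_dir=None):
    """Prepare c4/webtextlike once the manual file is in place (assumes tfds is installed)."""
    import tensorflow_datasets as tfds  # lazy import; not needed for the check above

    missing = missing_manual_files(manual_dir)
    if missing:
        raise FileNotFoundError(f"Download these files into {manual_dir}: {missing}")

    download_config = tfds.download.DownloadConfig(manual_dir=str(manual_dir))
    builder = tfds.builder("c4/webtextlike", data_dir=data_dir)
    builder.download_and_prepare(download_config=download_config)
```

In practice the same `DownloadConfig` would likely also carry Beam options, since the dataset is generated with a Beam pipeline.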
Feature structure:
FeaturesDict({
'content-length': Text(shape=(), dtype=tf.string),
'content-type': Text(shape=(), dtype=tf.string),
'text': Text(shape=(), dtype=tf.string),
'timestamp': Text(shape=(), dtype=tf.string),
'url': Text(shape=(), dtype=tf.string),
})
- Feature documentation:
Feature | Class | Shape | Dtype | Description |
---|---|---|---|---|
 | FeaturesDict | | | |
content-length | Text | | tf.string | |
content-type | Text | | tf.string | |
text | Text | | tf.string | |
timestamp | Text | | tf.string | |
url | Text | | tf.string | |
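All five features are scalar strings, so an example converted to NumPy (for instance with `tfds.as_numpy`) carries `bytes` values. The sketch below decodes such an example into Python strings. The helper name `decode_c4_example` and the sample values are made up for illustration.

```python
C4_FEATURES = ("content-length", "content-type", "text", "timestamp", "url")


def decode_c4_example(example):
    """Decode the bytes values of a NumPy-style C4 example into str."""
    decoded = {}
    for key in C4_FEATURES:
        value = example[key]
        # tf.string scalars arrive as bytes once converted to NumPy.
        decoded[key] = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return decoded


# Made-up example with the same keys as the feature structure above.
sample = {
    "content-length": b"1234",
    "content-type": b"text/plain",
    "text": b"Hello, C4.",
    "timestamp": b"2019-04-25T12:57:54Z",
    "url": b"https://example.com/",
}
print(decode_c4_example(sample)["text"])
```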
Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
c4/en (default config)
Config description: English C4 dataset.
Download size: 12.28 MiB
Dataset size: 806.92 GiB
Splits:
Split | Examples |
---|---|
'train' | 364,868,901 |
'validation' | 364,608 |
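Once the config has been prepared, the splits listed above can be read with `tfds.load`. This is a minimal sketch that assumes the data already exists under `data_dir`, hence `download=False`. The helper names are hypothetical, and the split sizes are copied from the table above.

```python
# Example counts copied from the c4/en splits table.
C4_EN_SPLITS = {"train": 364_868_901, "validation": 364_608}


def validation_fraction(splits=C4_EN_SPLITS):
    """Share of all examples that fall in the validation split."""
    return splits["validation"] / (splits["train"] + splits["validation"])


def load_c4(config="en", split="validation", data_dir=None):
    """Load an already-prepared C4 config (assumes tfds is installed)."""
    import tensorflow_datasets as tfds  # lazy import

    # download=False: generating C4 is a separate, expensive preparation step.
    return tfds.load(f"c4/{config}", split=split, data_dir=data_dir, download=False)
```

Given the size of the train split, starting with the validation split or a slice such as `split="train[:1%]"` is usually more practical for a first look.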
c4/en.noclean
Config description: Disables all cleaning (deduplication, removal based on bad words, etc.)
Download size: 12.25 MiB
Dataset size: 6.21 TiB
Splits:
Split | Examples |
---|---|
'train' | 1,063,805,324 |
'validation' | 1,065,029 |
c4/realnewslike
Config description: Filters from the default config to only include content from the domains used in the 'RealNews' dataset (Zellers et al., 2019).
Download size: 12.41 MiB
Dataset size: 36.89 GiB
Splits:
Split | Examples |
---|---|
'train' | 13,799,838 |
'validation' | 13,863 |
c4/webtextlike
Config description: Filters from the default config to only include content from the URLs in OpenWebText (https://github.com/jcpeterson/openwebtext).
Download size: 14.12 MiB
Dataset size: 18.00 GiB
Splits:
Split | Examples |
---|---|
'train' | 4,500,788 |
'validation' | 4,493 |
c4/multilingual
Config description: Multilingual C4 (mC4) has 101 languages and is generated from 71 Common Crawl dumps.
Download size: 22.74 MiB
Dataset size: 26.76 TiB
Splits:
Split | Examples |
---|---|
'af' | 2,152,243 |
'af-validation' | 2,118 |
'am' | 162,870 |
'am-validation' | 155 |
'ar' | 53,256,040 |
'ar-validation' | 52,978 |
'az' | 5,285,720 |
'az-validation' | 5,239 |
'be' | 1,742,030 |
'be-validation' | 1,712 |
'bg' | 23,409,799 |
'bg-Latn' | 162,461 |
'bg-Latn-validation' | 144 |
'bg-validation' | 23,503 |
'bn' | 7,444,098 |
'bn-validation' | 7,415 |
'ca' | 14,492,899 |
'ca-validation' | 14,489 |
'ceb' | 351,894 |
'ceb-validation' | 367 |
'co' | 494,913 |
'co-validation' | 565 |
'cs' | 60,149,680 |
'cs-validation' | 60,462 |
'cy' | 4,131,915 |
'cy-validation' | 4,103 |
'da' | 28,777,331 |
'da-validation' | 28,945 |
'de' | 397,006,993 |
'de-validation' | 398,583 |
'el' | 41,753,736 |
'el-Latn' | 449,943 |
'el-Latn-validation' | 468 |
'el-validation' | 42,358 |
'en' | 3,079,081,989 |
'en-validation' | 3,083,850 |
'eo' | 500,048 |
'eo-validation' | 496 |
'es' | 416,057,992 |
'es-validation' | 416,256 |
'et' | 6,941,360 |
'et-validation' | 6,848 |
'eu' | 1,555,887 |
'eu-validation' | 1,580 |
'fa' | 53,927,287 |
'fa-validation' | 53,685 |
'fi' | 26,842,650 |
'fi-validation' | 26,710 |
'fil' | 2,102,197 |
'fil-validation' | 2,158 |
'fr' | 332,674,575 |
'fr-validation' | 331,328 |
'fy' | 1,104,359 |
'fy-validation' | 1,094 |
'ga' | 465,670 |
'ga-validation' | 490 |
'gd' | 322,404 |
'gd-validation' | 338 |
'gl' | 4,549,465 |
'gl-validation' | 4,631 |
'gu' | 631,600 |
'gu-validation' | 651 |
'ha' | 247,479 |
'ha-validation' | 258 |
'haw' | 84,312 |
'haw-validation' | 86 |
'hi' | 18,507,273 |
'hi-Latn' | 626,154 |
'hi-Latn-validation' | 638 |
'hi-validation' | 18,392 |
'hmn' | 295,549 |
'hmn-validation' | 312 |
'ht' | 269,174 |
'ht-validation' | 281 |
'hu' | 36,819,508 |
'hu-validation' | 36,756 |
'hy' | 2,401,949 |
'hy-validation' | 2,410 |
'id' | 69,625,551 |
'id-validation' | 69,739 |
'ig' | 92,909 |
'ig-validation' | 87 |
'is' | 2,069,293 |
'is-validation' | 2,065 |
'it' | 186,404,508 |
'it-validation' | 186,030 |
'iw' | 12,334,609 |
'iw-validation' | 12,207 |
'ja' | 87,337,884 |
'ja-Latn' | 533,516 |
'ja-Latn-validation' | 506 |
'ja-validation' | 87,420 |
'jv' | 581,528 |
'jv-validation' | 609 |
'ka' | 2,295,551 |
'ka-validation' | 2,279 |
'kk' | 2,392,401 |
'kk-validation' | 2,400 |
'km' | 756,612 |
'km-validation' | 745 |
'kn' | 1,056,849 |
'kn-validation' | 1,039 |
'ko' | 15,602,947 |
'ko-validation' | 15,771 |
'ku' | 298,389 |
'ku-validation' | 298 |
'ky' | 995,539 |
'ky-validation' | 976 |
'la' | 1,674,463 |
'la-validation' | 1,654 |
'lb' | 2,740,336 |
'lb-validation' | 2,692 |
'lo' | 141,776 |
'lo-validation' | 145 |
'lt' | 11,274,295 |
'lt-validation' | 11,245 |
'lv' | 6,414,223 |
'lv-validation' | 6,598 |
'mg' | 345,040 |
'mg-validation' | 367 |
'mi' | 101,169 |
'mi-validation' | 106 |
'mk' | 2,058,417 |
'mk-validation' | 2,054 |
'ml' | 2,044,981 |
'ml-validation' | 2,002 |
'mn' | 2,054,674 |
'mn-validation' | 2,090 |
'mr' | 7,774,331 |
'mr-validation' | 7,928 |
'ms' | 13,180,647 |
'ms-validation' | 13,391 |
'mt' | 2,261,303 |
'mt-validation' | 2,322 |
'my' | 813,530 |
'my-validation' | 858 |
'ne' | 2,942,785 |
'ne-validation' | 2,951 |
'nl' | 96,210,458 |
'nl-validation' | 96,637 |
'no' | 25,402,139 |
'no-validation' | 25,766 |
'ny' | 174,696 |
'ny-validation' | 162 |
'pa' | 363,399 |
'pa-validation' | 346 |
'pl' | 126,164,277 |
'pl-validation' | 125,997 |
'ps' | 335,452 |
'ps-validation' | 318 |
'pt' | 169,239,084 |
'pt-validation' | 169,417 |
'ro' | 45,738,857 |
'ro-validation' | 45,512 |
'ru' | 755,585,265 |
'ru-Latn' | 745,491 |
'ru-Latn-validation' | 753 |
'ru-validation' | 756,418 |
'sd' | 743,057 |
'sd-validation' | 774 |
'si' | 534,759 |
'si-validation' | 509 |
'sk' | 17,729,698 |
'sk-validation' | 17,865 |
'sl' | 8,499,456 |
'sl-validation' | 8,504 |
'sm' | 98,467 |
'sm-validation' | 108 |
'sn' | 326,392 |
'sn-validation' | 306 |
'so' | 893,012 |
'so-validation' | 888 |
'sq' | 4,113,147 |
'sq-validation' | 4,086 |
'sr' | 3,398,483 |
'sr-validation' | 3,443 |
'st' | 66,837 |
'st-validation' | 88 |
'su' | 280,719 |
'su-validation' | 269 |
'sv' | 48,570,979 |
'sv-validation' | 48,633 |
'sw' | 985,654 |
'sw-validation' | 994 |
'ta' | 3,514,561 |
'ta-validation' | 3,510 |
'te' | 1,188,243 |
'te-validation' | 1,211 |
'tg' | 1,280,757 |
'tg-validation' | 1,259 |
'th' | 15,463,131 |
'th-validation' | 15,344 |
'tr' | 87,595,290 |
'tr-validation' | 87,596 |
'uk' | 38,556,465 |
'uk-validation' | 38,550 |
'und' | 1,866,266,695 |
'und-validation' | 1,867,450 |
'ur' | 1,950,124 |
'ur-validation' | 1,885 |
'uz' | 796,416 |
'uz-validation' | 847 |
'vi' | 78,587,159 |
'vi-validation' | 78,611 |
'xh' | 69,048 |
'xh-validation' | 62 |
'yi' | 143,708 |
'yi-validation' | 161 |
'yo' | 46,214 |
'yo-validation' | 42 |
'zh' | 54,542,308 |
'zh-Latn' | 373,664 |
'zh-Latn-validation' | 387 |
'zh-validation' | 54,656 |
'zu' | 555,458 |
'zu-validation' | 548 |
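The multilingual split names in the table above follow a simple pattern: the language code alone is the training split, and the validation split appends `-validation`. The helpers below are a small sketch for building and parsing those names. The function names are hypothetical and not part of TFDS.

```python
VALIDATION_SUFFIX = "-validation"


def mc4_split(language, validation=False):
    """Build a c4/multilingual split name, e.g. 'bg-Latn' or 'bg-Latn-validation'."""
    return f"{language}{VALIDATION_SUFFIX}" if validation else language


def parse_mc4_split(split):
    """Split a name from the table into (language, is_validation)."""
    if split.endswith(VALIDATION_SUFFIX):
        return split[: -len(VALIDATION_SUFFIX)], True
    return split, False
```

The resulting name can be passed as the `split` argument when loading c4/multilingual, for example `split=mc4_split("af", validation=True)`.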