TFDS now supports the Croissant 🥐 format! Read the documentation to know more.

dolma

Description:

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Homepage: https://github.com/allenai/dolma
Source code: tfds.datasets.dolma.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 9.61 TiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,403,336,408

Feature structure:

FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
added	Text	string
created	Text	string
id	Text	string
source	Text	string
text	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):

Citation:

@article{dolma,
  title = { {Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research} },
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
  year = {2024},
  journal={arXiv preprint},
}

dolma Stay organized with collections Save and categorize content based on your preferences.

dolma