dolma
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
| Split | Examples |
|-----------|---------------|
| 'train' | 3,403,336,408 |
FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})
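The feature structure above describes each example as a flat dictionary of five scalar string features. A minimal sketch of reading such records in Python follows. It assumes the dataset has already been prepared locally (it is large, so loading it is not a casual operation) and that string features arrive as UTF-8 bytes when iterating with `tfds.as_numpy`. The helper names `decode_example` and `iter_dolma` are illustrative and not part of TFDS.

```python
from typing import Any, Dict, Iterator, Optional

# The five scalar string features listed in the feature structure.
DOLMA_FEATURES = ("added", "created", "id", "source", "text")


def decode_example(example: Dict[str, Any]) -> Dict[str, str]:
    """Convert one record into a dict of plain Python strings.

    Values that are bytes are decoded as UTF-8; str values pass through.
    """
    decoded = {}
    for name in DOLMA_FEATURES:
        value = example[name]
        if isinstance(value, bytes):
            value = value.decode("utf-8", errors="replace")
        decoded[name] = str(value)
    return decoded


def iter_dolma(data_dir: Optional[str] = None) -> Iterator[Dict[str, str]]:
    """Yield decoded records from the 'train' split (hypothetical helper).

    tensorflow_datasets is imported lazily so that decode_example can be
    used and tested without TFDS installed.
    """
    import tensorflow_datasets as tfds

    ds = tfds.load("dolma", split="train", data_dir=data_dir)
    for example in tfds.as_numpy(ds):
        yield decode_example(example)
```

Keeping the decoding step separate from the loading step makes it possible to check the record handling on hand-written dictionaries before touching the full dataset.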
| Feature | Class | Shape | Dtype | Description |
|---------|--------------|-------|--------|-------------|
| | FeaturesDict | | | |
| added | Text | | string | |
| created | Text | | string | |
| id | Text | | string | |
| source | Text | | string | |
| text | Text | | string | |
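Because every record carries a `source` string, one simple way to inspect the data is to tally a bounded sample of records by that field. The sketch below works on any iterable of dictionaries with a `source` key. The function name `count_by_source` and the sample source values are made up for illustration and do not reflect the real contents of the dataset.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable


def count_by_source(records: Iterable[Dict[str, str]], limit: int = 1000) -> Counter:
    """Count how many of the first `limit` records come from each source."""
    counts: Counter = Counter()
    for record in islice(records, limit):
        counts[record["source"]] += 1
    return counts


# Hand-written sample records; the source names here are placeholders.
sample = [
    {"id": "1", "source": "source-a", "text": "first"},
    {"id": "2", "source": "source-b", "text": "second"},
    {"id": "3", "source": "source-a", "text": "third"},
]
print(count_by_source(sample))
```

The `limit` argument matters for a split with billions of examples: without it, the tally would iterate over the entire split.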
@article{dolma,
title = { {Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research} },
author = {
Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
year = {2024},
journal={arXiv preprint},
}
Last updated 2025-03-14 UTC.
Homepage: https://github.com/allenai/dolma
Source code: tfds.datasets.dolma.Builder (https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/datasets/dolma/dolma_dataset_builder.py)
Versions: 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 9.61 TiB
Auto-cached (documentation: https://www.tensorflow.org/datasets/performances#auto-caching): No
Supervised keys (see as_supervised doc: https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args): None
Figure (tfds.show_examples: https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples): Not supported.