dolma

Description:

Dolma: an open corpus of three trillion tokens for language model pretraining research

Homepage: https://github.com/allenai/dolma
Source code: tfds.datasets.dolma.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 9.61 TiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,403,336,408
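
To put the figures above in proportion, the short sketch below divides the reported dataset size by the number of `'train'` examples. It only uses the two numbers listed on this page; the helper name `mean_bytes_per_example` is hypothetical, and the result is an on-disk average that says nothing about how sizes vary between individual documents.

```python
# Figures copied from this page.
DATASET_SIZE_TIB = 9.61
TRAIN_EXAMPLES = 3_403_336_408

BYTES_PER_TIB = 2 ** 40  # 1 TiB = 2^40 bytes


def mean_bytes_per_example(size_tib: float, num_examples: int) -> float:
    """Return the average stored size of one example, in bytes."""
    return size_tib * BYTES_PER_TIB / num_examples


avg = mean_bytes_per_example(DATASET_SIZE_TIB, TRAIN_EXAMPLES)
print(f"about {avg:,.0f} bytes per example on average")
```

The average comes out at roughly 3 KiB per example, which is a useful rough guide when planning storage or estimating how long a partial pass over the split might take.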

Feature structure:

FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})
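
As an illustration of the structure above, here is a minimal sketch of reading a few examples and turning them into plain Python strings. It assumes the dataset has already been prepared under a local `data_dir` (at 9.61 TiB it is not auto-cached), and that `tfds.as_numpy` yields each string feature as `bytes`. The helper names `decode_example` and `iter_dolma` are hypothetical and not part of the TFDS API.

```python
from typing import Dict, Iterator, Mapping

# Feature names taken from the FeaturesDict shown above.
FEATURE_KEYS = ("added", "created", "id", "source", "text")


def decode_example(example: Mapping[str, object]) -> Dict[str, str]:
    """Convert one raw example into a dict of Python strings."""
    missing = [key for key in FEATURE_KEYS if key not in example]
    if missing:
        raise KeyError(f"example is missing features: {missing}")
    decoded = {}
    for key in FEATURE_KEYS:
        value = example[key]
        # String features typically arrive as UTF-8 encoded bytes.
        decoded[key] = value.decode("utf-8") if isinstance(value, bytes) else str(value)
    return decoded


def iter_dolma(data_dir: str, limit: int = 3) -> Iterator[Dict[str, str]]:
    """Yield up to `limit` decoded examples from the 'train' split."""
    # Imported lazily so decode_example stays usable without TFDS installed.
    import tensorflow_datasets as tfds

    # download=False: assumes the data was prepared in data_dir beforehand.
    ds = tfds.load("dolma", split="train", data_dir=data_dir, download=False)
    for raw in tfds.as_numpy(ds.take(limit)):
        yield decode_example(raw)


# Offline check with a made-up record (all values are placeholders).
sample = {key: b"placeholder" for key in FEATURE_KEYS}
sample["text"] = b"hello"
print(decode_example(sample)["text"])
```

Calling `iter_dolma("/path/to/tfds_data")` would then yield dictionaries with the five keys listed above; the path is a placeholder for wherever the prepared data lives.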

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
added	Text	string
created	Text	string
id	Text	string
source	Text	string
text	Text	string

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):

Citation:

@article{dolma,
  title = { {Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research} },
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
  year = {2024},
  journal={arXiv preprint},
}
