dolma

Description:

Dolma: an open corpus of three trillion tokens for language model pretraining research

Homepage: https://github.com/allenai/dolma
Source code: tfds.datasets.dolma.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: Unknown size
Dataset size: 9.61 TiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,403,336,408
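To put the split size in perspective, the figures on this page can be combined into a rough average size per example. This is only a back-of-the-envelope sketch: it assumes the 9.61 TiB dataset size covers the `train` split alone and that TiB means 2^40 bytes. The result reflects all five features plus any serialization overhead, not just the `text` field. The function name is an illustration, not part of TFDS.

```python
# Figures copied from this page.
DATASET_SIZE_TIB = 9.61
TRAIN_EXAMPLES = 3_403_336_408

# Assumption: 1 TiB = 2**40 bytes (binary prefix).
BYTES_PER_TIB = 2 ** 40


def average_bytes_per_example(size_tib: float, num_examples: int) -> float:
    """Return the mean stored size of one example, in bytes."""
    return size_tib * BYTES_PER_TIB / num_examples


avg = average_bytes_per_example(DATASET_SIZE_TIB, TRAIN_EXAMPLES)
print(f"about {avg:.0f} bytes per example on average")
```

The estimate comes out at a few kilobytes per example, which is useful when sizing shuffle buffers or estimating how much disk a percentage slice of the split would need.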

Feature structure:

FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})
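The structure above means every example is a flat dictionary of five scalar strings. A minimal sketch of reading it, assuming `tensorflow_datasets` is installed and that the dataset is registered under the name `dolma` (as the builder path on this page suggests): `decode_example` and `iter_dolma` are hypothetical helper names, and the `train[:1%]` slice is just one illustrative choice. Note that `tfds.load` prepares the dataset on first use by default, which for a 9.61 TiB dataset needs substantial disk space and time.

```python
from typing import Dict, Iterator, Optional

# Feature names taken from the FeaturesDict on this page.
FEATURE_KEYS = ("added", "created", "id", "source", "text")


def decode_example(example: Dict[str, bytes]) -> Dict[str, str]:
    """Turn the byte strings of one example into Python str values."""
    missing = [key for key in FEATURE_KEYS if key not in example]
    if missing:
        raise KeyError(f"example is missing features: {missing}")
    return {key: example[key].decode("utf-8") for key in FEATURE_KEYS}


def iter_dolma(
    split: str = "train[:1%]", data_dir: Optional[str] = None
) -> Iterator[Dict[str, str]]:
    """Yield decoded examples (hypothetical helper, not part of TFDS)."""
    # Imported lazily so the helpers above work without TensorFlow.
    import tensorflow_datasets as tfds

    ds = tfds.load("dolma", split=split, data_dir=data_dir)
    # tfds.as_numpy yields scalar string features as bytes objects.
    for example in tfds.as_numpy(ds):
        yield decode_example(example)
```

With a prepared copy of the dataset available, `next(iter_dolma())` would return one dictionary with the keys `added`, `created`, `id`, `source`, and `text`.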

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
added	Text	string
created	Text	string
id	Text	string
source	Text	string
text	Text	string

Supervised keys (see the as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):

Citation:

@article{dolma,
  title = { {Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research} },
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
  year = {2024},
  journal={arXiv preprint},
}
