dolma

Description:

Dolma: an open corpus of three trillion tokens for language model pretraining research

Homepage: https://github.com/allenai/dolma
Source code: tfds.datasets.dolma.Builder
Versions:
- 1.0.0 (default): Initial release.
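
Because only one version is listed, pinning it explicitly keeps a pipeline reproducible if later versions appear. The following is a minimal sketch, assuming `tensorflow_datasets` is installed and that the builder is registered under the name `dolma` (inferred from the builder path on this page). The helper `load_dolma_train` and its `data_dir` parameter are hypothetical names introduced for illustration.

```python
DATASET_NAME = "dolma"
DATASET_VERSION = "1.0.0"  # the only version listed on this page


def load_dolma_train(data_dir=None):
    """Return the 'train' split as a tf.data.Dataset, pinned to one version."""
    # Imported lazily so that defining this helper does not require TensorFlow.
    import tensorflow_datasets as tfds

    return tfds.load(
        f"{DATASET_NAME}:{DATASET_VERSION}",  # "name:version" pins the version
        split="train",
        data_dir=data_dir,  # assumption: a location with several TiB of free space
    )
```

Calling `load_dolma_train()` triggers dataset preparation on first use, so with a corpus of this size it is worth pointing `data_dir` at suitably large storage before running it.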
Download size: Unknown size
Dataset size: 9.61 TiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'train'`	3,403,336,408
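
For capacity planning, the dataset size and the example count on this page give a rough average footprint per example. This is a back-of-the-envelope sketch only: it assumes the reported size covers the single `train` split and it ignores any storage overhead, so the result (on the order of 3 KiB) is an estimate, not a measured document length.

```python
TIB = 2**40  # bytes in one tebibyte

dataset_size_bytes = 9.61 * TIB      # "Dataset size" from this page
num_train_examples = 3_403_336_408   # 'train' split example count from this page

# Average on-disk footprint per example, all five features included.
avg_bytes_per_example = dataset_size_bytes / num_train_examples
print(f"{avg_bytes_per_example:.0f} bytes per example on average")
```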

Feature structure:

FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})
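
Since every feature is a scalar string, a decoded example can be treated as a flat dictionary with five keys. The sketch below shows one way to sanity-check a record against that structure; `is_valid_record` is a hypothetical helper, and the sample values are placeholders invented for illustration, not real Dolma data. It accepts `bytes` as well as `str` on the assumption that examples may arrive as raw byte strings.

```python
# Field names taken from the FeaturesDict above; every feature is a scalar string.
DOLMA_FIELDS = ("added", "created", "id", "source", "text")


def is_valid_record(record):
    """Check that a decoded example has exactly the five string fields."""
    if set(record) != set(DOLMA_FIELDS):
        return False
    # Accept both str and bytes, since string features may be delivered as bytes.
    return all(isinstance(record[name], (str, bytes)) for name in DOLMA_FIELDS)


# Hypothetical record with placeholder values, for illustration only.
sample = {
    "added": "placeholder-added",
    "created": "placeholder-created",
    "id": "placeholder-id",
    "source": "placeholder-source",
    "text": b"Placeholder document text.",
}

print(is_valid_record(sample))
print(is_valid_record({"text": "missing the other four fields"}))
```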

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
added	Text	string
created	Text	string
id	Text	string
source	Text	string
text	Text	string

Supervised keys (see the as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Examples (tfds.as_dataframe):

Citation:

@article{dolma,
  title = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
  year = {2024},
  journal={arXiv preprint},
}
