dolma

Opis :

Dolma: otwarty korpus trzech bilionów tokenów do badań przed szkoleniem modelu językowego

Strona główna : https://github.com/allenai/dolma
Kod źródłowy : tfds.datasets.dolma.Builder
Wersje :
- 1.0.0 (domyślnie): Wersja pierwsza.
Rozmiar pobierania : Unknown size
Rozmiar zbioru danych : 9.61 TiB
Automatyczne buforowanie ( dokumentacja ): Nie
Podziały :

Podział	Przykłady
`'train'`	3 403 336 408

Struktura funkcji :

FeaturesDict({
    'added': Text(shape=(), dtype=string),
    'created': Text(shape=(), dtype=string),
    'id': Text(shape=(), dtype=string),
    'source': Text(shape=(), dtype=string),
    'text': Text(shape=(), dtype=string),
})

Dokumentacja funkcji :

Funkcja	Klasa	Typ D
	FunkcjeDykt
w dodatku	Tekst	smyczkowy
stworzony	Tekst	smyczkowy
id	Tekst	smyczkowy
źródło	Tekst	smyczkowy
tekst	Tekst	smyczkowy

Klucze nadzorowane (zobacz dokument as_supervised ): None
Rysunek ( tfds.show_examples ): Nieobsługiwany.
Przykłady ( tfds.as_dataframe ):

Cytat :

@article{dolma,
  title = { {Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research} },
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Ian Magnusson and
    Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and
    Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and
    Oyvind Tafjord and Evan Pete Walsh and Hannaneh Hajishirzi and Noah A. Smith and Luke Zettlemoyer and
    Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
},
  year = {2024},
  journal={arXiv preprint},
}

dolma Zadbaj o dobrą organizację dzięki kolekcji Zapisuj i kategoryzuj treści zgodnie ze swoimi preferencjami.

dolma