• Description:

Pre-trained Global Vectors for Word Representation (GloVe) embeddings for approximate nearest neighbor search. This dataset consists of two splits:

  1. 'database': 1,183,514 data points, each with the features 'embedding' (100 floats), 'index' (int64), and 'neighbors' (an empty list).
  2. 'test': 10,000 data points, each with the features 'embedding' (100 floats), 'index' (int64), and 'neighbors' (a list of the 'index' and 'distance' of the nearest neighbors in the database).
  • Splits:
    Split       Examples
    'database'  1,183,514
    'test'      10,000
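
To make the record layout concrete, the following sketch builds a small synthetic stand-in with NumPy and fills in 'neighbors' for a test record by brute force. It does not load the real data. The cosine distance, the helper name `brute_force_neighbors`, the value of `k`, and the toy array sizes are all assumptions made for illustration; the description above does not state which metric was used or how many neighbors are stored per test point.

```python
import numpy as np

def brute_force_neighbors(database, queries, k=10):
    """Return (indices, distances) of the k nearest database rows per query.

    Assumes cosine distance (1 - cosine similarity); the metric actually used
    to build the dataset is not stated in the description.
    """
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - q @ db.T                    # shape (n_queries, n_database)
    idx = np.argsort(dist, axis=1)[:, :k]    # k smallest distances per query
    return idx, np.take_along_axis(dist, idx, axis=1)

# Toy stand-ins for the two splits (the real splits are far larger).
rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 100)).astype(np.float32)
queries = rng.standard_normal((5, 100)).astype(np.float32)

idx, dist = brute_force_neighbors(database, queries, k=10)

# One record shaped like a 'test' example, with neighbors kept as parallel arrays.
record = {
    "embedding": queries[0],
    "index": np.int64(0),
    "neighbors": {
        "index": idx[0].astype(np.int64),
        "distance": dist[0].astype(np.float32),
    },
}
```

Exhaustive search like this is only practical for small inputs; it is shown here because it defines the exact answer that an approximate method is measured against.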
  • Feature structure:
    FeaturesDict({
        'embedding': Tensor(shape=(100,), dtype=float32),
        'index': Scalar(shape=(), dtype=int64),
        'neighbors': Sequence({
            'distance': Scalar(shape=(), dtype=float32),
            'index': Scalar(shape=(), dtype=int64),
        }),
    })
  • Feature documentation:
    Feature             Class     Shape   Dtype    Description
    embedding           Tensor    (100,)  float32
    index               Scalar            int64    Index within the split.
    neighbors           Sequence                   The computed neighbors, which are only available for the test split.
    neighbors/distance  Scalar            float32  Neighbor distance.
    neighbors/index     Scalar            int64    Neighbor index.
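
The 'neighbors/index' values of the test split can serve as ground truth when scoring an approximate nearest neighbor method. A minimal sketch of a recall@k computation follows. The helper `recall_at_k` and the toy index arrays are hypothetical, and the sketch assumes each stored neighbor list is ordered nearest first, which the documentation above does not state.

```python
import numpy as np

def recall_at_k(true_neighbors, predicted, k=10):
    """Mean fraction of the first k true neighbor indices found in the first k predictions."""
    hits = 0
    for truth, pred in zip(true_neighbors, predicted):
        hits += len(set(truth[:k].tolist()) & set(pred[:k].tolist()))
    return hits / (k * len(true_neighbors))

# Toy example: two queries, three neighbors each.
true_neighbors = np.array([[3, 7, 1], [4, 0, 9]], dtype=np.int64)  # ground truth
predicted = np.array([[7, 3, 2], [4, 9, 0]], dtype=np.int64)       # approximate result
score = recall_at_k(true_neighbors, predicted, k=3)
```

Order within the top k does not affect the score; only membership is counted.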
  • Citation:
  author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  title = {GloVe: Global Vectors for Word Representation},
  year = {2014},
  pages = {1532--1543},
  url = {},