• Description:

Pre-trained embeddings for approximate nearest neighbor search using the cosine distance. This dataset consists of two splits:

  1. 'database': 9,990,000 data points, each with the features 'embedding' (96 floats), 'index' (int64), and 'neighbors' (empty list).
  2. 'test': 10,000 data points, each with the features 'embedding' (96 floats), 'index' (int64), and 'neighbors' (list of the 'index' and 'distance' of the nearest neighbors in the database).
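
To illustrate what the 'neighbors' feature of the 'test' split contains, the following is a minimal brute-force sketch on random stand-in data. It assumes that cosine distance means 1 minus cosine similarity; the function name `cosine_neighbors`, the parameter `k`, and the toy array sizes are hypothetical and not part of the dataset.

```python
import numpy as np


def cosine_neighbors(database, queries, k=3):
    """Brute-force k nearest neighbors under cosine distance.

    Assumption: cosine distance = 1 - cosine similarity.
    Returns, per query, a list of {'index', 'distance'} dicts,
    mirroring the layout of the 'neighbors' feature.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - q @ db.T  # shape: (num_queries, num_database)
    # Indices of the k smallest distances per query, in ascending order.
    nearest = np.argsort(dist, axis=1)[:, :k]
    return [
        [{"index": int(c), "distance": float(dist[row, c])} for c in cols]
        for row, cols in enumerate(nearest)
    ]


# Toy stand-in data: 1,000 database points and 5 queries, 96 floats each.
rng = np.random.default_rng(0)
database = rng.standard_normal((1000, 96)).astype(np.float32)
queries = rng.standard_normal((5, 96)).astype(np.float32)
neighbors = cosine_neighbors(database, queries, k=3)
```

An exhaustive scan like this is quadratic in memory and time, so it is only practical on small samples; at the scale of the 'database' split, approximate methods are the usual choice.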
  • Splits:
    Split        Examples
    'database'   9,990,000
    'test'       10,000
  • Feature structure:
    FeaturesDict({
        'embedding': Tensor(shape=(96,), dtype=float32),
        'index': Scalar(shape=(), dtype=int64),
        'neighbors': Sequence({
            'distance': Scalar(shape=(), dtype=float32),
            'index': Scalar(shape=(), dtype=int64),
        }),
    })
  • Feature documentation:
    Feature             Class     Shape  Dtype    Description
    embedding           Tensor    (96,)  float32
    index               Scalar           int64    Index within the split.
    neighbors           Sequence                  The computed neighbors, which are only available for the test split.
    neighbors/distance  Scalar           float32  Neighbor distance.
    neighbors/index     Scalar           int64    Neighbor index.
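
One plausible use of the precomputed 'neighbors' is as ground truth when scoring an approximate search method. The sketch below computes recall at k on hand-written toy records. The function name `recall_at_k` and the toy values are hypothetical, and sorting by 'distance' is a defensive step because the ordering of the stored neighbor list is not stated here.

```python
def recall_at_k(true_neighbors, predicted_indices, k):
    """Fraction of the true k nearest neighbors that were retrieved.

    true_neighbors: per query, a list of {'index', 'distance'} dicts.
    predicted_indices: per query, a list of database indices returned
        by the (hypothetical) approximate search being evaluated.
    """
    hits = 0
    for truth, predicted in zip(true_neighbors, predicted_indices):
        # Sort defensively so the first k entries are the closest ones.
        closest = sorted(truth, key=lambda n: n["distance"])[:k]
        true_ids = {n["index"] for n in closest}
        hits += len(true_ids & set(predicted[:k]))
    return hits / (k * len(true_neighbors))


# Toy ground truth for two queries, in the 'neighbors' layout.
truth = [
    [{"index": 1, "distance": 0.1}, {"index": 2, "distance": 0.2}],
    [{"index": 3, "distance": 0.05}, {"index": 4, "distance": 0.3}],
]
# Toy search results: the first query misses index 2, the second finds both.
predicted = [[1, 5], [4, 3]]
recall = recall_at_k(truth, predicted, k=2)  # 3 hits out of 4
```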
  • Citation:
  title={Efficient indexing of billion-scale datasets of deep descriptors},
  author={Babenko, Artem and Lempitsky, Victor},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},