
sift1m

Description:

Pre-trained embeddings for approximate nearest neighbor search using the Euclidean distance. This dataset consists of two splits:

'database': 1,000,000 data points, each with the features 'embedding' (128 floats), 'index' (int64), and 'neighbors' (an empty list).
'test': 10,000 data points, each with the features 'embedding' (128 floats), 'index' (int64), and 'neighbors' (a list of the 'index' and 'distance' of the nearest neighbors in the database).
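
To illustrate what the 'neighbors' feature of a test point represents, here is a minimal sketch of exact Euclidean nearest-neighbor search using NumPy. It runs on small synthetic arrays standing in for the two splits; the function name `brute_force_neighbors` and the array sizes are illustrative assumptions, and this is not necessarily how the ground truth shipped with the dataset was produced. The sketch returns true Euclidean distances; whether the dataset stores Euclidean or squared distances is not stated here.

```python
import numpy as np

def brute_force_neighbors(database, queries, k):
    """Return (indices, distances) of the k nearest database rows for each query."""
    # Work in float64 so the expanded distance formula stays numerically stable.
    database = np.asarray(database, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    # Squared Euclidean distance: ||q||^2 - 2 q.d + ||d||^2
    q_sq = np.sum(queries ** 2, axis=1, keepdims=True)    # shape (Q, 1)
    d_sq = np.sum(database ** 2, axis=1)                   # shape (N,)
    sq_dists = q_sq - 2.0 * (queries @ database.T) + d_sq  # shape (Q, N)
    sq_dists = np.maximum(sq_dists, 0.0)  # clamp tiny negatives from rounding
    # Indices of the k smallest distances per query, nearest first.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    dists = np.sqrt(np.take_along_axis(sq_dists, order, axis=1))
    return order, dists

# Synthetic stand-ins (hypothetical sizes) for the 'database' and 'test' embeddings.
rng = np.random.default_rng(0)
database = rng.random((1000, 128), dtype=np.float32)
queries = rng.random((5, 128), dtype=np.float32)

indices, distances = brute_force_neighbors(database, queries, k=10)
```

An exhaustive scan like this costs O(Q·N·128) operations, which is why approximate methods are usually benchmarked against precomputed ground truth rather than recomputing it.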

Homepage: http://corpus-texmex.irisa.fr/
Source code: tfds.datasets.sift1m.Builder
Versions:
- 1.0.0 (default): Initial release.
Download size: 500.80 MiB
Dataset size: 589.49 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'database'`	1,000,000
`'test'`	10,000

Feature structure:

FeaturesDict({
    'embedding': Tensor(shape=(128,), dtype=float32),
    'index': Scalar(shape=(), dtype=int64, description=Index within the split.),
    'neighbors': Sequence({
        'distance': Scalar(shape=(), dtype=float32, description=Neighbor distance.),
        'index': Scalar(shape=(), dtype=int64, description=Neighbor index.),
    }),
})
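
As a rough sketch of how this structure might be consumed in Python: a `Sequence` of a feature dictionary is exposed as a dictionary of arrays, so `example['neighbors']['index']` and `example['neighbors']['distance']` are parallel arrays. The helper names `unpack_example` and `iter_split` below are hypothetical, and the hand-built `fake_example` only imitates the shapes and dtypes listed above; it is not real data.

```python
import numpy as np

def unpack_example(example):
    """Split one example dict into an embedding, an index, and neighbor arrays."""
    embedding = np.asarray(example["embedding"], dtype=np.float32)
    index = int(example["index"])
    neighbors = example["neighbors"]
    neighbor_index = np.asarray(neighbors["index"], dtype=np.int64)
    neighbor_distance = np.asarray(neighbors["distance"], dtype=np.float32)
    return embedding, index, neighbor_index, neighbor_distance

def iter_split(split):
    """Yield unpacked examples from a sift1m split ('database' or 'test')."""
    # Imported here so the rest of the sketch runs without TFDS installed.
    import tensorflow_datasets as tfds
    # tfds.load downloads and prepares the dataset on first use.
    for example in tfds.as_numpy(tfds.load("sift1m", split=split)):
        yield unpack_example(example)

# Hand-built stand-in with the same layout as one 'test' example (values are made up).
fake_example = {
    "embedding": np.zeros(128, dtype=np.float32),
    "index": np.int64(7),
    "neighbors": {
        "index": np.array([3, 42], dtype=np.int64),
        "distance": np.array([1.5, 2.25], dtype=np.float32),
    },
}
embedding, index, neighbor_index, neighbor_distance = unpack_example(fake_example)
```

For the 'database' split the two neighbor arrays would simply be empty, so code that iterates over them needs no special case.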

Feature documentation:

Feature	Class	Shape	Dtype	Description
	FeaturesDict
embedding	Tensor	(128,)	float32
index	Scalar		int64	Index within the split.
neighbors	Sequence			The computed neighbors, which are only available for the test split.
neighbors/distance	Scalar		float32	Neighbor distance.
neighbors/index	Scalar		int64	Neighbor index.
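
One common use of the stored test-split neighbors is to score an approximate search method by recall. The sketch below is an illustration only: `recall_at_k` is a hypothetical helper, the toy arrays are made up, and it assumes each row lists neighbor indices nearest first with at least `k` entries, which is not stated above.

```python
import numpy as np

def recall_at_k(true_neighbors, predicted, k):
    """Average fraction of the k true neighbors found among the first k predictions."""
    hits = 0
    for truth, pred in zip(true_neighbors, predicted):
        # Count how many of the top-k ground-truth indices were retrieved.
        hits += len(set(truth[:k].tolist()) & set(pred[:k].tolist()))
    return hits / (k * len(true_neighbors))

# Toy ground truth and predictions for two queries (made-up indices).
true_neighbors = np.array([[1, 2, 3], [4, 5, 6]])
predicted = np.array([[1, 3, 9], [6, 5, 4]])

recall_3 = recall_at_k(true_neighbors, predicted, k=3)
recall_1 = recall_at_k(true_neighbors, predicted, k=1)
```

Here the first query retrieves two of its three true neighbors and the second retrieves all three, so `recall_3` is 5/6, while only the first query gets its single nearest neighbor right, so `recall_1` is 0.5.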

Supervised keys (See as_supervised doc): None
Figure (tfds.show_examples): Not supported.

Citation:

@article{jegou2010product,
  title={Product quantization for nearest neighbor search},
  author={Jegou, Herve and Douze, Matthijs and Schmid, Cordelia},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={33},
  number={1},
  pages={117--128},
  year={2010},
  publisher={IEEE}
}
