Semantic Search with Approximate Nearest Neighbors and Text Embeddings

This tutorial illustrates how to generate embeddings from a TensorFlow Hub (TF-Hub) module given input data, and how to build an approximate nearest neighbors (ANN) index from the extracted embeddings. The index can then be used for real-time similarity matching and retrieval.

When dealing with a large corpus of data, it is not efficient to perform exact matching by scanning the whole repository to find the items most similar to a given query in real time. Instead, we use an approximate similarity matching algorithm, which trades a little accuracy in finding the exact nearest neighbors for a significant boost in speed.
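
To make the cost of exact matching concrete, here is a minimal sketch of a brute-force search in plain Python. The function names, the toy two-dimensional vectors, and the choice of squared Euclidean distance are illustrative assumptions, not part of this tutorial's pipeline. Every query touches every stored vector, so the work grows linearly with the size of the corpus.

```python
import heapq

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def exact_nearest(query, vectors, k=2):
    """Return the ids of the k vectors closest to `query`.

    This scans the whole collection: O(N * d) work per query.
    """
    scored = ((squared_distance(query, vec), item_id)
              for item_id, vec in enumerate(vectors))
    return [item_id for _, item_id in heapq.nsmallest(k, scored)]

# Hypothetical 2-dimensional "embeddings", for illustration only.
corpus_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
nearest_ids = exact_nearest([1.0, 0.05], corpus_vectors, k=2)
print(nearest_ids)
```

An ANN index answers the same question while inspecting only a small part of the collection, at the price of occasionally missing a true neighbor.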

In this tutorial, we show an example of real-time text search over a corpus of news headlines to find the headlines that are most similar to a query. Unlike keyword search, this captures the semantic similarity encoded in the text embedding.
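
As a rough illustration of what keyword search misses, the sketch below scores two pairs of headlines by word overlap (Jaccard similarity). The `keyword_overlap` function and the first pair of headlines are made up for this example; the second pair is taken from the sample data. Two headlines with similar meaning but no shared words get a score of zero, which is the gap that embedding-based matching is meant to close.

```python
def keyword_overlap(a, b):
    """Jaccard overlap between the sets of lowercase tokens in two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Two hypothetical headlines with similar meaning but no shared words.
paraphrase_score = keyword_overlap("global warming accelerates",
                                   "climate change speeds up")
# Two headlines from the sample data that share several words.
shared_score = keyword_overlap("air nz strike to affect australian travellers",
                               "air nz staff in aust strike for pay rise")
print(paraphrase_score, shared_score)
```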

The steps of this tutorial are:

  1. Download sample data.
  2. Generate embeddings for the data using a TF-Hub module.
  3. Build an ANN index for the embeddings.
  4. Use the index for similarity matching.

We use Apache Beam with TensorFlow Transform (TF-Transform) to generate the embeddings from the TF-Hub module. We also use Spotify's ANNOY library to build the approximate nearest neighbors index. You can find benchmarks of ANN frameworks in this GitHub repository.

Note: This tutorial uses the TensorFlow 1.x API (imported as tensorflow.compat.v1) and works only with TF1 Hub modules from TF-Hub. See the updated TF2 version of this tutorial.


Install the required libraries.

pip install -q apache_beam
pip install -q sklearn
pip install -q annoy

Import the required libraries

import os
import sys
import pathlib
import pickle
from collections import namedtuple
from datetime import datetime

import numpy as np
import apache_beam as beam
import annoy
from sklearn.random_projection import gaussian_random_matrix

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
# TFT needs to be installed afterwards
!pip install -q tensorflow_transform==0.24
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
print('TF version: {}'.format(tf.__version__))
print('TF-Hub version: {}'.format(hub.__version__))
print('TF-Transform version: {}'.format(tft.__version__))
print('Apache Beam version: {}'.format(beam.__version__))
TF version: 2.3.1
TF-Hub version: 0.10.0
TF-Transform version: 0.24.0
Apache Beam version: 2.25.0

1. Download sample data

The A Million News Headlines dataset contains news headlines published over a period of 15 years, sourced from the reputable Australian Broadcasting Corp. (ABC). This news dataset is a summarized historical record of noteworthy events around the globe from early 2003 to the end of 2017, with a more granular focus on Australia.

Format: tab-separated, two-column data: 1) publication date and 2) headline text. We are only interested in the headline text.
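
A minimal sketch of reading this two-column layout with Python's standard csv module is shown here. The three sample lines mirror the layout of raw.tsv; reading from an in-memory string instead of the file is an assumption made to keep the example self-contained. This sketch skips the header row, whereas the tutorial's own preprocessing treats it as an ordinary line.

```python
import csv
import io

# Three lines in the same layout as raw.tsv (header plus two records).
sample = (
    'publish_date\theadline_text\n'
    '20030219\t"aba decides against community broadcasting licence"\n'
    '20030219\t"ambitious olsson wins triple jump"\n'
)

reader = csv.reader(io.StringIO(sample), delimiter='\t', quotechar='"')
next(reader)  # skip the header row
headlines = [row[1] for row in reader]  # keep only the headline column
print(headlines)
```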

wget 'https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true' -O raw.tsv
wc -l raw.tsv
head raw.tsv
--2020-12-03 12:12:21--  https://dataverse.harvard.edu/api/access/datafile/3450625?format=tab&gbrecs=true
Resolving dataverse.harvard.edu (dataverse.harvard.edu)...
Connecting to dataverse.harvard.edu (dataverse.harvard.edu)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 57600231 (55M) [text/tab-separated-values]
Saving to: ‘raw.tsv’

raw.tsv             100%[===================>]  54.93M  15.1MB/s    in 4.3s    

2020-12-03 12:12:27 (12.7 MB/s) - ‘raw.tsv’ saved [57600231/57600231]

1103664 raw.tsv
publish_date    headline_text
20030219    "aba decides against community broadcasting licence"
20030219    "act fire witnesses must be aware of defamation"
20030219    "a g calls for infrastructure protection summit"
20030219    "air nz staff in aust strike for pay rise"
20030219    "air nz strike to affect australian travellers"
20030219    "ambitious olsson wins triple jump"
20030219    "antic delighted with record breaking barca"
20030219    "aussie qualifier stosur wastes four memphis match"
20030219    "aust addresses un security council over iraq"

For simplicity, we keep only the headline text and remove the publication date.

!rm -r corpus
!mkdir corpus

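with open('corpus/text.txt', 'w') as out_file:
  with open('raw.tsv', 'r') as in_file:
    for line in in_file:
      # Keep only the second column (the headline) and drop the quotes.
      headline = line.split('\t')[1].strip().strip('"')
      out_file.write(headline + "\n")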
tail corpus/text.txt
severe storms forecast for nye in south east queensland
snake catcher pleads for people not to kill reptiles
south australia prepares for party to welcome new year
strikers cool off the heat with big win in adelaide
stunning images from the sydney to hobart yacht
the ashes smiths warners near miss liven up boxing day test
timelapse: brisbanes new year fireworks
what 2017 meant to the kids of australia
what the papodopoulos meeting may mean for ausus
who is george papadopoulos the former trump campaign aide

Helper function to load a TF-Hub module

def load_module(module_url):
  embed_module = hub.Module(module_url)
  placeholder = tf.placeholder(dtype=tf.string)
  embed = embed_module(placeholder)
  session = tf.Session()
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  print('TF-Hub module is loaded.')

  def _embeddings_fn(sentences):
    # Feed a batch of sentences and return their embeddings as an array.
    computed_embeddings = session.run(
        embed, feed_dict={placeholder: sentences})
    return computed_embeddings

  return _embeddings_fn

2. Generate embeddings for the data

In this tutorial, we use the Universal Sentence Encoder to generate embeddings for the headline data. The sentence embeddings can then be easily used to compute sentence-level meaning similarity. We run the embedding generation process using Apache Beam and TF-Transform.
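
One common way to compare two sentence embeddings is cosine similarity. The sketch below is an illustration only: the three-dimensional vectors and their names are invented, whereas the encoder used in this tutorial produces 512-dimensional vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, for illustration only.
storm_warning = [0.9, 0.1, 0.0]
severe_weather = [0.8, 0.2, 0.1]
football_result = [0.0, 0.2, 0.9]

related = cosine_similarity(storm_warning, severe_weather)
unrelated = cosine_similarity(storm_warning, football_result)
print(related, unrelated)
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.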

Embedding extraction method

encoder = None

def embed_text(text, module_url, random_projection_matrix):
  # Beam will run this function in different processes that need to
  # import hub and load embed_fn (if not previously loaded)
  global encoder
  if not encoder:
    encoder = hub.Module(module_url)
  embedding = encoder(text)
  if random_projection_matrix is not None:
    # Perform random projection for the embedding
    embedding = tf.matmul(
        embedding, tf.cast(random_projection_matrix, embedding.dtype))
  return embedding
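
The module-level `encoder` variable above is a load-once cache: each worker process loads the module the first time it is needed and reuses it afterwards. The sketch below shows the same pattern in isolation. `_expensive_load` and the toy "model" it returns are hypothetical stand-ins, not TF-Hub calls.

```python
_cached_model = None
load_count = 0

def _expensive_load(name):
    """Stand-in for a slow model load (hypothetical; not a TF-Hub call)."""
    global load_count
    load_count += 1
    return lambda text: "{}:{}".format(name, len(text))

def embed(text, name="toy-encoder"):
    """Load the model on first use, then reuse it for later calls."""
    global _cached_model
    if _cached_model is None:
        _cached_model = _expensive_load(name)
    return _cached_model(text)

results = [embed(t) for t in ["a", "bb", "ccc"]]
print(results, load_count)
```

However many inputs pass through `embed`, the loader runs only once per process.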

Make the TFT preprocess_fn method

def make_preprocess_fn(module_url, random_projection_matrix=None):
  '''Makes a tft preprocess_fn'''

  def _preprocess_fn(input_features):
    '''tft preprocess_fn'''
    text = input_features['text']
    # Generate the embedding for the input text
    embedding = embed_text(text, module_url, random_projection_matrix)

    output_features = {
        'text': text,
        'embedding': embedding
    }

    return output_features

  return _preprocess_fn

Create dataset metadata

def create_metadata():
  '''Creates metadata for the raw data'''
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import schema_utils
  feature_spec = {'text': tf.FixedLenFeature([], dtype=tf.string)}
  schema = schema_utils.schema_from_feature_spec(feature_spec)
  metadata = dataset_metadata.DatasetMetadata(schema)
  return metadata

Beam pipeline

def run_hub2emb(args):
  '''Runs the embedding generation pipeline'''

  options = beam.options.pipeline_options.PipelineOptions(**args)
  args = namedtuple("options", args.keys())(*args.values())

  raw_metadata = create_metadata()
  converter = tft.coders.CsvCoder(
      column_names=['text'], schema=raw_metadata.schema)

  with beam.Pipeline(args.runner, options=options) as pipeline:
    with tft_beam.Context(args.temporary_dir):
      # Read the sentences from the input file
      sentences = (
          pipeline
          | 'Read sentences from files' >> beam.io.ReadFromText(
              file_pattern=args.data_dir)
          | 'Convert to dictionary' >> beam.Map(converter.decode)
      )

      sentences_dataset = (sentences, raw_metadata)
      preprocess_fn = make_preprocess_fn(args.module_url, args.random_projection_matrix)
      # Generate the embeddings for the sentence using the TF-Hub module
      embeddings_dataset, _ = (
          sentences_dataset
          | 'Extract embeddings' >> tft_beam.AnalyzeAndTransformDataset(preprocess_fn)
      )

      embeddings, transformed_metadata = embeddings_dataset
      # Write the embeddings to TFRecords files
      embeddings | 'Write embeddings to TFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
          file_path_prefix='{}/emb'.format(args.output_dir),
          file_name_suffix='.tfrecords',
          coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))

Generating the random projection weight matrix

Random projection is a simple yet powerful technique for reducing the dimensionality of a set of points that lie in Euclidean space. For the theoretical background, see the Johnson-Lindenstrauss lemma.
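
Informally, the lemma says that n points can be mapped linearly into k dimensions so that all pairwise squared distances change by at most a factor of (1 ± eps), provided k is large enough. One commonly quoted sufficient condition, which is also the one documented for scikit-learn's johnson_lindenstrauss_min_dim helper, is k >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3). The sketch below evaluates that bound; the function name `jl_min_dim` and the eps values tried are assumptions made for this example.

```python
import math

def jl_min_dim(n_points, eps):
    """Smallest integer k with k >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    bound = 4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0)
    return math.ceil(bound)

n_headlines = 1103664  # number of lines in raw.tsv, as reported by wc -l
for eps in (0.1, 0.5, 0.9):
    print(eps, jl_min_dim(n_headlines, eps))
```

For the values tried here the bound comes out well above the 64 dimensions used in this tutorial, so projected_dim = 64 is best read as an empirical trade-off between speed and accuracy rather than a setting covered by this worst-case guarantee.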

Reducing the dimensionality of the embeddings with random projection means less time is needed to build and query the ANN index.
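
The following plain-Python sketch gives a feel for why this works. It draws a Gaussian matrix with entries from N(0, 1 / projected_dim), the variance that scikit-learn documents for its Gaussian random matrix, projects two random stand-in vectors from 512 to 64 dimensions, and compares their distance before and after. The vectors are random numbers, not real sentence embeddings, and the fixed seed is only there to make the run repeatable.

```python
import math
import random

rng = random.Random(0)
original_dim, projected_dim = 512, 64

# Gaussian projection matrix with entries drawn from N(0, 1 / projected_dim),
# which keeps squared lengths unchanged in expectation.
scale = 1.0 / math.sqrt(projected_dim)
projection = [[rng.gauss(0.0, scale) for _ in range(projected_dim)]
              for _ in range(original_dim)]

def project(vec):
    """Multiply a row vector of length original_dim by the projection matrix."""
    return [sum(vec[i] * projection[i][j] for i in range(original_dim))
            for j in range(projected_dim)]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two random stand-in "embeddings" (not real sentence embeddings).
u = [rng.gauss(0.0, 1.0) for _ in range(original_dim)]
v = [rng.gauss(0.0, 1.0) for _ in range(original_dim)]

ratio = distance(project(u), project(v)) / distance(u, v)
print("projected / original distance: {:.3f}".format(ratio))
```

The ratio typically stays close to 1, while each vector becomes eight times smaller.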

In this tutorial, we use Gaussian random projection from the Scikit-learn library.

def generate_random_projection_weights(original_dim, projected_dim):
  random_projection_matrix = None
  if projected_dim and original_dim > projected_dim:
    # Transpose so the matrix has shape (original_dim, projected_dim).
    random_projection_matrix = gaussian_random_matrix(
        n_components=projected_dim, n_features=original_dim).T
    print("A Gaussian random weight matrix was creates with shape of {}".format(random_projection_matrix.shape))
    print('Storing random projection matrix to disk...')
    with open('random_projection_matrix', 'wb') as handle:
      pickle.dump(random_projection_matrix,
                  handle, protocol=pickle.HIGHEST_PROTOCOL)

  return random_projection_matrix

Set parameters

If you want to build an index using the original embedding space without random projection, set the projected_dim parameter to None. Note that this will slow down the indexing step for high-dimensional embeddings.

module_url = 'https://tfhub.dev/google/universal-sentence-encoder/2'
projected_dim = 64

Run pipeline

import tempfile

output_dir = pathlib.Path(tempfile.mkdtemp())
temporary_dir = pathlib.Path(tempfile.mkdtemp())

g = tf.Graph()
with g.as_default():
  # Embed an empty string once to discover the module's output dimension.
  original_dim = load_module(module_url)(['']).shape[1]
  random_projection_matrix = None

  if projected_dim:
    random_projection_matrix = generate_random_projection_weights(
        original_dim, projected_dim)

args = {
    'job_name': 'hub2emb-{}'.format(datetime.utcnow().strftime('%y%m%d-%H%M%S')),
    'runner': 'DirectRunner',
    'batch_size': 1024,
    'data_dir': 'corpus/*.txt',
    'output_dir': output_dir,
    'temporary_dir': temporary_dir,
    'module_url': module_url,
    'random_projection_matrix': random_projection_matrix,
}

print("Pipeline args are set.")
args
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
A Gaussian random weight matrix was creates with shape of (512, 64)
Storing random projection matrix to disk...
Pipeline args are set.

{'job_name': 'hub2emb-201203-121305',
 'runner': 'DirectRunner',
 'batch_size': 1024,
 'data_dir': 'corpus/*.txt',
 'output_dir': PosixPath('/tmp/tmp3_9agsp3'),
 'temporary_dir': PosixPath('/tmp/tmp75ty7xfk'),
 'module_url': 'https://tfhub.dev/google/universal-sentence-encoder/2',
 'random_projection_matrix': array([[ 0.21470759, -0.05258816, -0.0972597 , ...,  0.04385087,
         -0.14274348,  0.11220471],
        [ 0.03580492, -0.16426251, -0.14089037, ...,  0.0101535 ,
         -0.22515438, -0.21514454],
        [-0.15639698,  0.01808027, -0.13684782, ...,  0.11841098,
         -0.04303762,  0.00745478],
        [-0.18584684,  0.14040793,  0.18339619, ...,  0.13763638,
         -0.13028201, -0.16183348],
        [ 0.20997704, -0.2241034 , -0.12709368, ..., -0.03352462,
          0.11281993, -0.16342795],
        [-0.23761595,  0.00275779, -0.1585855 , ..., -0.08995121,
          0.1475089 , -0.26595401]])}
!rm -r {output_dir}
!rm -r {temporary_dir}

print("Running pipeline...")
%time run_hub2emb(args)
print("Pipeline is done.")
Running pipeline...

Warning:tensorflow:Tensorflow version (2.3.1) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 

Warning:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow/python/saved_model/signature_def_utils_impl.py:201: build_tensor_info (from tensorflow.python.saved_model.utils_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.

INFO:tensorflow:Assets added to graph.

INFO:tensorflow:No assets to write.

INFO:tensorflow:SavedModel written to: /tmp/tmp75ty7xfk/tftransform_tmp/0839c04b1a8d4dd0b3d2832fbe9f5904/saved_model.pb

Warning:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.6/site-packages/tensorflow_transform/tf_utils.py:218: Tensor.experimental_ref (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use ref() instead.

Warning:tensorflow:Tensorflow version (2.3.1) found. Note that Tensorflow Transform support for TF 2.0 is currently in beta, and features such as tf.function may not work as intended. 

Warning:tensorflow:You are passing instance dicts and DatasetMetadata to TFT which will not provide optimal performance. Consider following the TFT guide to upgrade to the TFXIO format (Apache Arrow RecordBatch).

WARNING:apache_beam.io.tfrecordio:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

CPU times: user 2min 50s, sys: 6.6 s, total: 2min 57s
Wall time: 2min 40s
Pipeline is done.

ls {output_dir}

Read some of the generated embeddings...

import itertools

embed_file = os.path.join(output_dir, 'emb-00000-of-00001.tfrecords')
sample = 5
record_iterator = tf.io.tf_record_iterator(path=embed_file)
for string_record in itertools.islice(record_iterator, sample):
  example = tf.train.Example()
  # Decode the serialized record into the Example proto.
  example.ParseFromString(string_record)
  text = example.features.feature['text'].bytes_list.value
  embedding = np.array(example.features.feature['embedding'].float_list.value)
  print("Embedding dimensions: {}".format(embedding.shape[0]))
  print("{}: {}".format(text, embedding[:10]))
WARNING:tensorflow:From <ipython-input-1-3d6f4d54c65b>:5: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 

Embedding dimensions: 64
[b'headline_text']: [-0.04724706  0.27573067 -0.02340046  0.12461437  0.04809146  0.00246292
  0.15367804 -0.17551982 -0.02778188 -0.185176  ]
Embedding dimensions: 64
[b'aba decides against community broadcasting licence']: [-0.0466345   0.00110549 -0.08875479  0.05938878  0.01933165 -0.05704207
  0.18913773 -0.12833942  0.1816328   0.06035798]
Embedding dimensions: 64
[b'act fire witnesses must be aware of defamation']: [-0.31556517 -0.07618773 -0.14239314 -0.14500496  0.04438541 -0.00983415
  0.01349827 -0.15908629 -0.12947078  0.31871504]
Embedding dimensions: 64
[b'a g calls for infrastructure protection summit']: [ 0.15422247 -0.09829048 -0.16913125 -0.17129296  0.01204466 -0.16008876
 -0.00540507 -0.20552996  0.11388192 -0.03878446]
Embedding dimensions: 64
[b'air nz staff in aust strike for pay rise']: [ 0.13039729 -0.06921542 -0.08830801 -0.09704516 -0.05936369 -0.13036506
 -0.16644046 -0.06228216  0.00742535 -0.13592219]

3. Build the ANN index for the embeddings

ANNOY (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings for searching for points in space that are close to a given query point. It also creates large read-only, file-based data structures that are memory-mapped. It is built and used by Spotify for music recommendations.
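
The index in this tutorial is built with ANNOY's 'angular' metric. The ANNOY README describes it as the Euclidean distance between normalized vectors, which equals sqrt(2 * (1 - cos(u, v))). The sketch below checks that identity in plain Python on two toy vectors; it is an illustration of the formula, not ANNOY's own implementation.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))), computed from the cosine of the angle."""
    nu, nv = normalize(u), normalize(v)
    cosine = sum(a * b for a, b in zip(nu, nv))
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine)))

def euclidean_distance(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [3.0, 4.0], [4.0, 3.0]
via_cosine = angular_distance(u, v)
via_normalized = euclidean_distance(normalize(u), normalize(v))
print(via_cosine, via_normalized)
```

Because the distance depends only on the angle between vectors, the length of an embedding does not affect which neighbors are returned.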

def build_index(embedding_files_pattern, index_filename, vector_length,
    metric='angular', num_trees=100):
  '''Builds an ANNOY index'''

  annoy_index = annoy.AnnoyIndex(vector_length, metric=metric)
  # Mapping between the item and its identifier in the index
  mapping = {}

  embed_files = tf.gfile.Glob(embedding_files_pattern)
  print('Found {} embedding file(s).'.format(len(embed_files)))

  item_counter = 0
  for f, embed_file in enumerate(embed_files):
    print('Loading embeddings in file {} of {}...'.format(
      f+1, len(embed_files)))
    record_iterator = tf.io.tf_record_iterator(
      path=embed_file)

    for string_record in record_iterator:
      example = tf.train.Example()
      example.ParseFromString(string_record)
      text = example.features.feature['text'].bytes_list.value[0].decode("utf-8")
      mapping[item_counter] = text
      embedding = np.array(
        example.features.feature['embedding'].float_list.value)
      annoy_index.add_item(item_counter, embedding)
      item_counter += 1
      if item_counter % 100000 == 0:
        print('{} items loaded to the index'.format(item_counter))

  print('A total of {} items added to the index'.format(item_counter))

  print('Building the index with {} trees...'.format(num_trees))
  annoy_index.build(n_trees=num_trees)
  print('Index is successfully built.')

  print('Saving index to disk...')
  annoy_index.save(index_filename)
  print('Index is saved to disk.')
  print("Index file size: {} GB".format(
    round(os.path.getsize(index_filename) / float(1024 ** 3), 2)))

  print('Saving mapping to disk...')
  with open(index_filename + '.mapping', 'wb') as handle:
    pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)
  print('Mapping is saved to disk.')
  print("Mapping file size: {} MB".format(
    round(os.path.getsize(index_filename + '.mapping') / float(1024 ** 2), 2)))
embedding_files = "{}/emb-*.tfrecords".format(output_dir)
embedding_dimension = projected_dim
index_filename = "index"

!rm {index_filename}
!rm {index_filename}.mapping

%time build_index(embedding_files, index_filename, embedding_dimension)
Found 1 embedding file(s).
Loading embeddings in file 1 of 1...
100000 items loaded to the index
200000 items loaded to the index
300000 items loaded to the index
400000 items loaded to the index
500000 items loaded to the index
600000 items loaded to the index
700000 items loaded to the index
800000 items loaded to the index
900000 items loaded to the index
1000000 items loaded to the index
1100000 items loaded to the index
A total of 1103664 items added to the index
Building the index with 100 trees...
Index is successfully built.
Saving index to disk...
Index is saved to disk.
Index file size: 1.66 GB
Saving mapping to disk...
Mapping is saved to disk.
Mapping file size: 50.61 MB
CPU times: user 6min 10s, sys: 3.7 s, total: 6min 14s
Wall time: 1min 36s

corpus  index.mapping         raw.tsv
index   random_projection_matrix  semantic_approximate_nearest_neighbors.ipynb
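
The index file holds only integer ids and vectors, so the separate index.mapping file is what turns search results back into headlines. The sketch below shows that id-to-text round trip on its own. The two headlines and the `neighbor_ids` list are sample values chosen for illustration, and pickling to an in-memory byte string instead of a file is an assumption made to keep the example self-contained.

```python
import pickle

# Sample headlines; in the tutorial these come from the TFRecord files.
headlines = ["ambitious olsson wins triple jump",
             "air nz strike to affect australian travellers"]

# The integer id is the position at which each item was added to the index.
mapping = {item_id: text for item_id, text in enumerate(headlines)}

# Round-trip through pickle, as is done for the '.mapping' file.
restored = pickle.loads(pickle.dumps(mapping, protocol=pickle.HIGHEST_PROTOCOL))

# An ANN lookup returns ids; the mapping turns them back into text.
neighbor_ids = [1, 0]
matched = [restored[i] for i in neighbor_ids]
print(matched)
```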

4. Use the index for similarity matching

Now we can use the ANN index to find news headlines that are semantically close to an input query.

Load the index and the mapping files

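# Use the same metric the index was built with (build_index defaults to 'angular').
index = annoy.AnnoyIndex(embedding_dimension, metric='angular')
index.load(index_filename, prefault=True)
print('Annoy index is loaded.')
with open(index_filename + '.mapping', 'rb') as handle:
  mapping = pickle.load(handle)
print('Mapping file is loaded.')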
Annoy index is loaded.

Mapping file is loaded.

Similarity matching method

def find_similar_items(embedding, num_matches=5):
  '''Finds similar items to a given embedding in the ANN index'''
  ids = index.get_nns_by_vector(
      embedding, num_matches, search_k=-1, include_distances=False)
  items = [mapping[i] for i in ids]
  return items
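
Because the search is approximate, it can be useful to measure how many of the true nearest neighbors it actually returns. A simple measure is recall at k: the fraction of the exact top-k ids that also appear in the approximate top-k. The helper below is a sketch; `recall_at_k` and the id lists are hypothetical and are not produced by this tutorial. According to the ANNOY documentation, a larger search_k inspects more nodes and so tends to raise recall at the cost of speed.

```python
def recall_at_k(approximate_ids, exact_ids):
    """Fraction of the exact top-k neighbors that the approximate search found."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approximate_ids)) / len(exact)

# Hypothetical id lists for one query (illustrative values only).
exact_top5 = [12, 7, 93, 41, 5]
approx_top5 = [12, 93, 7, 58, 5]
recall = recall_at_k(approx_top5, exact_top5)
print(recall)
```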

Extract embedding from a given query

# Load the TF-Hub module
print("Loading the TF-Hub module...")
g = tf.Graph()
with g.as_default():
  embed_fn = load_module(module_url)
print("TF-Hub module is loaded.")

random_projection_matrix = None
if os.path.exists('random_projection_matrix'):
  print("Loading random projection matrix...")
  with open('random_projection_matrix', 'rb') as handle:
    random_projection_matrix = pickle.load(handle)
  print('random projection matrix is loaded.')

def extract_embeddings(query):
  '''Generates the embedding for the query'''
  query_embedding = embed_fn([query])[0]
  if random_projection_matrix is not None:
    # Project the query with the same matrix that was used for the corpus.
    query_embedding = query_embedding.dot(random_projection_matrix)
  return query_embedding
Loading the TF-Hub module...
INFO:tensorflow:Saver not created because there are no variables in the graph to restore

TF-Hub module is loaded.
TF-Hub module is loaded.
Loading random projection matrix...
random projection matrix is loaded.

extract_embeddings("Hello Machine Learning!")[:10]
array([-0.06277051,  0.14012653, -0.15893948,  0.15775941, -0.1226441 ,
       -0.11202384,  0.07953477, -0.08003543,  0.03763271,  0.0302215 ])
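
The query has to be projected with the same matrix that was applied to the corpus, which is why the matrix is stored on disk and reloaded here. The toy sketch below illustrates the point: a query identical to a document lands on the same point only when both go through the same matrix. The dimensions, the random values, and the `project` helper are assumptions made for this example.

```python
import random

rng = random.Random(42)
original_dim, projected_dim = 8, 3  # toy sizes; the tutorial uses 512 and 64

def random_matrix():
    """A random (original_dim x projected_dim) matrix of Gaussian values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(projected_dim)]
            for _ in range(original_dim)]

def project(vec, matrix):
    """Row vector times matrix, like query_embedding.dot(matrix)."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

stored_matrix = random_matrix()  # stand-in for the pickled matrix
document = [rng.gauss(0.0, 1.0) for _ in range(original_dim)]
query = list(document)  # an identical "query"

indexed_point = project(document, stored_matrix)
same_matrix_point = project(query, stored_matrix)
other_matrix_point = project(query, random_matrix())

print(same_matrix_point == indexed_point)
print(other_matrix_point == indexed_point)
```

Regenerating the matrix at query time would place queries in a different space from the indexed items, and the distances between them would no longer be meaningful.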

Enter a query to find the most similar items

query = "confronting global challenges"
print("Generating embedding for the query...")
%time query_embedding = extract_embeddings(query)

print("Finding relevant items in the index...")
%time items = find_similar_items(query_embedding, 10)

for item in items:
  print(item)

Generating embedding for the query...
CPU times: user 32.9 ms, sys: 19.8 ms, total: 52.7 ms
Wall time: 6.96 ms

Finding relevant items in the index...
CPU times: user 7.19 ms, sys: 370 µs, total: 7.56 ms
Wall time: 953 µs

confronting global challenges
downer challenges un to follow aust example
fairfax loses oshane challenge
jericho social media and the border farce
territory on search for raw comedy talent
interview gred jericho
interview: josh frydenberg; environment and energy
interview: josh frydenberg; environment and energy
world science festival music and climate change
interview with aussie bobsledder

Want to learn more?

You can learn more about TensorFlow at tensorflow.org and see the TF-Hub API documentation at tensorflow.org/hub. Find available TensorFlow Hub modules at tfhub.dev, including more text embedding modules and image feature vector modules.

Also check out the Machine Learning Crash Course, which is Google's fast-paced, practical introduction to machine learning.