Multilingual Universal Sentence Encoder Q&A 检索

在 TensorFlow.org 上查看 在 Google Colab 中运行 在 GitHub 上查看源代码 下载笔记本 查看 TF Hub 模型

这是使用 Univeral Encoder Multilingual Q&A 模型进行文本问答检索的演示,其中对模型的 question_encoderresponse_encoder 的用法进行了说明。我们使用来自 SQuAD 段落的句子作为演示数据集,每个句子及其上下文(句子周围的文本)都使用 response_encoder 编码为高维嵌入向量。这些嵌入向量存储在使用 simpleneighbors 库构建的索引中,用于问答检索。

检索时,从 SQuAD 数据集中随机选择一个问题,并使用 question_encoder 将其编码为高维嵌入向量,然后查询 simpleneighbors 索引会返回语义空间中最近邻的列表。

更多模型

您可以在此处找到所有当前托管的文本嵌入向量模型,还可以在此处找到所有在 SQuADYou 上训练过的模型。

安装

Setup Environment

Setup common imports and functions

2023-11-07 18:08:09.796782: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.465902: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.465997: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.466007: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

运行以下代码块,下载并将 SQuAD 数据集提取到:

  • 句子是(文本, 上下文)元组的列表,SQuAD 数据集中的每个段落都用 NLTK 库拆分成句子,并且句子和段落文本构成(文本, 上下文)元组。
  • 问题是(问题, 答案)元组的列表。

注:您可以选择下面的 squad_url,使用本演示为 SQuAD 训练数据集或较小的 dev 数据集(1.1 或 2.0)建立索引。

Download and extract SQuAD data

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('While official data does not capture the real extent of private schooling in '
 'the country, various studies have reported unpopularity of government '
 'schools and an increasing number of private schools.')

context:

('Legally, only non-profit trusts and societies can run schools in India. They '
 'will have to satisfy a number of infrastructure and human resource related '
 'criteria to get Recognition (a form of license) from the government. Critics '
 'of this system point out that this leads to corruption by school inspectors '
 'who check compliance and to fewer schools in a country that has the largest '
 'adult illiterate population in the world. While official data does not '
 'capture the real extent of private schooling in the country, various studies '
 'have reported unpopularity of government schools and an increasing number of '
 'private schools. The Annual Status of Education Report (ASER), which '
 'evaluates learning levels in rural India, has been reporting poorer academic '
 'achievement in government schools than in private schools. A key difference '
 'between the government and private schools is that the medium of education '
 'in private schools is English while it is the local language in government '
 'schools.')

以下代码块使用 Universal Encoder Multilingual Q&A 模型question_encoderresponse_encoder 签名对 TensorFlow 计算图 g会话进行设置。

Load model from tensorflow hub

2023-11-07 18:08:17.196246: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196359: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196424: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196485: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.255270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.255499: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

以下代码块计算所有文本的嵌入向量和上下文元组,并使用 response_encoder 将它们存储在 simpleneighbors 索引中。

Compute embeddings and build simpleneighbors index

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.

检索时,使用 question_encoder 对问题进行编码,而问题嵌入向量用于查询 simpleneighbors 索引。

Retrieve nearest neighbors for a random question from SQuAD