Multilingual Universal Sentence Encoder Q&A Retrieval


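This is a demo of using the Universal Encoder Multilingual Q&A model for question-answer retrieval of text, illustrating the use of the model's question_encoder and response_encoder. We use sentences from SQuAD paragraphs as the demo dataset. Each sentence and its context (the text surrounding the sentence) is encoded into a high-dimensional embedding with the response_encoder. These embeddings are stored in an index built using the simpleneighbors library for question-answer retrieval.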

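On retrieval, a random question is selected from the SQuAD dataset and encoded into a high-dimensional embedding with the question_encoder. Querying the simpleneighbors index with that embedding returns a list of nearest neighbors in semantic space.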


All currently hosted text embedding models, as well as all models that have been trained on SQuAD, can be found on TF Hub.


Set up environment

Set up common imports and functions

[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/

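Run the following code block to download the SQuAD dataset and extract it into:

  • sentences is a list of (text, context) tuples. Each paragraph from the SQuAD dataset is split into sentences using the NLTK library, and each sentence is paired with its paragraph text to form a (text, context) tuple.
  • questions is a list of (question, answer) tuples.

Note: You can use this demo to index the SQuAD train dataset or the smaller dev dataset (1.1 or 2.0) by selecting the squad_url below.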
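The extraction step can be sketched as follows. This is a minimal illustration, not the notebook's own cell: it runs on a tiny in-memory dictionary shaped like a SQuAD JSON file instead of downloading from squad_url, and it uses a naive regex splitter as a stand-in for the NLTK sentence tokenizer. The helper names split_sentences, extract_sentences and extract_questions are hypothetical.

```python
import re


def split_sentences(paragraph):
    # Naive stand-in for NLTK's sentence tokenizer (assumption for this sketch):
    # split wherever ., ! or ? is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [p.strip() for p in parts if p.strip()]


def extract_sentences(squad):
    # Build (text, context) tuples: every sentence paired with its full paragraph.
    sentences = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for sentence in split_sentences(context):
                sentences.append((sentence, context))
    return sentences


def extract_questions(squad):
    # Build (question, answer) tuples, skipping questions without an answer
    # (SQuAD 2.0 contains unanswerable questions with an empty answers list).
    questions = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    questions.append((qa["question"], qa["answers"][0]["text"]))
    return questions


# Tiny made-up example in the SQuAD JSON layout.
toy_squad = {
    "data": [{
        "title": "Toy",
        "paragraphs": [{
            "context": "Roots premiered in January. It was a ratings success.",
            "qas": [
                {"question": "When did Roots premiere?",
                 "answers": [{"text": "January", "answer_start": 19}]},
                {"question": "Who directed it?", "answers": []},
            ],
        }],
    }]
}

sentences = extract_sentences(toy_squad)
questions = extract_questions(toy_squad)
print("%d sentences, %d questions extracted" % (len(sentences), len(questions)))
```

Keeping the whole paragraph as the context of every sentence matches the (text, context) pairing described above, so the encoder sees both the candidate answer sentence and its surroundings.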

Download and extract SQuAD data

10452 sentences, 10552 questions extracted from SQuAD

Example sentence and context:


('The success of Roots, Happy Days and The Love Boat allowed the network to '
 'take first place in the ratings for the first time in the 1976–77 season.')


('For its part, the television network produced a few new hits during 1977: '
 'January saw the premiere of Roots, a miniseries based on an Alex Haley novel '
 'that was published the previous year; in September, The Love Boat, a '
 'comedy-drama anthology series produced by Aaron Spelling which was based '
 'around the crew of a cruise ship and featured three stories centered partly '
 "on the ship's various passengers; although critically lambasted, the series "
 'turned out to be a ratings success and lasted nine seasons. Roots went on to '
 'become one of the highest-rated programs in American television history, '
 'with unprecedented ratings for its finale. The success of Roots, Happy Days '
 'and The Love Boat allowed the network to take first place in the ratings for '
 'the first time in the 1976–77 season. On September 13, 1977, the network '
 'debuted Soap, a controversial soap opera parody which became known for being '
 'the first television series to feature an openly gay main character (played '
 'by a then-unknown Billy Crystal); it last ran on the network on April 20, '

The following code block sets up the TensorFlow graph g and session with the Universal Encoder Multilingual Q&A model's question_encoder and response_encoder signatures.

Load model from tensorflow hub

The following code block computes the embeddings for all the (text, context) tuples using the response_encoder and stores them in a simpleneighbors index.
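A rough sketch of the batching pattern behind this step is shown below. It is not the notebook's cell: toy_response_encoder is a hypothetical stand-in that produces letter-frequency vectors instead of calling the real model, and a plain Python list of (item, vector) pairs stands in for the simpleneighbors index. The batch size of 100 is also an assumption.

```python
import math


def batches(items, batch_size):
    # Yield consecutive slices of at most batch_size items.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_response_encoder(texts, contexts):
    # Hypothetical stand-in for the model's response_encoder signature:
    # a 26-dimensional letter-frequency vector per (text, context) pair,
    # scaled to unit length.
    vectors = []
    for text, context in zip(texts, contexts):
        counts = [0.0] * 26
        for ch in (text + " " + context).lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        norm = math.sqrt(sum(c * c for c in counts)) or 1.0
        vectors.append([c / norm for c in counts])
    return vectors


def build_index(sentences, encoder, batch_size=100):
    # Encode the (text, context) tuples batch by batch and collect
    # (tuple, embedding) pairs; a list stands in for the real index here.
    index = []
    for batch in batches(sentences, batch_size):
        texts = [text for text, _ in batch]
        contexts = [context for _, context in batch]
        for pair, vector in zip(batch, encoder(texts, contexts)):
            index.append((pair, vector))
    return index


# Made-up data: 250 sentences sharing one context paragraph.
sentences = [("Sentence number %d." % i, "Some surrounding paragraph text.")
             for i in range(250)]
index = build_index(sentences, toy_response_encoder, batch_size=100)
print("index for %d sentences built." % len(index))
```

Encoding in batches keeps memory use bounded when the dataset has thousands of sentences, and passing the encoder in as a function makes it easy to swap the toy stand-in for a real model call.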

Compute embeddings and build simpleneighbors index

Computing embeddings for 10452 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10452 sentences built.

On retrieval, the question is encoded using the question_encoder, and the question embedding is used to query the simpleneighbors index.
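The retrieval step amounts to ranking indexed items by similarity to the question embedding. The sketch below illustrates that idea under loud assumptions: toy_encode is a hypothetical bag-of-words stand-in for both question_encoder and response_encoder, and the exhaustive cosine ranking in nearest replaces the simpleneighbors index lookup. The example sentences paraphrase the sample paragraph shown earlier.

```python
import math
import re
from collections import Counter


def toy_encode(text):
    # Hypothetical stand-in for the model's encoders: a sparse
    # bag-of-words vector (word -> count).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(index, query_vector, n=3):
    # Exhaustive search: score every indexed item, return the n best.
    scored = [(cosine(query_vector, vector), item) for item, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:n]]


sentences = [
    "Roots was a miniseries based on an Alex Haley novel.",
    "The Love Boat lasted nine seasons.",
    "Soap was a controversial soap opera parody.",
]
index = [(s, toy_encode(s)) for s in sentences]

question = "How many seasons did The Love Boat last?"
results = nearest(index, toy_encode(question), n=2)
for rank, sentence in enumerate(results, start=1):
    print(rank, sentence)
```

A word-overlap encoder only matches surface vocabulary. The point of the real model is that a question and its answer sentence land close together in the embedding space even when they share few words, and across languages.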

Retrieve nearest neighbors for a random question from SQuAD