Multilingual Universal Sentence Encoder Q&A 检索

在 TensorFlow.org 上查看

在 Google Colab 中运行

在 GitHub 上查看源代码

下载笔记本

查看 TF Hub 模型

这是使用 Univeral Encoder Multilingual Q&A 模型进行文本问答检索的演示，其中对模型的 question_encoder 和 response_encoder 的用法进行了说明。我们使用来自 SQuAD 段落的句子作为演示数据集，每个句子及其上下文（句子周围的文本）都使用 response_encoder 编码为高维嵌入向量。这些嵌入向量存储在使用 simpleneighbors 库构建的索引中，用于问答检索。

检索时，从 SQuAD 数据集中随机选择一个问题，并使用 question_encoder 将其编码为高维嵌入向量，然后查询 simpleneighbors 索引会返回语义空间中最近邻的列表。

安装

Setup Environment

%%capture
# Install the latest Tensorflow version.
!pip install -q "tensorflow-text==2.11.*"
!pip install -q simpleneighbors[annoy]
!pip install -q nltk
!pip install -q tqdm

Setup common imports and functions

import json
import nltk
import os
import pprint
import random
import simpleneighbors
import urllib
from IPython.display import HTML, display
from tqdm.notebook import tqdm

import tensorflow.compat.v2 as tf
import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer

nltk.download('punkt')


def download_squad(url):
  return json.load(urllib.request.urlopen(url))

def extract_sentences_from_squad_json(squad):
  all_sentences = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      sentences = nltk.tokenize.sent_tokenize(paragraph['context'])
      all_sentences.extend(zip(sentences, [paragraph['context']] * len(sentences)))
  return list(set(all_sentences)) # remove duplicates

def extract_questions_from_squad_json(squad):
  questions = []
  for data in squad['data']:
    for paragraph in data['paragraphs']:
      for qas in paragraph['qas']:
        if qas['answers']:
          questions.append((qas['question'], qas['answers'][0]['text']))
  return list(set(questions))

def output_with_highlight(text, highlight):
  output = "<li> "
  i = text.find(highlight)
  while True:
    if i == -1:
      output += text
      break
    output += text[0:i]
    output += '<b>'+text[i:i+len(highlight)]+'</b>'
    text = text[i+len(highlight):]
    i = text.find(highlight)
  return output + "</li>\n"

def display_nearest_neighbors(query_text, answer_text=None):
  query_embedding = model.signatures['question_encoder'](tf.constant([query_text]))['outputs'][0]
  search_results = index.nearest(query_embedding, n=num_results)

  if answer_text:
    result_md = '''
    <p>Random Question from SQuAD:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    <p>Answer:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % (query_text , answer_text)
  else:
    result_md = '''
    <p>Question:</p>
    <p>&nbsp;&nbsp;<b>%s</b></p>
    ''' % query_text

  result_md += '''
    <p>Retrieved sentences :
    <ol>
  '''

  if answer_text:
    for s in search_results:
      result_md += output_with_highlight(s, answer_text)
  else:
    for s in search_results:
      result_md += '<li>' + s + '</li>\n'

  result_md += "</ol>"
  display(HTML(result_md))

2023-11-07 18:08:09.796782: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.465902: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.465997: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-11-07 18:08:10.466007: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[nltk_data] Downloading package punkt to /home/kbuilder/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

运行以下代码块，下载并将 SQuAD 数据集提取到：

句子是（文本, 上下文）元组的列表，SQuAD 数据集中的每个段落都用 NLTK 库拆分成句子，并且句子和段落文本构成（文本, 上下文）元组。
问题是（问题, 答案）元组的列表。

注：您可以选择下面的 squad_url，使用本演示为 SQuAD 训练数据集或较小的 dev 数据集（1.1 或 2.0）建立索引。

Download and extract SQuAD data

squad_url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json'

squad_json = download_squad(squad_url)
sentences = extract_sentences_from_squad_json(squad_json)
questions = extract_questions_from_squad_json(squad_json)
print("%s sentences, %s questions extracted from SQuAD %s" % (len(sentences), len(questions), squad_url))

print("\nExample sentence and context:\n")
sentence = random.choice(sentences)
print("sentence:\n")
pprint.pprint(sentence[0])
print("\ncontext:\n")
pprint.pprint(sentence[1])
print()

10455 sentences, 10552 questions extracted from SQuAD https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json

Example sentence and context:

sentence:

('While official data does not capture the real extent of private schooling in '
 'the country, various studies have reported unpopularity of government '
 'schools and an increasing number of private schools.')

context:

('Legally, only non-profit trusts and societies can run schools in India. They '
 'will have to satisfy a number of infrastructure and human resource related '
 'criteria to get Recognition (a form of license) from the government. Critics '
 'of this system point out that this leads to corruption by school inspectors '
 'who check compliance and to fewer schools in a country that has the largest '
 'adult illiterate population in the world. While official data does not '
 'capture the real extent of private schooling in the country, various studies '
 'have reported unpopularity of government schools and an increasing number of '
 'private schools. The Annual Status of Education Report (ASER), which '
 'evaluates learning levels in rural India, has been reporting poorer academic '
 'achievement in government schools than in private schools. A key difference '
 'between the government and private schools is that the medium of education '
 'in private schools is English while it is the local language in government '
 'schools.')

以下代码块使用 Universal Encoder Multilingual Q&A 模型的 question_encoder 和 response_encoder 签名对 TensorFlow 计算图 g 和会话进行设置。

Load model from tensorflow hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3"
model = hub.load(module_url)

2023-11-07 18:08:17.196246: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196359: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196424: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.196485: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.255270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2023-11-07 18:08:17.255499: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1934] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

以下代码块计算所有文本的嵌入向量和上下文元组，并使用 response_encoder 将它们存储在 simpleneighbors 索引中。

Compute embeddings and build simpleneighbors index

batch_size = 100

encodings = model.signatures['response_encoder'](
  input=tf.constant([sentences[0][0]]),
  context=tf.constant([sentences[0][1]]))
index = simpleneighbors.SimpleNeighbors(
    len(encodings['outputs'][0]), metric='angular')

print('Computing embeddings for %s sentences' % len(sentences))
slices = zip(*(iter(sentences),) * batch_size)
num_batches = int(len(sentences) / batch_size)
for s in tqdm(slices, total=num_batches):
  response_batch = list([r for r, c in s])
  context_batch = list([c for r, c in s])
  encodings = model.signatures['response_encoder'](
    input=tf.constant(response_batch),
    context=tf.constant(context_batch)
  )
  for batch_index, batch in enumerate(response_batch):
    index.add_one(batch, encodings['outputs'][batch_index])

index.build()
print('simpleneighbors index for %s sentences built.' % len(sentences))

Computing embeddings for 10455 sentences
0%|          | 0/104 [00:00<?, ?it/s]
simpleneighbors index for 10455 sentences built.

检索时，使用 question_encoder 对问题进行编码，而问题嵌入向量用于查询 simpleneighbors 索引。

Retrieve nearest neighbors for a random question from SQuAD

num_results = 25

query = random.choice(questions)
display_nearest_neighbors(query[0], query[1])