Universal Sentence Encoder SentEval demo

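This Colab demonstrates the Universal Sentence Encoder CMLM model using the SentEval toolkit, a library for measuring the quality of sentence embeddings. SentEval includes a diverse set of downstream tasks that evaluate how well an embedding model generalizes and which linguistic properties it encodes.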

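Run the first two code blocks to set up the environment. In the third code block, you can pick a SentEval task to evaluate the model. A GPU runtime is recommended for this Colab.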

To learn more about the Universal Sentence Encoder CMLM model, see https://openreview.net/forum?id=WDVD4lUCTzU.

Install dependencies

pip install --quiet "tensorflow-text==2.8.*"
pip install --quiet torch==1.8.1

Download SentEval and task data

This step downloads SentEval from GitHub and runs the data script to fetch the task data. It may take up to 5 minutes to complete.

Install SentEval and download task data

rm -rf ./SentEval
git clone https://github.com/facebookresearch/SentEval.git
cd $PWD/SentEval/data/downstream && bash get_transfer_data.bash > /dev/null 2>&1

Cloning into 'SentEval'...
remote: Enumerating objects: 691, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 691 (delta 0), reused 2 (delta 0), pack-reused 689
Receiving objects: 100% (691/691), 33.25 MiB | 28.21 MiB/s, done.
Resolving deltas: 100% (434/434), done.
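
Because the download command above redirects its output to /dev/null, a failed download is silent. A quick check like the following sketch can confirm that task data is present before starting an evaluation. The helper name `missing_task_dirs` is hypothetical, and the directory layout (`SentEval/data/downstream/<task>`) and the example names `CR` and `MR` are assumptions about what the data script produces.

```python
import os

def missing_task_dirs(data_root, task_dirs):
    """Return the names in task_dirs that have no non-empty directory
    under <data_root>/downstream."""
    missing = []
    for name in task_dirs:
        path = os.path.join(data_root, 'downstream', name)
        # Treat an absent or empty directory as missing data.
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(name)
    return missing

# Assumed layout: <cwd>/SentEval/data/downstream/<task directory>.
data_root = os.path.join(os.getcwd(), 'SentEval', 'data')
print('Missing task data:', missing_task_dirs(data_root, ['CR', 'MR']))
```

An empty list means every directory that was checked exists and contains at least one file.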

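Execute a SentEval evaluation task

The following code block executes a SentEval task and outputs the results. Choose one of the following tasks to evaluate the USE CMLM model: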

MR  CR  SUBJ    MPQA    SST TREC    MRPC    SICK-E
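
The short names above do not all match the identifiers that SentEval's engine accepts: SentEval uses, for example, `SST2` and `SICKEntailment`. The sketch below shows one way to translate a short name into a value for the `TASK` variable. The mapping and the helper `to_senteval_task` are hypothetical, and mapping `SST` to `SST2` assumes the binary sentiment variant is the one intended.

```python
# Hypothetical mapping from the short names listed above to SentEval task
# identifiers. 'SST' -> 'SST2' assumes the binary sentiment variant.
DISPLAY_TO_SENTEVAL = {
    'MR': 'MR',
    'CR': 'CR',
    'SUBJ': 'SUBJ',
    'MPQA': 'MPQA',
    'SST': 'SST2',
    'TREC': 'TREC',
    'MRPC': 'MRPC',
    'SICK-E': 'SICKEntailment',
}

def to_senteval_task(name):
    """Translate a short task name; raise ValueError for unknown names."""
    try:
        return DISPLAY_TO_SENTEVAL[name]
    except KeyError:
        raise ValueError(
            'Unknown task %r; expected one of: %s'
            % (name, ', '.join(DISPLAY_TO_SENTEVAL))) from None

print(to_senteval_task('SICK-E'))
```

Failing early on an unrecognized name avoids discovering a typo only after the model has been loaded.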

Select the model, parameters, and task to run. The rapid prototyping parameters reduce computation time so that results arrive faster.

A task typically takes 5-15 minutes to complete with the 'rapid prototyping' parameters and up to an hour with the 'slower, best performance' parameters.

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}

For better results, use the 'slower, best performance' parameters; computation may take up to 1 hour:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                                 'tenacity': 5, 'epoch_size': 6}

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import sys
sys.path.append(f'{os.getcwd()}/SentEval')

import tensorflow as tf

# Prevent TF from claiming all GPU memory so there is some left for pytorch.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Memory growth needs to be the same across GPUs.
  for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

import tensorflow_hub as hub
import tensorflow_text
import senteval
import time

PATH_TO_DATA = f'{os.getcwd()}/SentEval/data'
MODEL = 'https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1'
PARAMS = 'rapid prototyping'
TASK = 'CR'

params_prototyping = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params_prototyping['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}

params_best = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_best['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 16,
                                 'tenacity': 5, 'epoch_size': 6}

params = params_best if PARAMS == 'slower, best performance' else params_prototyping

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# Load the encoder from MODEL so that changing the variable above takes effect.
encoder = hub.KerasLayer(MODEL)

inputs = tf.keras.Input(shape=(), dtype=tf.string)
outputs = encoder(preprocessor(inputs))

model = tf.keras.Model(inputs=inputs, outputs=outputs)

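def prepare(params, samples):
    # No task-specific setup is needed before encoding.
    return

def batcher(_, batch):
    # SentEval passes tokenized sentences; rejoin the tokens into strings and
    # substitute '.' for empty sentences.
    batch = [' '.join(sent) if sent else '.' for sent in batch]
    return model.predict(tf.constant(batch))["default"]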


se = senteval.engine.SE(params, batcher, prepare)
print("Evaluating task %s with %s parameters" % (TASK, PARAMS))
start = time.time()
results = se.eval(TASK)
end = time.time()
print('Time took on task %s : %.1f. seconds' % (TASK, end - start))
print(results)

Evaluating task CR with rapid prototyping parameters
Time took on task CR : 53.0. seconds
{'devacc': 90.42, 'acc': 88.98, 'ndev': 3775, 'ntest': 3775}
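
To make the `batcher` contract more concrete: SentEval hands each batch over as a list of tokenized sentences, and expects back one embedding vector per sentence. The sketch below replays that flow without TensorFlow. `join_tokens` and `toy_encoder` are hypothetical names, and the toy encoder is only a stand-in that illustrates shapes, not the USE CMLM model.

```python
import numpy as np

def join_tokens(batch):
    """Rebuild sentence strings from token lists; empty sentences become '.'."""
    return [' '.join(sent) if sent else '.' for sent in batch]

def toy_encoder(sentences, dim=8):
    """Hypothetical stand-in for the real model: one fixed-size vector per
    sentence, filled with the sentence's character count."""
    return np.array([[float(len(s))] * dim for s in sentences])

def batcher(params, batch):
    # Same structure as the batcher above, with the toy encoder swapped in.
    return toy_encoder(join_tokens(batch))

batch = [['great', 'product'], [], ['did', 'not', 'work']]
embeddings = batcher(None, batch)
print(join_tokens(batch))  # ['great product', '.', 'did not work']
print(embeddings.shape)    # (3, 8)
```

The empty second sentence is replaced with '.' so that the encoder never receives an empty string, which mirrors the guard in the evaluation cell.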

Learn more

Find more text embedding models on TensorFlow Hub
See also the multilingual Universal Sentence Encoder CMLM model
Check out other Universal Sentence Encoder models

Reference

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020