The CORD-19 Swivel text embedding module from TF-Hub (https://hub.tensorflow.google.cn/tensorflow/cord-19/swivel-128d/3) was built to support researchers analyzing natural-language text related to COVID-19. These embeddings were trained on the titles, authors, abstracts, body texts, and reference titles of articles in the CORD-19 dataset.
In this Colab we will:
- Analyze semantically similar words in the embedding space
- Train a classifier on the SciCite dataset using the CORD-19 embeddings
Setup
import functools
import itertools
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub
from tqdm import trange
Analyze the embeddings
Let's start by analyzing the embeddings: we compute and plot a correlation matrix between different terms. If the embeddings have learned to successfully capture the meaning of different words, the embedding vectors of semantically similar words should be close together. Let's take a look at some COVID-19 related terms.
# Use the inner product between two embedding vectors as the similarity measure
def plot_correlation(labels, features):
corr = np.inner(features, features)
corr /= np.max(corr)
sns.heatmap(corr, xticklabels=labels, yticklabels=labels)
# Generate embeddings for some terms
queries = [
# Related viruses
'coronavirus', 'SARS', 'MERS',
# Regions
'Italy', 'Spain', 'Europe',
# Symptoms
'cough', 'fever', 'throat'
]
module = hub.load('https://hub.tensorflow.google.cn/tensorflow/cord-19/swivel-128d/3')
embeddings = module(queries)
plot_correlation(queries, embeddings)
We can see that the embeddings have successfully captured the meaning of the different terms. Each word is similar to the other words of its cluster (i.e. 'coronavirus' highly correlates with 'SARS' and 'MERS'), while it differs from terms of other clusters (i.e. the similarity between 'SARS' and 'Spain' is close to 0).
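As a quick numerical check of the clusters above, here is a minimal sketch (not part of the original notebook; it reuses the `module` object and NumPy imported earlier) comparing the similarity of a within-cluster pair against an across-cluster pair:
# Sketch: compare the similarity of a within-cluster pair ('SARS'/'MERS')
# against an across-cluster pair ('SARS'/'Spain').
check_terms = ['SARS', 'MERS', 'Spain']
check_vecs = module(check_terms).numpy()
unit = lambda v: v / np.linalg.norm(v)
print('SARS vs MERS:  %.3f' % np.inner(unit(check_vecs[0]), unit(check_vecs[1])))
print('SARS vs Spain: %.3f' % np.inner(unit(check_vecs[0]), unit(check_vecs[2])))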
Now let's see how we can use these embeddings to solve a specific task.
SciCite: Citation Intent Classification
This section shows how one can use the embeddings for downstream tasks such as text classification. We'll use the SciCite dataset from TensorFlow Datasets to classify citation intents in academic papers. Given a sentence with a citation from an academic paper, we classify whether the main intent of the citation is background information, use of methods, or comparing results.
builder = tfds.builder(name='scicite')
builder.download_and_prepare()
train_data, validation_data, test_data = builder.as_dataset(
split=('train', 'validation', 'test'),
as_supervised=True)
Let's take a look at a few labeled examples from the training set.
NUM_EXAMPLES = 10
TEXT_FEATURE_NAME = builder.info.supervised_keys[0]
LABEL_NAME = builder.info.supervised_keys[1]
def label2str(numeric_label):
m = builder.info.features[LABEL_NAME].names
return m[numeric_label]
data = next(iter(train_data.batch(NUM_EXAMPLES)))
pd.DataFrame({
TEXT_FEATURE_NAME: [ex.numpy().decode('utf8') for ex in data[0]],
LABEL_NAME: [label2str(x) for x in data[1]]
})
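Before training, it is worth checking how balanced the three intent classes are. The following is a small sketch (not part of the original notebook; it assumes the `builder`, `train_data`, and `LABEL_NAME` objects defined above):
# Sketch: count how often each citation-intent label occurs in the training split.
import collections
label_names = builder.info.features[LABEL_NAME].names
label_counts = collections.Counter(int(label) for _, label in train_data)
for idx, name in enumerate(label_names):
    print('%s: %d' % (name, label_counts[idx]))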
Training a citation intent classifier
We'll train a classifier on the SciCite dataset using Keras. Let's build a model which uses the CORD-19 embeddings with a classification layer on top.
Hyperparameters
EMBEDDING = 'https://hub.tensorflow.google.cn/tensorflow/cord-19/swivel-128d/3'
TRAINABLE_MODULE = False  # keep the embeddings frozen; only the classifier head is trained

# Wrap the TF-Hub module as a Keras layer that maps raw strings to 128-d embeddings
hub_layer = hub.KerasLayer(EMBEDDING, input_shape=[],
                           dtype=tf.string, trainable=TRAINABLE_MODULE)
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(3))
model.summary()
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
keras_layer (KerasLayer)     (None, 128)               17301632
_________________________________________________________________
dense (Dense)                (None, 3)                 387
=================================================================
Total params: 17,302,019
Trainable params: 387
Non-trainable params: 17,301,632
_________________________________________________________________
Train and evaluate the model
Let's train and evaluate the model to see how it performs on the SciCite task.
EPOCHS = 35
BATCH_SIZE = 32
history = model.fit(train_data.shuffle(10000).batch(BATCH_SIZE),
epochs=EPOCHS,
validation_data=validation_data.batch(BATCH_SIZE),
verbose=1)
Epoch 1/35
257/257 [==============================] - 1s 5ms/step - loss: 0.8688 - accuracy: 0.5978 - val_loss: 0.7558 - val_accuracy: 0.7041
Epoch 2/35
257/257 [==============================] - 1s 4ms/step - loss: 0.6813 - accuracy: 0.7278 - val_loss: 0.6609 - val_accuracy: 0.7336
Epoch 3/35
257/257 [==============================] - 1s 4ms/step - loss: 0.6146 - accuracy: 0.7580 - val_loss: 0.6197 - val_accuracy: 0.7587
...
Epoch 34/35
257/257 [==============================] - 1s 4ms/step - loss: 0.4992 - accuracy: 0.8007 - val_loss: 0.5472 - val_accuracy: 0.7882
Epoch 35/35
257/257 [==============================] - 1s 4ms/step - loss: 0.4981 - accuracy: 0.8033 - val_loss: 0.5461 - val_accuracy: 0.7849
from matplotlib import pyplot as plt
def display_training_curves(training, validation, title, subplot):
if subplot%10==1: # set up the subplots on the first call
plt.subplots(figsize=(10,10), facecolor='#F0F0F0')
plt.tight_layout()
ax = plt.subplot(subplot)
ax.set_facecolor('#F8F8F8')
ax.plot(training)
ax.plot(validation)
ax.set_title('model '+ title)
ax.set_ylabel(title)
ax.set_xlabel('epoch')
ax.legend(['train', 'valid.'])
display_training_curves(history.history['accuracy'], history.history['val_accuracy'], 'accuracy', 211)
display_training_curves(history.history['loss'], history.history['val_loss'], 'loss', 212)
Evaluate the model
Let's see how the model performs. Two values will be returned: loss (a number representing error, where lower is better) and accuracy.
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
print('%s: %.3f' % (name, value))
4/4 - 0s - loss: 0.5315 - accuracy: 0.7902
loss: 0.532
accuracy: 0.790
We can see that the loss quickly decreases while the accuracy rapidly increases. Let's plot some examples to check how the predictions relate to the true labels:
prediction_dataset = next(iter(test_data.batch(20)))
prediction_texts = [ex.numpy().decode('utf8') for ex in prediction_dataset[0]]
prediction_labels = [label2str(x) for x in prediction_dataset[1]]
predictions = [label2str(x) for x in np.argmax(model.predict(prediction_texts), axis=-1)]
pd.DataFrame({
TEXT_FEATURE_NAME: prediction_texts,
LABEL_NAME: prediction_labels,
'prediction': predictions
})
We can see that for this random sample, the model predicts the correct label most of the time, indicating that it can embed scientific sentences pretty well.
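To look beyond a small sample, a confusion matrix over the full test set shows where the classifier confuses the three intents. This is a minimal sketch that was not part of the original notebook; it assumes the `model`, `test_data`, and `builder` objects defined above and uses `tf.math.confusion_matrix`:
# Sketch: aggregate predictions over the whole test set and print a
# confusion matrix (rows are true labels, columns are predicted labels).
all_labels, all_preds = [], []
for texts, labels in test_data.batch(512):
    logits = model.predict(texts)
    all_preds.extend(np.argmax(logits, axis=-1))
    all_labels.extend(labels.numpy())
cm = tf.math.confusion_matrix(all_labels, all_preds, num_classes=3)
print(builder.info.features[LABEL_NAME].names)
print(cm.numpy())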
What's next
Now that you've learned more about the CORD-19 Swivel embeddings from TF-Hub, we encourage you to participate in the CORD-19 Kaggle competition to contribute to gaining scientific insights from COVID-19 related academic texts.
- Participate in the CORD-19 Kaggle Challenge
- Learn more about the COVID-19 Open Research Dataset (CORD-19)
- See the documentation and find out more about the TF-Hub embeddings at https://hub.tensorflow.google.cn/tensorflow/cord-19/swivel-128d/3
- Explore the CORD-19 embedding space with the TensorFlow Embedding Projector (a small export sketch follows this list)
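For the last item, the Embedding Projector expects embeddings and labels as tab-separated files. Below is a minimal export sketch (not part of the original notebook; the term list and file names are arbitrary illustrations) that reuses the `module` loaded earlier:
# Sketch: export a few term embeddings to TSV files that the Embedding
# Projector (https://projector.tensorflow.org) can load. The term list and
# file names here are arbitrary.
projector_terms = ['coronavirus', 'SARS', 'MERS', 'cough', 'fever']
projector_vecs = module(projector_terms).numpy()
with open('vectors.tsv', 'w') as vec_file, open('metadata.tsv', 'w') as meta_file:
    for term, vec in zip(projector_terms, projector_vecs):
        vec_file.write('\t'.join('%.6f' % x for x in vec) + '\n')
        meta_file.write(term + '\n')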