
# Word embeddings

## Represent text as numbers

### Encode each word with a unique number

• Integer encoding is arbitrary (it does not capture any relationship between words).

• Integer encoding can be challenging for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between the similarity of any two words and the similarity of their encodings, this feature-weight combination is not meaningful.
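The two points above can be seen in a minimal sketch (with a hypothetical toy vocabulary): the assigned IDs carry no information about word similarity.

```python
# Hypothetical toy vocabulary: IDs are assigned arbitrarily.
vocab = {"cat": 1, "dog": 2, "mat": 3, "sat": 4}

def encode(sentence):
    """Map each word to its (arbitrary) integer ID."""
    return [vocab[word] for word in sentence.split()]

ids = encode("cat sat mat")
print(ids)  # [1, 4, 3]

# "cat" (1) and "dog" (2) are semantically close, but their IDs are no
# closer than "cat" (1) and "mat" (3): distances between raw IDs are
# meaningless, which is why a single learned weight per ID is meaningless.
assert abs(vocab["cat"] - vocab["dog"]) == abs(vocab["cat"] - vocab["mat"]) - 1
```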

## Setup

```python
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_datasets as tfds
tfds.disable_progress_bar()
```

## Using the Embedding layer

Keras makes it easy to use word embeddings. Let's take a look at the Embedding layer.

```python
embedding_layer = layers.Embedding(1000, 5)
```

```python
result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()
```
```array([[-0.03292269, -0.02773439, -0.0442404 , -0.0350192 , -0.00886874],
[-0.0472883 , -0.0225469 , -0.03131614, -0.01208062, -0.03646743],
[ 0.01017697, -0.02376412,  0.04766024,  0.02681856, -0.00058727]],
dtype=float32)
```

```python
result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
```
```TensorShape([2, 3, 5])
```
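Conceptually, the lookup above is equivalent to multiplying one-hot vectors by a `(vocab_size, embedding_dim)` weight matrix. A plain NumPy sketch (random weights standing in for the layer's trained weights):

```python
import numpy as np

# Random stand-in for an Embedding layer's (vocab_size, embedding_dim) weights.
rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 5
weights = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

indices = np.array([1, 2, 3])
lookup = weights[indices]  # what the Embedding layer does: row indexing

# Mathematically identical: one-hot encode, then matrix-multiply.
one_hot = np.eye(vocab_size, dtype=np.float32)[indices]
matmul = one_hot @ weights

assert lookup.shape == (3, 5)
assert np.allclose(lookup, matmul)
```

The row-indexing form is what makes embeddings efficient: no one-hot vectors are ever materialized.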

## Learning embeddings from scratch

```python
(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k',
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True, as_supervised=True)
```
```WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)
```

```python
encoder = info.features['text'].encoder
encoder.subwords[:20]
```
```['the_',
', ',
'. ',
'a_',
'and_',
'of_',
'to_',
's_',
'is_',
'br',
'in_',
'I_',
'that_',
'this_',
'it_',
' /><',
' />',
'was_',
'The_',
'as_']
```
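A subword encoder like this can represent words that never appeared whole in training by composing shorter pieces. The following toy greedy longest-match tokenizer is illustrative only (it is not the actual `SubwordTextEncoder` algorithm, and the subword list is made up):

```python
# Hypothetical subword inventory; '_' marks a word boundary, as in the
# tfds encoder's convention shown above.
SUBWORDS = ["the_", "ten", "sor", "flow", "t", "e", "n", "s", "o", "r",
            "f", "l", "w", "_"]

def tokenize(text):
    """Greedily take the longest subword that prefixes the remaining text."""
    text = text.replace(" ", "_")
    pieces = []
    while text:
        match = max((s for s in SUBWORDS if text.startswith(s)), key=len)
        pieces.append(match)
        text = text[len(match):]
    return pieces

print(tokenize("tensorflow "))  # ['ten', 'sor', 'flow', '_']
```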

```python
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)
```

```python
train_batch, train_labels = next(iter(train_batches))
train_batch.numpy()
```
```array([[ 878, 1459,  610, ...,    0,    0,    0],
[  62,   32,   18, ...,    0,    0,    0],
[6691,  246, 3271, ...,    0,    0,    0],
...,
[  12,   81,  641, ..., 7961, 3388, 7975],
[  12,   31,  853, ...,    0,    0,    0],
[  62,   32,   18, ...,    0,    0,    0]])
```
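The trailing zeros in the batch above come from padding: `padded_batch` pads every sequence in a batch with 0 (the padding ID) up to the batch's longest sequence. A plain-Python sketch of that behavior:

```python
def pad_batch(sequences):
    """Pad each list of token IDs with 0 up to the batch's longest length."""
    longest = max(len(s) for s in sequences)
    return [s + [0] * (longest - len(s)) for s in sequences]

batch = pad_batch([[878, 1459, 610], [62, 32], [6691]])
print(batch)  # [[878, 1459, 610], [62, 32, 0], [6691, 0, 0]]
```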

### Create a simple model

• Next, the Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains. The vectors add a dimension to the output array; the resulting dimensions are `(batch, sequence, embedding)`.

• Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length in the simplest way possible.

• This fixed-length output vector is piped through a fully connected (Dense) layer with 16 hidden units.

• The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability (or confidence level) that the review is positive.
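The pooling step above is just a mean over the sequence axis. A NumPy sketch (ignoring masking, which the Keras layer can additionally apply):

```python
import numpy as np

# Fake activations with the (batch, sequence, embedding) shape described above.
batch, sequence, embedding = 2, 7, 16
activations = np.random.default_rng(1).normal(size=(batch, sequence, embedding))

# What GlobalAveragePooling1D computes (without masking): collapse the
# sequence axis, yielding one fixed-length vector per example.
pooled = activations.mean(axis=1)
assert pooled.shape == (batch, embedding)
```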

```python
embedding_dim = 16

model = keras.Sequential([
    layers.Embedding(encoder.vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(16, activation='relu'),
    layers.Dense(1)
])

model.summary()
```
```Model: "sequential"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
embedding_1 (Embedding)     (None, None, 16)          130960

global_average_pooling1d (G  (None, 16)               0
lobalAveragePooling1D)

dense (Dense)               (None, 16)                272

dense_1 (Dense)             (None, 1)                 17

=================================================================
Total params: 131,249
Trainable params: 131,249
Non-trainable params: 0
_________________________________________________________________
```
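The parameter counts in the summary can be verified by hand (the vocabulary size 8185 comes from the dataset's subword encoder, as shown later by `weights.shape`):

```python
vocab_size, embedding_dim = 8185, 16

embedding_params = vocab_size * embedding_dim  # one vector per token
dense_params = 16 * 16 + 16                    # weights + biases, 16 -> 16
output_params = 16 * 1 + 1                     # weights + bias, 16 -> 1

assert embedding_params == 130_960
assert dense_params == 272
assert output_params == 17
assert embedding_params + dense_params + output_params == 131_249
```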

### Compile and train the model

```python
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)
```
```Epoch 1/10
2500/2500 [==============================] - 10s 4ms/step - loss: 0.5114 - accuracy: 0.6913 - val_loss: 0.3791 - val_accuracy: 0.8150
Epoch 2/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.2866 - accuracy: 0.8816 - val_loss: 0.3461 - val_accuracy: 0.8700
Epoch 3/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.2293 - accuracy: 0.9088 - val_loss: 0.4528 - val_accuracy: 0.8450
Epoch 4/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.2025 - accuracy: 0.9219 - val_loss: 0.3455 - val_accuracy: 0.8650
Epoch 5/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1781 - accuracy: 0.9339 - val_loss: 0.4126 - val_accuracy: 0.8600
Epoch 6/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1596 - accuracy: 0.9416 - val_loss: 0.3470 - val_accuracy: 0.9150
Epoch 7/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1476 - accuracy: 0.9479 - val_loss: 0.3088 - val_accuracy: 0.8650
Epoch 8/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1323 - accuracy: 0.9538 - val_loss: 0.4188 - val_accuracy: 0.8800
Epoch 9/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1233 - accuracy: 0.9560 - val_loss: 0.5700 - val_accuracy: 0.8750
Epoch 10/10
2500/2500 [==============================] - 9s 4ms/step - loss: 0.1126 - accuracy: 0.9599 - val_loss: 0.3705 - val_accuracy: 0.8700
```

```python
import matplotlib.pyplot as plt

history_dict = history.history

acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

plt.figure(figsize=(12, 9))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

plt.figure(figsize=(12, 9))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.ylim((0.5, 1))
plt.show()
```

## Retrieve the learned embeddings

```python
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)  # shape: (vocab_size, embedding_dim)
```
```(8185, 16)
```
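One common use of the retrieved matrix is finding nearest neighbours by cosine similarity. The sketch below uses random weights as a stand-in for the trained `weights` array, so the neighbours it returns are meaningless; with trained weights, similar rows correspond to similar subwords.

```python
import numpy as np

# Random stand-in for the trained (8185, 16) embedding matrix.
rng = np.random.default_rng(42)
weights = rng.normal(size=(8185, 16)).astype(np.float32)

def nearest(index, k=5):
    """Indices of the k rows most cosine-similar to row `index`."""
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    sims = normed @ normed[index]            # cosine similarity to every row
    return np.argsort(-sims)[1:k + 1]        # drop the row itself (sim = 1)

neighbours = nearest(100)
assert neighbours.shape == (5,)
assert 100 not in neighbours
```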

```python
import io

encoder = info.features['text'].encoder

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

for num, word in enumerate(encoder.subwords):
    vec = weights[num + 1]  # skip 0, it's padding.
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
out_v.close()
out_m.close()
```

```python
# When running in Colab, download the two files to the local machine.
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')
```

## Visualize the embeddings

• Upload the two files we created above: `vecs.tsv` and `meta.tsv`

## Next steps

• To learn about recurrent networks, see the Keras RNN Guide.

• To learn more about text classification (including the overall workflow, and if you are curious about when to use embeddings vs. one-hot encodings), we recommend this practical text classification guide.
