This text classification tutorial trains a recurrent neural network on the IMDB large movie review dataset for sentiment analysis.

Setup
import tensorflow_datasets as tfds
import tensorflow as tf
Import matplotlib and create a helper function to plot graphs:
import matplotlib.pyplot as plt
def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()
Set up the input pipeline

The IMDB large movie review dataset is a binary classification dataset; all the reviews have either a positive or a negative sentiment.

Download the dataset using TFDS.
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
WARNING:absl:TFDS datasets with text encoding are deprecated and will be removed in a future version. Instead, you should use the plain text version and tokenize the text using `tensorflow_text` (See: https://www.tensorflow.org/tutorials/tensorflow_text/intro#tfdata_example)
The dataset info includes the encoder (tfds.features.text.SubwordTextEncoder).
encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
Vocabulary size: 8185
This text encoder will reversibly encode any string, falling back to byte-encoding if necessary.
sample_string = 'Hello TensorFlow.'
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))
Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]
The original string: "Hello TensorFlow."
assert original_string == sample_string
for index in encoded_string:
  print('{} ----> {}'.format(index, encoder.decode([index])))
4025 ----> Hell
222 ----> o
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
7975 ----> .
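To see the byte-level fallback mentioned above, you can encode a string with characters outside the subword vocabulary; it still round-trips exactly. A quick sketch (the sample string here is arbitrary, not from the original notebook):

# Characters the subword vocabulary doesn't cover are encoded byte by
# byte, so decoding still reproduces the original string exactly.
oov_string = 'Héllo TensorFlow! 🙂'
encoded_oov = encoder.encode(oov_string)
assert encoder.decode(encoded_oov) == oov_string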
Prepare the data for training

Next, create batches of these encoded strings. Use the padded_batch method to zero-pad the sequences to the length of the longest string in the batch:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)
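Each batch is padded independently, so the padded length varies from batch to batch. You can confirm this by peeking at a couple of batches (a quick check, not part of the original notebook):

for example_batch, label_batch in train_dataset.take(2):
  # The second dimension is the longest sequence in that batch,
  # so it generally differs between batches.
  print('Batch shape:', example_batch.shape, 'Labels shape:', label_batch.shape)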
Create the model

Build a tf.keras.Sequential model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices into sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
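To make that equivalence concrete, an embedding lookup returns the same values as multiplying a one-hot vector by the embedding matrix. A minimal sketch with hypothetical tiny sizes:

import numpy as np

vocab_size, embed_dim = 8, 4
embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
indices = tf.constant([1, 3, 5])

lookup_result = embedding(indices)  # direct index lookup, shape (3, 4)

# The equivalent (but slower) route: one-hot encode, then multiply
# by the same embedding matrix.
one_hot = tf.one_hot(indices, depth=vocab_size)       # shape (3, 8)
matmul_result = tf.matmul(one_hot, embedding.embeddings)

print(np.allclose(lookup_result.numpy(), matmul_result.numpy()))  # True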
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the output from one timestep to their input on the next timestep.
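A minimal sketch of that handoff using a SimpleRNNCell (sizes are hypothetical, purely for illustration):

cell = tf.keras.layers.SimpleRNNCell(4)
state = [tf.zeros((1, 4))]           # initial hidden state, batch of 1
for t in range(3):                   # iterate over 3 timesteps
  x_t = tf.random.normal((1, 8))     # input at timestep t
  output, state = cell(x_t, state)   # the state carries to the next step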
The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the outputs. This helps the RNN learn long-range dependencies.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
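Because the Bidirectional wrapper concatenates the forward and backward LSTM outputs, the recurrent layer emits 128 features (64 per direction). A quick way to sanity-check the shapes before training (the fake batch below is arbitrary, not from the original notebook):

# Run one tiny fake batch through the untrained model. This builds the
# layer weights and shows that the output is a single logit per example.
sample_batch = tf.constant([[1, 2, 3, 0, 0]])
print(model(sample_batch).shape)  # (1, 1)
model.summary()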
Note that a Keras sequential model is used here because all the layers in the model have a single input and produce a single output. If you want to use a stateful RNN layer, you might want to build your model with the Keras functional API or model subclassing instead, so that you can retrieve and reuse the RNN layer states. See the Keras RNN guide for more details.

Compile the Keras model to configure the training process:
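For reference, the same stack can be expressed with the functional API, which is the natural starting point if you later need access to RNN states (a sketch only; it is not used in the rest of this tutorial):

inputs = tf.keras.Input(shape=(None,), dtype='int64')  # variable-length token ids
x = tf.keras.layers.Embedding(encoder.vocab_size, 64)(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1)(x)
functional_model = tf.keras.Model(inputs, outputs)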
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
Train the model
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 53s 123ms/step - loss: 0.6715 - accuracy: 0.5157 - val_loss: 0.5733 - val_accuracy: 0.7245
Epoch 2/10
391/391 [==============================] - 50s 125ms/step - loss: 0.3973 - accuracy: 0.8309 - val_loss: 0.3611 - val_accuracy: 0.8562
Epoch 3/10
391/391 [==============================] - 50s 126ms/step - loss: 0.2736 - accuracy: 0.8927 - val_loss: 0.3334 - val_accuracy: 0.8516
Epoch 4/10
391/391 [==============================] - 50s 126ms/step - loss: 0.2226 - accuracy: 0.9161 - val_loss: 0.3484 - val_accuracy: 0.8760
Epoch 5/10
391/391 [==============================] - 49s 125ms/step - loss: 0.1959 - accuracy: 0.9285 - val_loss: 0.3301 - val_accuracy: 0.8687
Epoch 6/10
391/391 [==============================] - 50s 127ms/step - loss: 0.1708 - accuracy: 0.9377 - val_loss: 0.3414 - val_accuracy: 0.8771
Epoch 7/10
391/391 [==============================] - 50s 126ms/step - loss: 0.1514 - accuracy: 0.9464 - val_loss: 0.3593 - val_accuracy: 0.8677
Epoch 8/10
391/391 [==============================] - 50s 125ms/step - loss: 0.1357 - accuracy: 0.9538 - val_loss: 0.4185 - val_accuracy: 0.8740
Epoch 9/10
391/391 [==============================] - 49s 125ms/step - loss: 0.1263 - accuracy: 0.9564 - val_loss: 0.4133 - val_accuracy: 0.8620
Epoch 10/10
391/391 [==============================] - 49s 125ms/step - loss: 0.1149 - accuracy: 0.9621 - val_loss: 0.4104 - val_accuracy: 0.8573
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 20s 52ms/step - loss: 0.4349 - accuracy: 0.8540
Test Loss: 0.43485820293426514
Test Accuracy: 0.8539999723434448
The model above does not mask the padding applied to the sequences. This can lead to skew if you train on padded sequences and test on un-padded sequences. Ideally you would use masking to avoid this, but as you can see below, it only has a small effect on the output.

If the prediction is >= 0.5, it is positive; otherwise it is negative.
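One way to add the masking mentioned above is the Embedding layer's mask_zero argument; downstream layers such as LSTM then skip the padded (id 0) timesteps. A sketch only; the models in this tutorial do not use it:

# Hypothetical variant of the embedding layer with masking enabled.
masked_embedding = tf.keras.layers.Embedding(
    encoder.vocab_size, 64, mask_zero=True)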
def pad_to_size(vec, size):
  # Zero-pad a list of token ids out to a fixed length.
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec
def sample_predict(sample_pred_text, pad):
  # Encode the text, optionally pad it, and run the model on a batch of one.
  encoded_sample_pred_text = encoder.encode(sample_pred_text)
  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))
  return predictions
# predict on a sample text without padding.
sample_pred_text = ('The movie was cool. The animation and the graphics '
'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
1/1 [==============================] - 1s 792ms/step
[[-0.14333132]]
# predict on a sample text with padding
sample_pred_text = ('The movie was cool. The animation and the graphics '
'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
1/1 [==============================] - 1s 752ms/step
[[-0.24068204]]
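Because the last Dense layer has no activation and the loss was built with from_logits=True, these raw values are logits; a sigmoid maps them to probabilities (a quick check, not in the original notebook):

# sigmoid(-0.24) ≈ 0.44, i.e. the model is close to the decision boundary.
print(tf.sigmoid(predictions).numpy())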
plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')
Stack two or more LSTM layers
Keras recurrent layers have two available modes that are controlled by the return_sequences constructor argument (the shape check after this list makes the difference concrete):

- Return either the full sequences of successive outputs for each timestep (a 3D tensor of shape (batch_size, timesteps, output_features)).
- Or, return only the last output for each input sequence (a 2D tensor of shape (batch_size, output_features)).
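A standalone sketch of the two modes (the sizes are arbitrary, purely for illustration):

x = tf.random.normal((4, 10, 16))  # (batch, timesteps, features)

print(tf.keras.layers.LSTM(8, return_sequences=True)(x).shape)  # (4, 10, 8)
print(tf.keras.layers.LSTM(8)(x).shape)                         # (4, 8)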
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 101s 241ms/step - loss: 0.6406 - accuracy: 0.5714 - val_loss: 0.4741 - val_accuracy: 0.7552
Epoch 2/10
391/391 [==============================] - 97s 247ms/step - loss: 0.3608 - accuracy: 0.8541 - val_loss: 0.3459 - val_accuracy: 0.8604
Epoch 3/10
391/391 [==============================] - 98s 251ms/step - loss: 0.2662 - accuracy: 0.9014 - val_loss: 0.3523 - val_accuracy: 0.8672
Epoch 4/10
391/391 [==============================] - 98s 251ms/step - loss: 0.2128 - accuracy: 0.9257 - val_loss: 0.3461 - val_accuracy: 0.8630
Epoch 5/10
391/391 [==============================] - 99s 252ms/step - loss: 0.1802 - accuracy: 0.9408 - val_loss: 0.3877 - val_accuracy: 0.8682
Epoch 6/10
391/391 [==============================] - 99s 251ms/step - loss: 0.1578 - accuracy: 0.9518 - val_loss: 0.3993 - val_accuracy: 0.8599
Epoch 7/10
391/391 [==============================] - 98s 250ms/step - loss: 0.1347 - accuracy: 0.9598 - val_loss: 0.4265 - val_accuracy: 0.8578
Epoch 8/10
391/391 [==============================] - 98s 250ms/step - loss: 0.1215 - accuracy: 0.9658 - val_loss: 0.4929 - val_accuracy: 0.8557
Epoch 9/10
391/391 [==============================] - 98s 251ms/step - loss: 0.1152 - accuracy: 0.9674 - val_loss: 0.5007 - val_accuracy: 0.8599
Epoch 10/10
391/391 [==============================] - 98s 251ms/step - loss: 0.0936 - accuracy: 0.9764 - val_loss: 0.5229 - val_accuracy: 0.8505
test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 42s 107ms/step - loss: 0.5224 - accuracy: 0.8460
Test Loss: 0.5224239230155945
Test Accuracy: 0.8459600210189819
# predict on a sample text without padding.
sample_pred_text = ('The movie was not good. The animation and the graphics '
'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
1/1 [==============================] - 1s 1s/step
[[-2.5599978]]
# predict on a sample text with padding
sample_pred_text = ('The movie was not good. The animation and the graphics '
'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)
1/1 [==============================] - 1s 1s/step
[[-2.953724]]
plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')
Check out other existing recurrent layers, such as GRU layers.
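For example, swapping in GRU layers only changes the recurrent layers of the model above (a sketch under the same setup as this tutorial):

gru_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])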
If you are interested in building custom RNNs, see the Keras RNN Guide.