Text generation using an RNN with eager execution


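This tutorial demonstrates how to generate text using a character-based RNN. We use a dataset of Shakespeare's writing from Andrej Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks. Given a sequence of characters from this data ("Shakespear"), we train a model to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly.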

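This tutorial includes runnable code implemented using tf.keras and eager execution. The following is sample output from running the code in this tutorial with the default settings: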

I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

The cause why then we are all resolved more sons.

O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m


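  • The model is character-based. When training started, the model did not know how to spell an English word, or that words were even a unit of text.

  • The structure of the output resembles a play: blocks of text generally begin with a speaker name, in all capital letters like in the dataset.

  • As demonstrated below, the model is trained on small batches of text (100 characters each), yet it is still able to generate a longer sequence of text with coherent structure.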


Import TensorFlow and other libraries

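import tensorflow as tf
# The code in this tutorial targets TensorFlow 1.x,
# where eager execution has to be enabled explicitly
tf.enable_eager_execution()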

import numpy as np
import os
import time



path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1122304/1115394 [==============================] - 0s 0us/step



text = open(path_to_file).read()
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))
Length of text: 1115394 characters
# Take a look at the first 1000 characters in text
print(text[:1000])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.

# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
65 unique characters




# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

Now each character has an integer representation. Note that the characters are mapped to indices from 0 to len(vocab) - 1.

for char,_ in zip(char2idx, range(20)):
    print('{:6s} ---> {:4d}'.format(repr(char), char2idx[char]))
'j'    --->   48
'f'    --->   44
'R'    --->   30
':'    --->   10
'W'    --->   35
';'    --->   11
'o'    --->   53
'b'    --->   40
'K'    --->   23
'L'    --->   24
'O'    --->   27
'h'    --->   46
'm'    --->   51
'u'    --->   59
'H'    --->   20
'z'    --->   64
'!'    --->    2
'S'    --->   31
'N'    --->   26
'Z'    --->   38
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(text[:13], text_as_int[:13]))
First Citizen ---- characters mapped to int ---- > [18 47 56 57 58  1 15 47 58 47 64 43 52]
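
The mapping is reversible: indexing idx2char with the integers gives the original characters back. The following is a minimal sketch of that round trip in plain Python, using a short toy string in place of the full corpus. The names prefixed with sample_ are hypothetical and exist only for this illustration.

```python
# Toy text standing in for the Shakespeare corpus (assumption for illustration)
sample_text = "First Citizen"

# Same construction as in the tutorial, applied to the toy text
sample_vocab = sorted(set(sample_text))
sample_char2idx = {u: i for i, u in enumerate(sample_vocab)}
sample_idx2char = sample_vocab  # a plain list is enough for the reverse lookup

# Characters -> integers -> characters
encoded = [sample_char2idx[c] for c in sample_text]
decoded = ''.join(sample_idx2char[i] for i in encoded)

print(encoded)
print(decoded)
```

The indices here differ from the ones printed above because the toy vocabulary is much smaller than the 65 characters of the real dataset.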



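Since RNNs maintain an internal state that depends on the previously seen elements, the task is: given all the characters computed up to this moment, what is the next character?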


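Divide the text into training examples and targets. Each training example contains seq_length characters from the text. The corresponding target contains the same length of text, shifted one character to the right. For example, if seq_length is 4 and our text is "Hello", the training example is "Hell" and the target is "ello".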
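
The "Hello" example can be reproduced with a few lines of plain Python. This is only a sketch of the idea, without tf.data; the helper names toy_make_chunks and toy_split are hypothetical.

```python
def toy_make_chunks(sequence, seq_length):
    """Cut sequence into pieces of seq_length + 1 items, dropping a short remainder."""
    size = seq_length + 1
    n_chunks = len(sequence) // size
    return [sequence[i * size:(i + 1) * size] for i in range(n_chunks)]

def toy_split(chunk):
    """Return (input, target), where target is the input shifted by one position."""
    return chunk[:-1], chunk[1:]

toy_chunks = toy_make_chunks("Hello", seq_length=4)
toy_pairs = [toy_split(c) for c in toy_chunks]
print(toy_pairs)  # [('Hell', 'ello')]
```

Each chunk needs seq_length + 1 characters because the input and the target overlap in all but one position.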

Break the text into chunks, each seq_length+1 characters long:

# The maximum length sentence we want for a single input in characters
seq_length = 100

# Create training examples / targets
chunks = tf.data.Dataset.from_tensor_slices(text_as_int).batch(seq_length+1, drop_remainder=True)

for item in chunks.take(5):
  print(repr(''.join(idx2char[item.numpy()])))
'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = chunks.map(split_input_target)

Print the input and target text of the first example:

for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))
Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '

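Each index of these vectors is processed as one time step. For the input at time step 0, the model receives the index for "F" (18) and tries to predict the index for "i" (47) as the next character. At time step 1 it does the same thing, but in addition to the current character it also takes the information from the previous step into account.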

for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))
Step    0
  input: 18 ('F')
  expected output: 47 ('i')
Step    1
  input: 47 ('i')
  expected output: 56 ('r')
Step    2
  input: 56 ('r')
  expected output: 57 ('s')
Step    3
  input: 57 ('s')
  expected output: 58 ('t')
Step    4
  input: 58 ('t')
  expected output: 1 (' ')

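Creating batches and shuffling them using tf.data

We used tf.data to split the text into chunks. Before feeding this data into the model, we need to shuffle it and pack it into batches.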

# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (tf.data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
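
The buffer idea described in the comment can be sketched in plain Python. This is a simplified illustration, not the tf.data implementation; buffered_shuffle and toy_batch are hypothetical helpers.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # the new element takes its slot
    rng.shuffle(buffer)                  # input exhausted: emit what is left
    for item in buffer:
        yield item

def toy_batch(items, batch_size):
    """Group items into lists of batch_size, dropping a short final batch."""
    items = list(items)
    n_batches = len(items) // batch_size
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
batches = toy_batch(shuffled, batch_size=3)
print(shuffled)
print(batches)
```

With a buffer of size 1 the order is unchanged, and a buffer at least as large as the dataset gives a full shuffle. The buffer size therefore trades memory for how thoroughly the data is mixed.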



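Use the tf.keras model subclassing API to create the model, then change it as needed. We use three layers to define the model:

  • Embedding layer: a trainable lookup table that maps the number of each character to a vector with embedding_dim dimensions;
  • GRU layer: a type of RNN whose layer size equals the number of units. (You could also use an LSTM layer here.)
  • Dense layer: with vocab_size units.
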
class Model(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units):
    super(Model, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

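    # return_sequences=True gives an output at every time step;
    # stateful=True keeps the hidden state between batches
    if tf.test.is_gpu_available():
      self.gru = tf.keras.layers.CuDNNGRU(self.units,
                                          return_sequences=True,
                                          recurrent_initializer='glorot_uniform',
                                          stateful=True)
    else:
      self.gru = tf.keras.layers.GRU(self.units,
                                     return_sequences=True,
                                     recurrent_activation='sigmoid',
                                     recurrent_initializer='glorot_uniform',
                                     stateful=True)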
    self.fc = tf.keras.layers.Dense(vocab_size)

  def call(self, x):
    embedding = self.embedding(x)

    # output at every time step
    # output shape == (batch_size, seq_length, hidden_size)
    output = self.gru(embedding)

    # The dense layer outputs predictions (logits) for every time step
    # output shape after the dense layer == (batch_size, seq_length, vocab_size)
    prediction = self.fc(output)

    # The recurrent layer keeps its hidden state internally between calls
    # (see model.reset_states() in the training loop), so only the
    # predictions are returned
    return prediction


# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
units = 1024

model = Model(vocab_size, embedding_dim, units)

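We use the Adam optimizer with default arguments and softmax cross entropy as the loss function. This loss is a good fit because we train the model to predict the next character, and the set of characters is discrete (similar to a classification problem).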

# Using adam optimizer with default arguments
optimizer = tf.train.AdamOptimizer()

# Using sparse_softmax_cross_entropy so that we don't have to create one-hot vectors
def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
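
To make the loss more concrete, here is a plain-Python sketch of sparse softmax cross entropy for a single time step. It illustrates the formula only and is not the TensorFlow implementation; the function name toy_sparse_softmax_cross_entropy is hypothetical.

```python
import math

def toy_sparse_softmax_cross_entropy(label, logits):
    """Cross entropy for one time step: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# An untrained model that assigns equal scores to all 65 characters
uniform_logits = [0.0] * 65
uniform_loss = toy_sparse_softmax_cross_entropy(18, uniform_logits)
print(round(uniform_loss, 4))  # 4.1744, which is ln(65)

# A model that is confident in the correct character has a loss near zero
confident_loss = toy_sparse_softmax_cross_entropy(0, [10.0, 0.0, 0.0])
print(confident_loss)
```

The value ln(65) ≈ 4.1744 is close to the first loss logged during training (4.1749), which is what one would expect from a model that starts out guessing almost uniformly among the 65 characters.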


Here we use a custom training loop with GradientTape. To learn more about this approach, see the eager execution guide.

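  • First, initialize the model's hidden state with zeros and shape (batch size, number of RNN units). In the code below, this is done by calling model.reset_states().

  • Next, iterate over the dataset batch by batch and compute the predictions and hidden states associated with each input.

  • Several interesting things happen during training:

    • The model gets a hidden state (initialized to 0, call it H0) and the first batch of input text (call it I0).
    • The model then returns the predictions P1 and the hidden state H1.
    • For the next batch of input, the model receives I1 and H1.
    • The interesting part is that H1 is passed to the model together with I1, which is how the model learns: the context learned from batch to batch is carried in the hidden state.
    • This repeats until the dataset is exhausted. Then a new epoch starts and the process is repeated.
  • After calculating the predictions, compute the loss using the loss function defined above. Then calculate the gradients of the loss with respect to the model variables.

  • Finally, take a training step with the help of the optimizer, using its apply_gradients function.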
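
The way the hidden state travels from one batch to the next can be shown with a toy recurrent cell in plain Python. This is not a GRU and it involves no training; ToyStatefulCell and its weights are hypothetical and only illustrate the H0, H1 bookkeeping described above.

```python
import math

class ToyStatefulCell:
    """Scalar recurrent cell that keeps its hidden state between calls."""

    def __init__(self, w_in=0.5, w_rec=0.8):
        self.w_in = w_in
        self.w_rec = w_rec
        self.state = 0.0              # H0: the state starts at zero

    def reset_states(self):
        self.state = 0.0

    def __call__(self, inputs):
        outputs = []
        for x in inputs:
            # the new state depends on the current input and the previous state
            self.state = math.tanh(self.w_in * x + self.w_rec * self.state)
            outputs.append(self.state)
        return outputs

cell = ToyStatefulCell()
first = cell([1.0, 0.0])      # I0 with H0 = 0; the cell now holds H1
second = cell([1.0, 0.0])     # same input, but processed starting from H1
cell.reset_states()           # start of a new epoch
again = cell([1.0, 0.0])      # identical to the first call

print(first, second, again)
```

The same input produces different outputs in the first and second call because the second call starts from the state left behind by the first. After reset_states() the outputs match the first call again.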


model.build(tf.TensorShape([BATCH_SIZE, seq_length]))
model.summary()
Layer (type)                 Output Shape              Param #
embedding (Embedding)        multiple                  16640
gru (GRU)                    multiple                  3935232
dense (Dense)                multiple                  66625
Total params: 4,018,497
Trainable params: 4,018,497
Non-trainable params: 0
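
The parameter counts in the summary can be checked by hand. The sketch below assumes a GRU with three gates and a single bias vector per gate, which reproduces the number reported for the GRU layer; the helper functions are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # one embedding_dim-sized vector per character
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # three gates, each with an input kernel, a recurrent kernel and a bias
    return 3 * (input_dim * units + units * units + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus one bias per output unit
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, units = 65, 256, 1024

counts = {
    'embedding': embedding_params(vocab_size, embedding_dim),
    'gru': gru_params(embedding_dim, units),
    'dense': dense_params(units, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

Almost all of the parameters sit in the GRU layer, so changing the number of units has by far the largest effect on model size.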
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")


# Number of passes over the dataset
EPOCHS = 5

# Training loop
for epoch in range(EPOCHS):
    start = time.time()

    # Reset the hidden state at the start of every epoch.
    # reset_states() returns None; the state is kept inside the layer.
    model.reset_states()

    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # The hidden state left by the previous batch is reused here
            # This is the interesting step
            predictions = model(inp)
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables))

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1, batch, loss))

    # saving (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        model.save_weights(checkpoint_prefix)

    print('Epoch {} Loss {:.4f}'.format(epoch+1, loss))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Epoch 1 Batch 0 Loss 4.1749
Epoch 1 Batch 100 Loss 2.3127
Epoch 1 Loss 2.1260
Time taken for 1 epoch 609.6597940921783 sec

Epoch 2 Batch 0 Loss 2.1049
Epoch 2 Batch 100 Loss 1.8819
Epoch 2 Loss 1.7667
Time taken for 1 epoch 612.5043184757233 sec

Epoch 3 Batch 0 Loss 1.7645
Epoch 3 Batch 100 Loss 1.6853
Epoch 3 Loss 1.6164
Time taken for 1 epoch 610.0756878852844 sec

Epoch 4 Batch 0 Loss 1.6491
Epoch 4 Batch 100 Loss 1.5350
Epoch 4 Loss 1.5071
Time taken for 1 epoch 609.8330454826355 sec

Epoch 5 Batch 0 Loss 1.4715
Epoch 5 Batch 100 Loss 1.4685
Epoch 5 Loss 1.4042
Time taken for 1 epoch 608.6753587722778 sec




!ls {checkpoint_dir}
checkpoint  ckpt.data-00000-of-00001  ckpt.index
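# Rebuild the model and load the trained weights from the latest checkpoint
model = Model(vocab_size, embedding_dim, units)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

# Batch size 1 and variable sequence length for text generation
model.build(tf.TensorShape([1, None]))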



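  • Begin by choosing a start string, initializing the hidden state, and setting the number of characters to generate.

  • Get the predictions using the start string and the hidden state.

  • Then sample from a multinomial distribution over the predictions to get the index of the predicted character, and use this predicted character as the next input to the model.

  • The hidden state returned by the model is fed back into the model, so that it now has more context than a single character. After the next character is predicted, the updated hidden state is fed back in again. In this way each prediction is conditioned on the characters generated so far.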
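
The sampling step, including the temperature setting used in the code below, can be sketched in plain Python. This is an illustration of the idea only: the helper names are hypothetical, and the toy logits stand in for the model's output at the last time step.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 5.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])

print(sample_index(toy_logits, temperature=1.0, rng=random.Random(0)))
```

A low temperature concentrates the probability on the highest-scoring character, which makes the text more predictable. A high temperature flattens the distribution and makes the text more surprising.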


# Evaluation step (generating text using the learned model)

# Number of characters to generate
num_generate = 1000

# You can change the start string to experiment
start_string = 'Q'

# Converting our start string to numbers (vectorizing)
input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)

# Empty string to store our results
text_generated = []

# Low temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.
temperature = 1.0

# Evaluation loop.

# Here batch size == 1
model.reset_states()
for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a multinomial distribution to sample the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.multinomial(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model;
    # the previous hidden state is kept inside the recurrent layer
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

print(start_string + ''.join(text_generated))
If a body.
But I would me your lood.
Steak ungrace and as this only in the ploaduse,
his they, much you amed on't.

Hearn' thousand as your well, and obepional.

Can wathach this wam a discure that braichal heep itspose,
Teparmate confoim it: never knor sheep, so litter
Plarence? He,
But thou sunds a parmon servection:
Occh Rom o'ld him sir;
madish yim,
I'll surm let as hand upherity

Why do I sering their stumble; the thank emo'st yied
Baunted unpluction; the main, sir, What's a meanulainst
Even worship tebomn slatued of his name,
Manisholed shorks you go?

We look thus then impare'd least itsiby drumes,
That I, what!
Nurset, fell beshee that which I will
to the near-Volshing upon this aguin against fless
Is done untlein with is the neck,
Thands he shall fear'ds; let me love at officed:
Where else to her awticions, as you hall, my lord.

I will been another one our accuser less
Tiold, methought to the presench of consiar

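The easiest way to improve the results is to train the model for longer (try EPOCHS=30).

You can also experiment with a different start string, try adding another RNN layer to improve the model's accuracy, or adjust the temperature parameter to generate more or less random predictions.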