Create an n-step transition from a trajectory with `T = N + 1` frames.
```python
tf_agents.trajectories.to_n_step_transition(
    trajectory: tf_agents.trajectories.Trajectory,
    gamma: tf_agents.typing.types.Float
) -> tf_agents.trajectories.Transition
```
The output transition's `next_time_step.{reward, discount}` will contain N-step discounted reward and discount values calculated as:
```
next_time_step.reward = r_t +
                        g^{1} * d_t * r_{t+1} +
                        g^{2} * d_t * d_{t+1} * r_{t+2} +
                        g^{3} * d_t * d_{t+1} * d_{t+2} * r_{t+3} +
                        ...
                        g^{N-1} * d_t * ... * d_{t+N-2} * r_{t+N-1}

next_time_step.discount = g^{N-1} * d_t * d_{t+1} * ... * d_{t+N-1}
```
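For concreteness, here is a minimal NumPy sketch of the same calculation for a single (unbatched) trajectory with `T = 4` frames; the reward and discount values are made up for illustration:

```python
import numpy as np

# Illustrative single trajectory with T = 4 frames, so N = 3.
gamma = 0.9
reward = np.array([1.0, 2.0, 3.0, 0.0])    # r_t .. r_{t+3}; last frame is unused
discount = np.array([1.0, 1.0, 0.5, 1.0])  # d_t .. d_{t+3}; last frame is unused

r = reward[:-1]    # r_t, r_{t+1}, r_{t+2}
d = discount[:-1]  # d_t, d_{t+1}, d_{t+2}
N = len(r)

# next_time_step.reward = sum_n g^n * (d_t * ... * d_{t+n-1}) * r_{t+n}
n_step_reward = sum(gamma**n * np.prod(d[:n]) * r[n] for n in range(N))

# next_time_step.discount = g^(N-1) * d_t * ... * d_{t+N-1}
n_step_discount = gamma**(N - 1) * np.prod(d)

print(n_step_reward)    # 1.0 + 0.9*2.0 + 0.81*3.0 = 5.23
print(n_step_discount)  # 0.81 * 0.5 = 0.405
```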
In Python notation:

```python
discount = gamma**(N-1) * reduce_prod(trajectory.discount[:, :-1])
reward = discounted_return(
    rewards=trajectory.reward[:, :-1],
    discounts=gamma * trajectory.discount[:, :-1])
```
When `trajectory.discount[:, :-1]` is an all-ones tensor, this is equivalent to:

```
next_time_step.discount = (
    gamma**(N-1) * tf.ones_like(trajectory.discount[:, 0]))
next_time_step.reward = (
    sum_{n=0}^{N-1} gamma**n * trajectory.reward[:, n])
```
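As a quick illustrative check of that simplification (assumed values, single trajectory):

```python
import numpy as np

# With unit per-step discounts, the N-step reward is a plain geometric sum.
gamma = 0.9
r = np.array([1.0, 2.0, 3.0])  # trajectory.reward[:-1] for one trajectory
d = np.ones_like(r)            # trajectory.discount[:-1] is all ones
N = len(r)

n_step_reward = sum(gamma**n * np.prod(d[:n]) * r[n] for n in range(N))
assert np.isclose(n_step_reward, sum(gamma**n * r[n] for n in range(N)))  # 5.23

n_step_discount = gamma**(N - 1) * np.prod(d)
assert np.isclose(n_step_discount, gamma**(N - 1))  # 0.81
```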
Returns |
---|
An N-step `Transition` where `N = T - 1`. The reward and discount in `time_step.{reward, discount}` are NaN. The n-step discounted reward and final discount are stored in `next_time_step.{reward, discount}`. All tensors in the `Transition` have shape `[B, ...]` (no time dimension). |
Raises | |
---|---|
`ValueError` | if `discount.shape.rank != 2`.
`ValueError` | if `discount.shape[1] < 2`.
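Below is a minimal usage sketch (not from the original docs) that builds a batched `Trajectory` by hand and converts it to a 3-step `Transition`; the shapes, values, and the `trajectory_lib` alias are illustrative assumptions:

```python
import tensorflow as tf
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import trajectory as trajectory_lib

B, T = 2, 4  # batch of 2 trajectories, T = 4 frames each -> N = 3
step_type = tf.fill([B, T], ts.StepType.MID)

traj = trajectory_lib.Trajectory(
    step_type=step_type,
    observation=tf.random.uniform([B, T, 5]),
    action=tf.zeros([B, T], dtype=tf.int32),
    policy_info=(),
    next_step_type=step_type,
    reward=tf.random.uniform([B, T]),
    discount=tf.ones([B, T]))

transition = trajectory_lib.to_n_step_transition(traj, gamma=0.99)
print(transition.next_time_step.reward.shape)    # (2,)  -- no time dimension
print(transition.next_time_step.discount.shape)  # (2,)
```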