Computes generalized advantage estimation (GAE).
tf_agents.utils.value_ops.generalized_advantage_estimation(
values, final_value, discounts, rewards, td_lambda=1.0, time_major=True
)
For theory, see
"High-Dimensional Continuous Control Using Generalized Advantage Estimation"
by John Schulman, Philipp Moritz et al.
See https://arxiv.org/abs/1506.02438 for the full paper.
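For orientation, the estimator from the paper can be written as a backward recursion over temporal-difference residuals. The block below is a sketch under stated assumptions, not a statement of how the library computes it: r_t, gamma_t and V_t are assumed to correspond to `rewards[t]`, `discounts[t]` and `values[t]`, V_T to `final_value`, lambda to `td_lambda`, and T to the number of steps per trajectory. The per-step discount gamma_t stands in for the constant gamma used in the paper.

```latex
\[
\begin{aligned}
\delta_t &= r_t + \gamma_t V_{t+1} - V_t, \qquad t = 0, \dots, T-1, \\
\hat{A}_t &= \delta_t + \gamma_t \lambda \, \hat{A}_{t+1}, \qquad \hat{A}_T = 0 .
\end{aligned}
\]
```

With lambda = 0 the recursion collapses to \hat{A}_t = \delta_t, and with lambda = 1 it sums all discounted residuals from step t to the end of the trajectory.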
Define abbreviations:
  (B) batch size representing number of trajectories
  (T) number of steps per trajectory
Args:
  values: Tensor with shape [T, B] representing value estimates.
  final_value: Tensor with shape [B] representing the value estimate at t=T.
  discounts: Tensor with shape [T, B] representing discounts received by
    following the behavior policy.
  rewards: Tensor with shape [T, B] representing rewards received by
    following the behavior policy.
  td_lambda: A float32 scalar in the range [0, 1] that controls the
    bias-variance trade-off of the estimate. Values near 0 rely on the
    one-step temporal-difference residual (lower variance, more bias);
    values near 1 accumulate residuals over the rest of the trajectory
    (less bias, higher variance). Defaults to 1.0.
  time_major: A boolean indicating whether input tensors are time major.
    If False, input tensors have shape [B, T].
Returns:
  A tensor with shape [T, B] representing advantages. Shape is [B, T] when
  not time_major.
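To make the shapes and the role of td_lambda concrete, here is a minimal pure-Python sketch of the backward recursion described in the paper. It is not the TF-Agents implementation: `gae_reference` is a hypothetical helper name, it works on nested lists instead of tensors, and it assumes time-major inputs of shape [T][B]. Batch-major data would need to be transposed first.

```python
def gae_reference(values, final_value, discounts, rewards, td_lambda=1.0):
    """Illustrative GAE over time-major nested lists of shape [T][B].

    Hypothetical helper for explanation only; not part of tf_agents.
    """
    num_steps = len(values)        # T
    batch_size = len(final_value)  # B
    advantages = [[0.0] * batch_size for _ in range(num_steps)]

    next_value = list(final_value)       # V_{t+1}, starts as the value at t=T
    next_advantage = [0.0] * batch_size  # A_{t+1}, starts at zero

    # Walk backward in time so each step can reuse the advantage of the next.
    for t in reversed(range(num_steps)):
        for b in range(batch_size):
            # One-step temporal-difference residual.
            delta = rewards[t][b] + discounts[t][b] * next_value[b] - values[t][b]
            # Residual plus the discounted, lambda-weighted future advantage.
            advantages[t][b] = delta + discounts[t][b] * td_lambda * next_advantage[b]
        next_value = list(values[t])
        next_advantage = advantages[t]
    return advantages


# Example with T=3 steps and B=2 trajectories.
values = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
final_value = [4.0, 0.0]
discounts = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
rewards = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

print(gae_reference(values, final_value, discounts, rewards, td_lambda=0.5))
# → [[1.125, 1.3125], [0.5, 1.25], [0.0, 1.0]]
```

Setting td_lambda=0.0 in the same example returns only the one-step residuals, [[1.0, 1.0], [0.5, 1.0], [0.0, 1.0]], which shows how smaller values shorten the horizon over which residuals are accumulated.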