For details, see
"Reinforcement Learning: An Introduction" Second Edition
by Richard S. Sutton and Andrew G. Barto
Define abbreviations:
B: batch size representing number of trajectories.
T: number of steps per trajectory. This is equal to N - n in the equation
above.
Args
rewards
Tensor with shape [T, B] (or [T]) representing rewards.
discounts
Tensor with shape [T, B] (or [T]) representing discounts.
final_value
  (Optional.) Tensor with shape [B] (or [1]) representing the value
  estimate at time T. Defaults to an all-zeros tensor. When set, it
  allows the final value to bootstrap the return computation.
time_major
A boolean indicating whether input tensors are time major. False
means input tensors have shape [B, T].
provide_all_returns
  A boolean; if True, the discounted return at every time step is
  provided along the time dimension; if False, only the single
  complete discounted return is given.
Returns
If provide_all_returns:
A tensor with shape [T, B] (or [T]) representing the discounted
returns. The shape is [B, T] when not time_major.
If not provide_all_returns:
A tensor with shape [B] (or []) representing the discounted returns.
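As a sketch of the computation described above, the discounted returns can be accumulated with a backward pass over time, bootstrapping from final_value at step T. This is a minimal NumPy illustration for time-major inputs, not the library's actual implementation; the function name and signature here are assumptions.

```python
import numpy as np

def discounted_return(rewards, discounts, final_value=None,
                      provide_all_returns=True):
    """Compute discounted returns for time-major rewards/discounts.

    rewards, discounts: arrays of shape [T] or [T, B].
    final_value: optional value estimate at time T, shape [B] (or scalar).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = np.asarray(discounts, dtype=np.float64)
    if final_value is None:
        # Default: an all-zeros tensor, i.e. no bootstrapping.
        final_value = np.zeros_like(rewards[0])
    returns = np.zeros_like(rewards)
    acc = np.asarray(final_value, dtype=np.float64)
    # Walk backward in time: G_t = r_t + gamma_t * G_{t+1}.
    for t in reversed(range(rewards.shape[0])):
        acc = rewards[t] + discounts[t] * acc
        returns[t] = acc
    if provide_all_returns:
        return returns        # shape [T] or [T, B]
    return returns[0]         # the single complete discounted return
```

For example, with rewards [1, 1, 1] and a constant discount of 0.9, the returns are [2.71, 1.9, 1.0]; passing a nonzero final_value adds a bootstrapped tail term discounted through every step.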