Computes discounted return.

Q_n = sum_{n'=n}^N gamma^(n'-n) * r_{n'} + gamma^(N-n+1)*final_value.

For details, see "Reinforcement Learning: An Introduction", Second Edition, by Richard S. Sutton and Andrew G. Barto.
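As a small worked example of the equation above, the sketch below evaluates the sum term by term for a constant discount gamma. The helper name `closed_form_return` and the sample numbers are illustrative assumptions, not part of the documented function.

```python
# Hypothetical helper for illustration only: evaluates
# Q_n = sum_{n'=n}^N gamma^(n'-n) * r_{n'} + gamma^(N-n+1) * final_value
# directly, assuming a single constant discount gamma.
def closed_form_return(rewards, gamma, final_value, n):
    last = len(rewards) - 1  # N in the equation (index of the last reward)
    total = sum(gamma ** (k - n) * rewards[k] for k in range(n, last + 1))
    return total + gamma ** (last - n + 1) * final_value


rewards = [1.0, 2.0, 3.0]  # r_0, r_1, r_2, so N = 2
returns = [closed_form_return(rewards, 0.5, 4.0, n) for n in range(3)]
print(returns)  # [3.25, 4.5, 5.0]
```

For n = 0 the terms are 1 + 0.5 * 2 + 0.25 * 3 + 0.125 * 4 = 3.25, where the last term is the bootstrapped final value.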

Define abbreviations:

B: batch size, i.e. the number of trajectories.
T: number of steps per trajectory. Taking n as the first step, this equals N - n + 1 in the equation above, since the sum runs over n' = n, ..., N inclusive.

Args:

rewards: Tensor with shape [T, B] (or [T]) representing rewards.
discounts: Tensor with shape [T, B] (or [T]) representing discounts.
final_value: (Optional.) Tensor with shape [B] (or [1]) representing the value estimate at step T. Defaults to an all-zeros tensor. When set, it bootstraps the return computation from the final value.
time_major: A boolean indicating whether input tensors are time major. If False, input tensors have shape [B, T].
provide_all_returns: A boolean. If True, returns the discounted return for every step along the time dimension; if False, returns only the single complete discounted return.

Returns:

If provide_all_returns: A tensor with shape [T, B] (or [T]) representing the discounted returns. The shape is [B, T] when not time_major.
If not provide_all_returns: A tensor with shape [B] (or []) representing the discounted returns.
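The argument and return shapes described above can be illustrated with a minimal NumPy sketch. It is not the documented implementation: the name `discounted_return_sketch` is hypothetical, and it assumes the per-step discounts are applied through the backward recursion Q_t = r_t + discount_t * Q_{t+1}, which reduces to the equation above when every discount equals gamma.

```python
import numpy as np


def discounted_return_sketch(rewards, discounts, final_value=None,
                             time_major=True, provide_all_returns=True):
    """Illustrative re-implementation; assumes Q_t = r_t + d_t * Q_{t+1}."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = np.asarray(discounts, dtype=np.float64)
    if not time_major:
        # [B, T] -> [T, B]; a no-op for 1-D inputs of shape [T].
        rewards, discounts = rewards.T, discounts.T

    tail_shape = rewards.shape[1:]  # (B,) for batched input, () for shape [T]
    if final_value is None:
        acc = np.zeros(tail_shape)
    else:
        # Accepts shape [B], or [1] when the inputs have shape [T].
        acc = np.array(final_value, dtype=np.float64).reshape(tail_shape)

    returns = np.empty_like(rewards)
    # Walk backwards in time, accumulating the discounted return.
    for t in reversed(range(rewards.shape[0])):
        acc = rewards[t] + discounts[t] * acc
        returns[t] = acc

    if not provide_all_returns:
        return returns[0]  # return from the first step only, shape [B] or []
    return returns if time_major else returns.T


r = np.array([[1.0, 2.0, 3.0]])  # batch-major input: B = 1, T = 3
d = np.full_like(r, 0.5)
all_returns = discounted_return_sketch(r, d, final_value=[4.0], time_major=False)
first_only = discounted_return_sketch(r, d, final_value=[4.0], time_major=False,
                                      provide_all_returns=False)
print(all_returns.shape, first_only.shape)
```

With these inputs, `all_returns` holds 3.25, 4.5 and 5.0 in a [B, T] = [1, 3] array, and `first_only` holds the single value 3.25 with shape [B] = [1].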