Computes discounted return.
```python
tf_agents.utils.value_ops.discounted_return(
    rewards,
    discounts,
    final_value=None,
    time_major=True,
    provide_all_returns=True
)
```
```
Q_n = sum_{n'=n}^{N} gamma^(n'-n) * r_{n'} + gamma^(N-n+1) * final_value
```
For details, see *Reinforcement Learning: An Introduction*, Second Edition, by Richard S. Sutton and Andrew G. Barto.
Abbreviations:

- `B`: batch size, i.e. the number of trajectories.
- `T`: number of steps per trajectory. This is equal to `N - n` in the equation above.
Returns:

- If `provide_all_returns`: a tensor of shape `[T, B]` (or `[T]`) representing the discounted returns. The shape is `[B, T]` when not `time_major`.
- If not `provide_all_returns`: a tensor of shape `[B]` (or `[]`) representing the discounted return at the first step only.
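The formula above can be computed with a backward recursion, `Q_n = r_n + gamma_n * Q_{n+1}` with `Q_{N+1} = final_value`. Below is a minimal pure-Python sketch of that recursion for a single time-major trajectory; it illustrates the math, and is not the TF-Agents implementation (which operates on batched tensors).

```python
def discounted_return_sketch(rewards, discounts, final_value=0.0,
                             provide_all_returns=True):
    """Illustrative version of the discounted-return recursion.

    rewards, discounts: sequences of length T for one trajectory.
    """
    acc = final_value
    returns = []
    # Backward recursion: Q_n = r_n + gamma_n * Q_{n+1}, seeded with final_value.
    for r, g in zip(reversed(rewards), reversed(discounts)):
        acc = r + g * acc
        returns.append(acc)
    returns.reverse()
    # Mirrors provide_all_returns: full [T] sequence vs. the first-step return.
    return returns if provide_all_returns else returns[0]

# With rewards [1, 1, 1] and a constant discount of 0.9 the returns are
# [1 + 0.9*1.9, 1 + 0.9*1, 1], i.e. approximately [2.71, 1.9, 1.0].
print(discounted_return_sketch([1.0, 1.0, 1.0], [0.9, 0.9, 0.9]))
```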