Computes discounted return.
```python
tf_agents.utils.value_ops.discounted_return(
    rewards,
    discounts,
    final_value=None,
    time_major=True,
    provide_all_returns=True
)
```
```
Q_n = sum_{n'=n}^N gamma^(n'-n) * r_{n'} + gamma^(N-n+1) * final_value
```
For details, see "Reinforcement Learning: An Introduction," Second Edition, by Richard S. Sutton and Andrew G. Barto.
Define abbreviations:

- `B`: batch size, representing the number of trajectories.
- `T`: number of steps per trajectory. This is equal to `N - n` in the equation above.
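For intuition, here is a minimal sketch (not the library's implementation) of how the formula unrolls for a single trajectory with a constant discount of 0.9 and a bootstrapped `final_value`; the reward values and the loop are illustrative assumptions.

```python
import numpy as np

# Illustrative values (assumptions): a single trajectory of T = 3 rewards,
# a constant discount gamma = 0.9, and a bootstrap value estimate of 10.0.
rewards = np.array([1.0, 1.0, 1.0])
gamma = 0.9
final_value = 10.0

# Backward recursion equivalent to the closed form above:
#   Q_t = r_t + gamma * Q_{t+1}, seeded with final_value.
returns = np.zeros_like(rewards)
next_return = final_value
for t in reversed(range(len(rewards))):
    next_return = rewards[t] + gamma * next_return
    returns[t] = next_return

# Q_0 = 1 + 0.9*1 + 0.9^2*1 + 0.9^3*10 = 10.0 exactly, and the same
# reasoning gives 10.0 at every step of this toy trajectory.
print(returns)  # [10. 10. 10.]
```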
Args | |
---|---|
`rewards` | Tensor with shape `[T, B]` (or `[T]`) representing rewards.
`discounts` | Tensor with shape `[T, B]` (or `[T]`) representing discounts.
`final_value` | (Optional.) Tensor with shape `[B]` (or `[1]`) representing the value estimate at `T`; defaults to an all-zeros tensor. When set, it allows the final value to bootstrap the return computation.
`time_major` | A boolean indicating whether the input tensors are time major. `False` means the input tensors have shape `[B, T]`.
`provide_all_returns` | A boolean; if `True`, returns are provided for every step along the time dimension; if `False`, only the single complete discounted return is provided.
Returns | |
---|---|
If `provide_all_returns`: | A tensor with shape `[T, B]` (or `[T]`) representing the discounted returns. The shape is `[B, T]` when not `time_major`.
If not `provide_all_returns`: | A tensor with shape `[B]` (or `[]`) representing the complete discounted return for each trajectory.
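A minimal usage sketch follows, based only on the signature documented above; the tensor values are assumptions chosen so the expected output is easy to check by hand against the formula.

```python
import tensorflow as tf
from tf_agents.utils import value_ops

# Assumed toy data: B = 2 trajectories, T = 3 steps, time-major layout [T, B].
rewards = tf.constant([[1.0, 2.0],
                       [1.0, 2.0],
                       [1.0, 2.0]])
discounts = tf.constant([[0.9, 0.9],
                         [0.9, 0.9],
                         [0.9, 0.9]])
final_value = tf.constant([10.0, 20.0])  # value estimate at T, shape [B]

# All per-step returns, shape [T, B].
all_returns = value_ops.discounted_return(
    rewards, discounts, final_value=final_value, time_major=True)
# For the first trajectory: Q_0 = 1 + 0.9*1 + 0.9^2*1 + 0.9^3*10 = 10.0.

# Only the complete return from t = 0, shape [B].
total_return = value_ops.discounted_return(
    rewards, discounts, final_value=final_value,
    time_major=True, provide_all_returns=False)
```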