Reward Signal


The goal of the RL Agent is to get as much reward as possible. At every timestep \(t\), the RL Agent receives a scalar feedback signal \(R_t\) from the Environment, and we wish to maximize the cumulative sum of this signal.
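As a tiny sketch (with made-up reward values), the cumulative reward from a given timestep onward is just the sum of the remaining rewards in the episode:

#+begin_src python
# Minimal sketch: the return from timestep t is the sum of rewards R_t, R_{t+1}, ...
# The reward values below are made up for illustration.
rewards = [0.0, 0.0, 1.0, -0.5, 2.0]

def undiscounted_return(rewards, t):
    """Sum of all rewards from timestep t (0-indexed) to the end of the episode."""
    return sum(rewards[t:])

print(undiscounted_return(rewards, 0))  # 2.5 -> total reward over the whole episode
print(undiscounted_return(rewards, 3))  # 1.5 -> reward still obtainable from step 3
#+end_src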

This is the reward hypothesis:

All goals can be described by the maximization of expected cumulative reward

"Is this statement really valid though?" David Silver asks…

We can define the reward to be given at any point: a positive/negative reward after every action, or only after a sequence of actions. In more extreme cases, the reward comes only at the end of the game (whether we win or lose in chess, for example). Another strategy is simply to give a reward equal to the change in the game's score. Defining a reward function actually turns out to be a challenging task in its own right for RL.
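A hedged sketch of two of the reward definitions mentioned above; the environment fields (game over, won, score) are hypothetical and only meant to show the shape of each scheme:

#+begin_src python
def sparse_reward(game_over: bool, won: bool) -> float:
    """Reward only at the end of the episode, e.g. chess: +1 for a win, -1 for a loss, 0 otherwise."""
    if not game_over:
        return 0.0
    return 1.0 if won else -1.0

def score_delta_reward(prev_score: float, new_score: float) -> float:
    """Reward equal to the change in the game's score after the last action."""
    return new_score - prev_score
#+end_src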

Rewards can also be delayed, so we cannot simply be greedy and chase short-term rewards. It may be better to forgo a small immediate reward in order to receive a much larger reward in the future.
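A small numeric sketch (made-up reward sequences) of why greed can lose: the greedy path collects its reward immediately, but the patient path ends up with a higher cumulative sum.

#+begin_src python
greedy_path  = [1.0, 0.0, 0.0]   # grab the small short-term reward now
patient_path = [0.0, 0.0, 10.0]  # forgo it and collect a larger reward later

def total_return(rewards):
    """Cumulative sum of rewards over the episode."""
    return sum(rewards)

print(total_return(greedy_path))   # 1.0
print(total_return(patient_path))  # 10.0 -> better cumulative reward despite the delay
#+end_src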

Created: 2021-11-13
