# RL Agent

At every step, our agent observes the RL Environment and the Reward Signal and must take an action. The agent can only take actions that the RL Environment exposes.

At each timestep $$t$$, the agent:

• performs action $$a_t$$
• receives a new observation $$o_t$$
• receives a new reward $$r_t$$
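
The three steps above can be sketched as a simple interaction loop. This is only an illustration: `CoinFlipEnv`, `RandomAgent`, `step`, `act` and `run` are hypothetical names invented for this note, not part of any particular RL library.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a coin flip."""

    actions = (0, 1)  # the only actions this environment exposes

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        """Take the agent's action, return (observation, reward)."""
        coin = self.rng.choice(self.actions)
        observation = coin                      # o_t: the flipped coin
        reward = 1.0 if action == coin else 0.0  # r_t: 1 for a correct guess
        return observation, reward


class RandomAgent:
    """Hypothetical agent that picks uniformly among the exposed actions."""

    def __init__(self, actions, seed=1):
        self.actions = actions
        self.rng = random.Random(seed)

    def act(self):
        return self.rng.choice(self.actions)


def run(env, agent, steps):
    """Run the loop for `steps` timesteps and return the (a_t, o_t, r_t) triples."""
    trajectory = []
    for _ in range(steps):
        a_t = agent.act()            # agent performs action a_t
        o_t, r_t = env.step(a_t)     # agent receives o_t and r_t
        trajectory.append((a_t, o_t, r_t))
    return trajectory


trajectory = run(CoinFlipEnv(), RandomAgent(CoinFlipEnv.actions), steps=5)
```

Each element of `trajectory` is one timestep's triple, in the same action, observation, reward order used in the rest of these notes.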

## 1. Agent History

Since the agent receives these values at every timestep, we can define the history as the sequence of actions, observations and rewards that the agent has seen so far.

$H_t = a_1, o_1, r_1, \dots, a_t, o_t, r_t$

(All actions, observations and rewards up to time $$t$$)
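
As a rough sketch, the history can be stored as an append-only list of per-timestep triples. The `Step` and `History` names below are hypothetical, chosen only for illustration, and integer actions and observations are an assumption.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    """One timestep of experience (hypothetical container)."""
    action: int
    observation: int
    reward: float


class History:
    """Append-only record of everything the agent has seen so far."""

    def __init__(self):
        self.steps: List[Step] = []

    def append(self, action, observation, reward):
        self.steps.append(Step(action, observation, reward))

    def __len__(self):
        return len(self.steps)

    def flatten(self):
        """Return a_1, o_1, r_1, ..., a_t, o_t, r_t as one flat list."""
        return [value for step in self.steps for value in step]


history = History()
history.append(action=1, observation=0, reward=0.0)
history.append(action=0, observation=0, reward=1.0)
print(history.flatten())  # [1, 0, 0.0, 0, 0, 1.0]
```

Note that the list grows by one triple per timestep, so its size is unbounded over a long run.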

## 2. Agent State

The agent could consult this history to make a decision. However, the raw history is not very practical to work with: it grows with every timestep and can quickly become far too much data. Instead, it is better to construct a State that summarizes the history of the system up to that point. Formally,

$S_t = f(H_t)$

The function $$f$$ here is simply a mapping from $$H_t$$ to $$S_t$$. It is up to us to design this mapping.
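
To make the idea of designing $$f$$ concrete, here are a few possible choices, written as a hedged sketch. The function names and the representation of the history as a list of `(action, observation, reward)` tuples are assumptions made for this example; none of them is claimed to be the right design for a given problem.

```python
from typing import List, Tuple

Step = Tuple[int, int, float]  # (action, observation, reward), assumed layout


def last_observation(history: List[Step]):
    """f(H_t) = o_t: keep only the most recent observation."""
    return history[-1][1] if history else None


def last_k_observations(history: List[Step], k: int = 3):
    """f(H_t) = (o_{t-k+1}, ..., o_t): a short sliding window."""
    return tuple(o for _, o, _ in history[-k:])


def total_reward(history: List[Step]):
    """f(H_t) = r_1 + ... + r_t: a single running statistic."""
    return sum(r for _, _, r in history)


history = [(1, 0, 0.0), (0, 0, 1.0), (1, 1, 1.0), (0, 1, 0.0)]
print(last_observation(history))           # 1
print(last_k_observations(history, k=2))   # (1, 1)
print(total_reward(history))               # 2.0
```

Each choice throws away a different part of the history, which is the trade-off involved in designing the mapping: a smaller state is cheaper to work with but may discard information the agent needs.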

The agent can maintain its own state $$S^a_t$$, built from what it has seen. We build it from $$H_t$$ because that is all the data the agent is given. The agent uses this state to make decisions.

## 3. Major Components of the RL Agent

The agent typically includes the following pieces:

The agent doesn't have to use all of these components, but it will typically use some of them. We can classify agents by which components they have. They fall into the following types:

Each of these agent types can also have an internal Environment Model.

Created: 2021-11-13
