---
title: "policy"
type: concept
related: [Policy, Current, Markov Decision Process, Utility Function, Utility Theory]
source: https://www.jemoka.com/posts/kbhpolicy/
confidence: high
status: active
---

constituents
the history: past states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)
requirements
typically, a policy maps the history to an action:
\begin{equation} a_{t} = \pi_{t}(h_{t}) \end{equation}
For a Markov Decision Process, our past states are d-separated from our current action given the current state, so really we have \(\pi_{t}(s_{t})\)
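As a minimal sketch of the two signatures above, assume a made-up two-state problem; the state names, action names, and both function names are hypothetical and not from the source. A history-based policy in an MDP can then simply delegate to a policy on the current state:

```python
from typing import List, Tuple

# h_t = (s_{1:t}, a_{1:t-1}): t states and t-1 actions (toy string labels, assumed)
History = Tuple[List[str], List[str]]


def markov_policy(state: str) -> str:
    """Hypothetical pi_t(s_t): decides from the current state alone."""
    return "feed" if state == "hungry" else "ignore"


def history_policy(history: History) -> str:
    """Hypothetical pi_t(h_t): receives the whole history."""
    states, actions = history
    if len(states) != len(actions) + 1:
        raise ValueError("expected t states and t-1 actions")
    # In an MDP the current state is all the policy needs,
    # so everything except s_t is discarded.
    return markov_policy(states[-1])


h_t: History = (["sated", "sated", "hungry"], ["ignore", "ignore"])
action_from_history = history_policy(h_t)
action_from_state = markov_policy(h_t[0][-1])
```

Both calls pick the same action, which is the practical meaning of replacing \(\pi_{t}(h_{t})\) with \(\pi_{t}(s_{t})\).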
Some policies can be stochastic:
\begin{equation} P(a_{t} \mid h_{t}) = \pi_{t}(a_{t} \mid h_{t}) \end{equation}
Instead of telling you what to do at a specific point, it gives the probability of choosing \(a_{t}\) given the history.
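One way to picture a stochastic policy is as a function that returns a distribution over actions, from which an action is then sampled. The sketch below is illustrative only: the probabilities, state and action names, and function names are invented, and for brevity it conditions on the current state rather than the full history (which the MDP case allows).

```python
import random
from typing import Dict


def stochastic_policy(state: str) -> Dict[str, float]:
    """Hypothetical pi(a | s): maps each action to its probability."""
    if state == "hungry":
        return {"feed": 0.9, "ignore": 0.1}
    return {"feed": 0.2, "ignore": 0.8}


def sample_action(dist: Dict[str, float], rng: random.Random) -> str:
    """Draw one action according to the probabilities in dist."""
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the sketch reproducible
dist = stochastic_policy("hungry")
action = sample_action(dist, rng)
```

A deterministic policy is the special case where one action has probability 1.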
additional information
stationary policy
For infinite-horizon models, our policy does not need to track how many time steps are left (i.e. we are not optimizing within some box of constrained time), and therefore we don't really care about historical actions either. So we can drop the time index:
\begin{equation} \pi(s) \end{equation}
This can be used in infinite-horizon models with a stationary Markov Decision Process (one whose transition and reward models do not change over time).
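The difference between a time-indexed policy \(\pi_{t}(s)\) and a stationary policy \(\pi(s)\) can be sketched with plain lookup tables. Everything here (states, actions, the three-step horizon, and the choice to stop feeding on the last step) is an invented illustration, not something taken from the source.

```python
from typing import Dict, List

# Finite horizon: one decision rule per time step, pi_t(s).
time_indexed: List[Dict[str, str]] = [
    {"hungry": "feed", "sated": "ignore"},    # t = 0
    {"hungry": "feed", "sated": "ignore"},    # t = 1
    {"hungry": "ignore", "sated": "ignore"},  # t = 2, assumed last step
]

# Infinite horizon: a single rule pi(s), reused at every step.
stationary: Dict[str, str] = {"hungry": "feed", "sated": "ignore"}


def act_finite(t: int, state: str) -> str:
    """The action may change with t even when the state is the same."""
    return time_indexed[t][state]


def act_stationary(t: int, state: str) -> str:
    """The time step is accepted but deliberately ignored."""
    return stationary[state]
```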
optimal policy
\begin{equation} \pi^{*}(s) = \arg\max_{\pi} U^{\pi}(s) \end{equation}
"the optimal policy is the policy that maximizes the expected utility of following \(\pi\) when starting from \(s\)"
We call the utility from the best policy the "optimal value function":
\begin{equation} U^{*} = U^{\pi^{*}} \end{equation}
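For a very small problem, the \(\arg\max\) over policies can be taken literally by enumerating every deterministic stationary policy and evaluating each one. The sketch below does this under several assumptions that are not in this note: an invented two-state model (`T`, `R`), a discount factor `GAMMA`, and an iterative evaluation of \(U^{\pi}\) using the usual one-step lookahead. It is meant to illustrate the definition, not to be a practical solver, since the number of policies grows exponentially with the number of states.

```python
from itertools import product

STATES = ["hungry", "sated"]
ACTIONS = ["feed", "ignore"]
GAMMA = 0.9  # assumed discount factor, not from the source

# Invented toy model: T[s][a] maps next state -> probability, R[s][a] is the reward.
T = {
    "hungry": {"feed": {"sated": 1.0}, "ignore": {"hungry": 1.0}},
    "sated": {"feed": {"sated": 1.0}, "ignore": {"hungry": 0.1, "sated": 0.9}},
}
R = {
    "hungry": {"feed": -5.0, "ignore": -10.0},
    "sated": {"feed": -5.0, "ignore": 0.0},
}


def evaluate(policy, sweeps=500):
    """Approximate U^pi by repeatedly applying the one-step lookahead."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {
            s: R[s][policy[s]]
            + GAMMA * sum(p * U[s2] for s2, p in T[s][policy[s]].items())
            for s in STATES
        }
    return U


# Enumerate every deterministic stationary policy (|A|^|S| = 4 of them here).
policies = [dict(zip(STATES, choice)) for choice in product(ACTIONS, repeat=len(STATES))]
utilities = [evaluate(pi) for pi in policies]

# In a discounted MDP some policy is best at every state at once,
# so ranking by the summed utility is enough to find it here.
best = max(range(len(policies)), key=lambda i: sum(utilities[i].values()))
pi_star, U_star = policies[best], utilities[best]
```

Here `U_star` plays the role of \(U^{*} = U^{\pi^{*}}\): it is at least as large as the utility of every other enumerated policy at every state.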
policy utility, and value
creating a good utility function: either policy evaluation or value iteration
creating a policy from a utility function: value-function policy ("choose the policy that takes the best valued action")
calculating the utility function a policy currently uses: use policy evaluation
See policy evaluation
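The value-function policy idea ("take the best valued action") can be sketched as a greedy one-step lookahead over a given utility function. The model (`T`, `R`), the discount factor `GAMMA`, the utility estimates in `U`, and the function names are all assumptions made up for this illustration.

```python
STATES = ["hungry", "sated"]
ACTIONS = ["feed", "ignore"]
GAMMA = 0.9  # assumed discount factor, not from the source

# Invented toy model: T[s][a] maps next state -> probability, R[s][a] is the reward.
T = {
    "hungry": {"feed": {"sated": 1.0}, "ignore": {"hungry": 1.0}},
    "sated": {"feed": {"sated": 1.0}, "ignore": {"hungry": 0.1, "sated": 0.9}},
}
R = {
    "hungry": {"feed": -5.0, "ignore": -10.0},
    "sated": {"feed": -5.0, "ignore": 0.0},
}


def lookahead(U, s, a):
    """Immediate reward plus discounted expected utility of the next state."""
    return R[s][a] + GAMMA * sum(p * U[s2] for s2, p in T[s][a].items())


def value_function_policy(U):
    """Greedy policy: in each state, pick the action with the best lookahead."""
    return {s: max(ACTIONS, key=lambda a: lookahead(U, s, a)) for s in STATES}


# Illustrative utility estimates (assumed, e.g. produced by some earlier evaluation).
U = {"hungry": -8.7, "sated": -4.1}
pi = value_function_policy(U)
```

The quality of the extracted policy depends entirely on how good `U` is, which is why the utility function is usually computed first.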