@jemoka / Jemoka Knowledge Base / wiki/concepts/policy.md
---
title: "policy"
type: concept
related: [Policy, Current, Markov Decision Process, Utility Function, Utility Theory]
source: https://www.jemoka.com/posts/kbhpolicy/
confidence: high
status: active
---

## constituents

the history: past states and actions \(h_{t} = (s_{1:t}, a_{1:t-1})\)

## requirements

typically:

\begin{equation}
a_{t} = \pi_{t}(h_{t})
\end{equation}

for a Markov Decision Process, past states and actions are d-separated from the current action given the current state, so the policy only needs the current state: \(\pi_{t}(s_{t})\)

Some policies can be stochastic:

\begin{equation}
P(a_{t}) = \pi_{t}(a_{t} | h_{t})
\end{equation}

instead of telling you what to do at a specific point, a stochastic policy gives the probability of choosing \(a_{t}\) given the history.

## additional information

### stationary policy

For infinite-horizon models, the policy does not need to care about how many time steps are left (i.e. we are not optimizing within some box of constrained time), so it depends neither on the time step nor on historical actions. So we have:

\begin{equation}
\pi(s)
\end{equation}

this can be used in infinite-horizon models with a stationary Markov Decision Process.

### optimal policy

\begin{equation}
\pi^{*}(s) = \arg\max_{\pi} U^{\pi}(s)
\end{equation}

"the optimal policy is the policy that maximizes the expected utility of following \(\pi\) when starting from \(s\)"

We call the utility of the best policy the "optimal value function":

\begin{equation}
U^{*} = U^{\pi^{*}}
\end{equation}

### policy utility, and value

- creating a good utility function: either policy evaluation or value iteration
- creating a policy from a utility function: value-function policy ("choose the policy that takes the best valued action")
- calculating the utility function a policy currently uses: use policy evaluation

See policy evaluation
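
The definitions above can be made concrete with a minimal Python sketch. Everything in it is an illustrative assumption rather than part of this note: the two-state MDP (`STATES`, `ACTIONS`, `T`, `R`), the discount factor `GAMMA`, and the helper names `lookahead`, `policy_evaluation`, `greedy_policy`, and `sample_action`. It assumes a discounted infinite-horizon setting, approximates \(U^{\pi}\) for a stationary policy \(\pi(s)\) by repeated one-step lookahead, extracts a value-function policy from the result, and samples from a stochastic policy \(\pi(a \mid s)\).

```python
import random

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative assumptions.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor (assumed; the note does not specify one)

# T[s][a] maps next state -> probability; R[s][a] is the immediate reward.
T = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "move": 0.0},
    "s1": {"stay": 1.0, "move": 0.0},
}


def lookahead(U, s, a):
    """One-step lookahead: R(s, a) + gamma * sum over s' of T(s' | s, a) * U(s')."""
    return R[s][a] + GAMMA * sum(p * U[s2] for s2, p in T[s][a].items())


def policy_evaluation(pi, sweeps=500):
    """Iteratively approximate U^pi for a stationary deterministic policy pi(s)."""
    U = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        U = {s: lookahead(U, s, pi[s]) for s in STATES}
    return U


def greedy_policy(U):
    """Value-function policy: in each state, pick the action with the best lookahead."""
    return {s: max(ACTIONS, key=lambda a: lookahead(U, s, a)) for s in STATES}


def sample_action(stochastic_pi, s, rng=random):
    """Stochastic policy: draw an action from the distribution pi(a | s)."""
    actions = list(stochastic_pi[s])
    weights = [stochastic_pi[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Usage: evaluate a "stay everywhere" policy, then act greedily on its utility.
pi = {"s0": "stay", "s1": "stay"}
U = policy_evaluation(pi)
better = greedy_policy(U)
```

Under these assumed numbers, staying in `s1` forever is worth roughly \(1 / (1 - 0.9) = 10\), so the greedy policy moves out of `s0` and stays in `s1`. Alternating `policy_evaluation` and `greedy_policy` until the policy stops changing is one common way to search for \(\pi^{*}\), though this sketch does not claim to be the note's own procedure.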