---
title: "MaxQ"
source: https://www.jemoka.com/posts/kbhmaxq/
---

Two Abstractions

- "temporal abstractions": making decisions without considering time, i.e. abstracting time away (MDP)
- "state abstractions": making decisions about groups of states at once

Graph

MaxQ formulates a policy as a graph, which defines a set of \(n\) policies.

Max Node

This is a "policy node", connected to a series of \(Q\) nodes from which it takes the max and propagates down. If we are at a leaf max-node, the actual action is taken and control is passed back to the top of the graph.

Q Node

Each \(Q\) node computes \(Q(S,A)\), the value of taking that node's action.
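
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the class names `MaxNode` and `QNode`, the function `select_primitive`, the per-node `q_table`, and the toy task names are all assumptions made up for the example. Each Q node here simply looks up a stored score, which stands in for whatever \(Q(S,A)\) computation the node performs.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List

State = Hashable


@dataclass
class QNode:
    """Scores one child action of its parent max node (hypothetical structure)."""
    name: str
    child: "MaxNode"  # the action (subtask or primitive) this node scores
    q_table: Dict[State, float] = field(default_factory=dict)

    def q(self, state: State) -> float:
        # Unseen states default to 0.0 (an arbitrary choice for this sketch).
        return self.q_table.get(state, 0.0)


@dataclass
class MaxNode:
    """Policy node: takes the max over its Q children; no children means a leaf."""
    name: str
    q_children: List[QNode] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.q_children


def select_primitive(root: MaxNode, state: State) -> List[str]:
    """Walk down from the root, following the best Q child until a leaf is hit."""
    path = [root.name]
    node = root
    while not node.is_leaf:
        best = max(node.q_children, key=lambda qn: qn.q(state))
        node = best.child
        path.append(node.name)
    return path  # the last entry is the primitive action to execute


# Toy graph: root -> {navigate, pickup}, navigate -> {north, south}
north, south, pickup = MaxNode("north"), MaxNode("south"), MaxNode("pickup")
navigate = MaxNode("navigate", [
    QNode("q_north", north, {"s0": 1.0}),
    QNode("q_south", south, {"s0": 0.2}),
])
root = MaxNode("root", [
    QNode("q_navigate", navigate, {"s0": 2.0}),
    QNode("q_pickup", pickup, {"s0": -1.0}),
])

path = select_primitive(root, "s0")
```

After the leaf action at the end of `path` is executed, control would return to `root` and the walk would repeat from the new state.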

Hierarchical Value Function

\begin{equation} Q(s,a) = V_{a}(s) + C_{i}(s,a) \end{equation}
the value function of the root node is the value obtained over all nodes in the graph
where:
\begin{equation} C_{i}(s,a) = \sum_{s'} P(s'|s,a) V(s') \end{equation}
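
To make the decomposition concrete, here is a small sketch that evaluates \(Q(s,a) = V_{a}(s) + C_{i}(s,a)\) recursively on a toy graph. The table layout, the function names (`value`, `q_value`, `completion_from_model`), and all numbers are hypothetical. It also assumes that the value of a non-leaf node is the max over its children's \(Q\) values and that a leaf's value is a stored expected reward, which matches the max-node description above but is not spelled out in these notes.

```python
from typing import Dict, List, Tuple

# children[i]: actions available at max node i (primitive actions have no entry)
children: Dict[str, List[str]] = {
    "root": ["navigate", "pickup"],
    "navigate": ["north", "south"],
}

# V_a(s) for primitive actions: assumed to be the expected immediate reward
primitive_v: Dict[Tuple[str, str], float] = {
    ("north", "s0"): -1.0,
    ("south", "s0"): -2.0,
    ("pickup", "s0"): -10.0,
}


def completion_from_model(transition: Dict[str, float],
                          next_values: Dict[str, float]) -> float:
    """C_i(s,a) = sum over s' of P(s'|s,a) * V(s')."""
    return sum(p * next_values[s_next] for s_next, p in transition.items())


# C_i(s,a), keyed by (parent node i, state s, action a)
completion: Dict[Tuple[str, str, str], float] = {
    ("navigate", "s0", "north"): 0.5,
    ("navigate", "s0", "south"): 0.0,
    ("root", "s0", "navigate"): completion_from_model(
        {"s1": 0.5, "s2": 0.5}, {"s1": 2.0, "s2": 4.0}
    ),
    ("root", "s0", "pickup"): 0.0,
}


def value(node: str, state: str) -> float:
    """V_node(s): stored reward for a leaf, otherwise max over child Q values."""
    if node not in children:
        return primitive_v.get((node, state), 0.0)
    return max(q_value(node, state, a) for a in children[node])


def q_value(parent: str, state: str, action: str) -> float:
    """Q(s,a) = V_a(s) + C_i(s,a), with i = parent."""
    return value(action, state) + completion.get((parent, state, action), 0.0)


root_value = value("root", "s0")
```

Evaluating `value("root", "s0")` touches every node in the graph, which is one way to read the statement that the root's value function is the value obtained over all nodes.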

Learning MaxQ

1. maintain two tables \(C_{i}\) and \(\tilde{C}_{i}(s,a)\) (a special completion function corresponding to a special reward \(\tilde{R}\), which prevents the model from taking egregious ending actions)
2. choose \(a\) according to the exploration strategy
3. execute \(a\), observe \(s'\), and compute \(R(s'|s,a)\)

Then, update: