---
title: "Point-Based Value Iteration"
source: https://www.jemoka.com/posts/kbhpoint_based_value_iteration/
---

We keep track of a set of alpha vectors and a set of belief samples (which we get from point selection):

\begin{equation}
\Gamma = \{\alpha_{1}, \dots, \alpha_{m}\}
\end{equation}

and

\begin{equation}
B = \{b_{1}, \dots, b_{m}\}
\end{equation}

To keep these alpha vectors a valid lower bound on the optimal value function, seed them with something like the blind lower bound.

We can estimate our utility function at any belief by picking the highest-valued alpha vector in the set:

\begin{equation}
U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b
\end{equation}

We now define a function named backup (see PBVI Backup), and call it on all of our beliefs to generate a new set of alpha vectors:

\begin{equation}
\Gamma^{t+1} = \{\text{backup}(\Gamma^{t}, b) \mid b \in B\}
\end{equation}

where each call returns a single alpha vector:

\begin{equation}
\alpha \leftarrow \text{backup}(\Gamma, b)
\end{equation}

so we call backup once for each \(b \in B\), and the new set has one alpha vector per belief sample.

## PBVI Backup

The backup procedure takes \(\Gamma\) and \(b\). We want to mint a single new alpha vector: we build one candidate alpha vector \(\alpha_{a}\) per action and keep the one with the highest value at \(b\):

\begin{equation}
\alpha = \arg\max_{\alpha_{a}} \alpha_{a}^{\top} b
\end{equation}

Now, we define each \(\alpha_{a}\) as:

\begin{equation}
\alpha_{a}(s) = R(s,a) + \gamma \sum_{s',o} O(o \mid a,s') T(s' \mid s,a) \alpha_{a,o}(s')
\end{equation}

where, for every action-observation pair \((a,o)\), \(\alpha_{a,o}\) is the vector in the current \(\Gamma\) that gives the highest value estimate at the updated belief:

\begin{equation}
\alpha_{a,o} = \arg\max_{\alpha \in \Gamma} \alpha^{\top} \text{update}(b,a,o)
\end{equation}

## Randomized PBVI

see Perseus
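
The backup and the outer loop described above can be sketched in NumPy. This is a minimal illustration, not the note's own implementation. The array layout (`T[s, a, s2]`, `O[a, s2, o]`, `R[s, a]`), the helper names `update`, `backup`, `pbvi`, and `utility`, and the tiger-style toy problem are all assumptions made for the example. The seed is the simpler best-action worst-state bound \(\max_{a} \min_{s} R(s,a) / (1 - \gamma)\), which is also a valid lower bound, used here in place of the blind lower bound.

```python
import numpy as np

# Assumed array layout (not from the original note):
#   T[s, a, s2] = T(s2 | s, a),  O[a, s2, o] = O(o | a, s2),  R[s, a] = R(s, a)

def update(b, a, o, T, O):
    """Discrete Bayes filter: b'(s') is proportional to O(o|a,s') * sum_s T(s'|s,a) b(s)."""
    bp = O[a, :, o] * (b @ T[:, a, :])
    total = bp.sum()
    # Observation impossible under (b, a): fall back to uniform (an arbitrary choice).
    return bp / total if total > 0 else np.full(len(b), 1.0 / len(b))

def backup(Gamma, b, T, O, R, gamma):
    """Mint one new alpha vector at belief b from the current set Gamma."""
    n_states, n_actions = R.shape
    n_obs = O.shape[2]
    best_alpha, best_value = None, -np.inf
    for a in range(n_actions):
        alpha_a = R[:, a].astype(float)  # astype returns a fresh copy
        for o in range(n_obs):
            bp = update(b, a, o, T, O)
            # alpha_{a,o}: the vector in Gamma with the highest value at update(b, a, o)
            alpha_ao = max(Gamma, key=lambda alpha: alpha @ bp)
            # adds gamma * sum_{s'} O(o|a,s') T(s'|s,a) alpha_{a,o}(s') for every s
            alpha_a += gamma * (T[:, a, :] @ (O[a, :, o] * alpha_ao))
        value = alpha_a @ b
        if value > best_value:
            best_alpha, best_value = alpha_a, value
    return best_alpha

def pbvi(B, T, O, R, gamma, n_iters):
    """Repeat Gamma <- {backup(Gamma, b) | b in B} starting from a lower-bound seed."""
    # Best-action worst-state bound, a stand-in for the blind lower bound.
    c = R.min(axis=0).max() / (1.0 - gamma)
    Gamma = [np.full(R.shape[0], c)]
    for _ in range(n_iters):
        Gamma = [backup(Gamma, b, T, O, R, gamma) for b in B]
    return Gamma

def utility(Gamma, b):
    """U^Gamma(b) = max over alpha in Gamma of alpha^T b."""
    return max(alpha @ b for alpha in Gamma)

# Hypothetical tiger-style problem: 2 states, 3 actions (listen, open-left, open-right),
# 2 observations. All numbers are illustrative.
T = np.zeros((2, 3, 2))
T[:, 0, :] = np.eye(2)   # listening leaves the state unchanged
T[:, 1, :] = 0.5         # opening a door resets the state uniformly
T[:, 2, :] = 0.5
O = np.full((3, 2, 2), 0.5)
O[0] = [[0.85, 0.15], [0.15, 0.85]]  # listening is informative
R = np.array([[-1.0, -100.0, 10.0],
              [-1.0, 10.0, -100.0]])
discount = 0.95
B = [np.array([p, 1.0 - p]) for p in (0.0, 0.25, 0.5, 0.75, 1.0)]

Gamma = pbvi(B, T, O, R, discount, n_iters=30)
for b in B:
    print(b, utility(Gamma, b))
```

Replacing `Gamma` wholesale on every iteration follows the set-builder update above, so the number of alpha vectors always equals the number of belief samples. A variant that keeps old vectors around would need a pruning step, which this sketch leaves out.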