Suggest edit — Online POMDP Methods

Title

Name

Note

---
title: "Online POMDP Methods"
type: concept
related: [Pomdp Approximation, Monte Carlo Tree Search, Action Value Function, Receeding Horizon, Forward Search]
source: https://www.jemoka.com/posts/kbhonline_pomdp_methods/
confidence: high
status: active
---

These are basically MDP methods but tweaked. We make some changes:
for everywhere that we need a state, we use a belief to sample the next state given an action (random next step), we call our generative model to get a new observation, and call update(b,a,o) with our filter to propegate our belief forward if we need an action-value, we use the one-step lookahead in POMDP: \begin{equation} Q(b,a) = R(b,a)+\gamma \qty(\sum_{o}^{}P(o|b,a) U^{\Gamma}(update(b,a,o))) \end{equation}
where,
\begin{equation} R(b,a) = \sum_{s}^{} R(s,a)b(s) \end{equation}
\begin{align} P(o|b,a) &amp;= \sum_{s}^{} p(o|s,a) b(s) \\ &amp;= \sum_{s}^{} \sum_{s&rsquo;}^{} T(s&rsquo;|s,a) O(o|s&rsquo;,a) b(s) \end{align}
and where, if needed (i.e. most algorithms estimate this):
\begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b \end{equation}
we also revise our generative model:
each step requires belief and action, and we sample from our belief a next state, propegate belief forward, and use a traditional generative model to get the rewards and next states (which we don&rsquo;t use).
Receeding Horizon: plan to a depth \(d\), select action, replan Rollout with Lookahead: simple to implement, no grantees of optimality or even boundedness Forward Search: quite expensive&mdash;exponential given the size of horizon monte-carlo tree search, but instead our counts are stored not in terms of states (which we don&rsquo;t know), but sequences of action observations: \(h = a_1o_2a_2o_1a_2o_1\) etc. Then, the counter takes \(N(h,a)\) as input: will head towards optimality and it requires a generative model to sample tracks Branch and Bound, but you use the POMDP Approximation methods to estimate the upper and lower bounds of your utility: its Forward Search with pruning Sparse Sampling: its Forward Search, but the next action-value is determined by a finite sampling of next observations and rewards and you average their future utility. that is, the action-value before depth \(d\) is obtained by: \(Q(b,a) = \frac{1}{m} \sum_{i=1}^{m} \qty(r_{a}^{(i)}+\gammaU_{d-1}(Update(b,a,o_{a}^{(i)})))\)