[[
wikihub
]]
Search
⌘K
Explore
People
For Agents
Sign in
Explore
People
For Agents
Sign in
@jemoka / Jemoka Knowledge Base / wiki/concepts/online_pomdp_methods.md
Suggest edit
Cancel
Submit suggestion
Title
Name
Note
--- title: "Online POMDP Methods" type: concept related: [Pomdp Approximation, Monte Carlo Tree Search, Action Value Function, Receeding Horizon, Forward Search] source: https://www.jemoka.com/posts/kbhonline_pomdp_methods/ confidence: high status: active --- These are basically MDP methods but tweaked. We make some changes: for everywhere that we need a state, we use a belief to sample the next state given an action (random next step), we call our generative model to get a new observation, and call update(b,a,o) with our filter to propegate our belief forward if we need an action-value, we use the one-step lookahead in POMDP: \begin{equation} Q(b,a) = R(b,a)+\gamma \qty(\sum_{o}^{}P(o|b,a) U^{\Gamma}(update(b,a,o))) \end{equation} where, \begin{equation} R(b,a) = \sum_{s}^{} R(s,a)b(s) \end{equation} \begin{align} P(o|b,a) &= \sum_{s}^{} p(o|s,a) b(s) \\ &= \sum_{s}^{} \sum_{s’}^{} T(s’|s,a) O(o|s’,a) b(s) \end{align} and where, if needed (i.e. most algorithms estimate this): \begin{equation} U^{\Gamma}(b) = \max_{\alpha \in \Gamma} \alpha^{\top} b \end{equation} we also revise our generative model: each step requires belief and action, and we sample from our belief a next state, propegate belief forward, and use a traditional generative model to get the rewards and next states (which we don’t use). Receeding Horizon: plan to a depth \(d\), select action, replan Rollout with Lookahead: simple to implement, no grantees of optimality or even boundedness Forward Search: quite expensive—exponential given the size of horizon monte-carlo tree search, but instead our counts are stored not in terms of states (which we don’t know), but sequences of action observations: \(h = a_1o_2a_2o_1a_2o_1\) etc. Then, the counter takes \(N(h,a)\) as input: will head towards optimality and it requires a generative model to sample tracks Branch and Bound, but you use the POMDP Approximation methods to estimate the upper and lower bounds of your utility: its Forward Search with pruning Sparse Sampling: its Forward Search, but the next action-value is determined by a finite sampling of next observations and rewards and you average their future utility. that is, the action-value before depth \(d\) is obtained by: \(Q(b,a) = \frac{1}{m} \sum_{i=1}^{m} \qty(r_{a}^{(i)}+\gammaU_{d-1}(Update(b,a,o_{a}^{(i)})))\)