@jemoka / Jemoka Knowledge Base / raw/concept/kbhoption.md
---
title: "Option (MDP)"
source: https://www.jemoka.com/posts/kbhoption/
---

an Option (MDP) represents a high level collection of actions. Big Picture: abstract away your big policy into \(n\) small policies, and value-iterate over expected values of the big policies.

## Markov Option

A Markov Option is given by a triple \((I, \pi, \beta)\):

- \(I \subset S\), the states from which the option may be started
- \(\pi: S \times A \to [0,1]\), the policy followed during that option
- \(\beta(s)\), the probability of the option terminating at state \(s\)

### one-step options

You can develop one-shot options, which terminate immediately after one action. For a primitive action \(a\), the corresponding option is:

- \(I = \{s: a \in A_{s}\}\)
- \(\pi(s,a) = 1\)
- \(\beta(s) = 1\)

## option value function

\begin{equation}
Q^{\mu}(s,o) = \mathbb{E}\qty[r_{t+1} + \gamma r_{t+2} + \dots]
\end{equation}

where \(\mu\) is some option selection process, and the expectation is over trajectories that start option \(o\) in state \(s\) at time \(t\) and follow \(\mu\) once \(o\) terminates.

## semi-markov decision process

a semi-markov decision process is a decision process over options in which the time elapsed is part of each option-level transition, while the policy inside each option still acts on the underlying MDP.

\begin{equation}
T(s', \tau \mid s,o)
\end{equation}

where \(\tau\) is time elapsed.

because an option runs until it terminates, each option-level transition jumps across many underlying states, so one backup can propagate value to a lot of states.

## intra option q-learning

\begin{equation}
Q_{k+1} (s_{t},o) = (1-\alpha_{k})Q_{k}(s_{t}, o) + \alpha_{k} \qty(r_{t+1} + \gamma U_{k}(s_{t+1}, o))
\end{equation}

where:

\begin{equation}
U_{k}(s,o) = (1-\beta(s))Q_{k}(s,o) + \beta(s) \max_{o' \in O} Q_{k}(s,o')
\end{equation}

Here \(U_{k}(s,o)\) is the value of arriving in \(s\) while executing \(o\): with probability \(1-\beta(s)\) the option continues, and with probability \(\beta(s)\) it terminates and the best option is chosen next.
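
The intra-option update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes deterministic option policies, a dictionary-backed Q table that defaults to 0, and that a single transition updates every option whose policy agrees with the action actually taken (that consistency rule is an assumption, not something stated in the note). The `Option` class and the function names are hypothetical, and the max in \(U\) is taken over all options \(o'\).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
QTable = Dict[Tuple[State, str], float]


@dataclass(frozen=True)
class Option:
    """A Markov option (I, pi, beta); a deterministic pi is an assumption."""
    name: str
    initiation: frozenset               # I: states where the option may start (unused by the update)
    policy: Callable[[State], Action]   # pi: the action the option takes in state s
    beta: Callable[[State], float]      # beta(s): probability of terminating in s


def u_value(Q: QTable, options: List[Option], s: State, o: Option) -> float:
    """U(s, o) = (1 - beta(s)) Q(s, o) + beta(s) max_{o'} Q(s, o')."""
    b = o.beta(s)
    best = max(Q.get((s, other.name), 0.0) for other in options)
    return (1.0 - b) * Q.get((s, o.name), 0.0) + b * best


def intra_option_update(Q: QTable, options: List[Option], s: State, a: Action,
                        r: float, s_next: State, alpha: float, gamma: float) -> None:
    """Back up every option whose policy agrees with the observed action a."""
    # Compute all targets from the old table before writing anything.
    targets = {
        o.name: r + gamma * u_value(Q, options, s_next, o)
        for o in options
        if o.policy(s) == a
    }
    for name, target in targets.items():
        Q[(s, name)] = (1.0 - alpha) * Q.get((s, name), 0.0) + alpha * target


# Hypothetical 4-state chain 0..3; entering state 3 pays reward 1.
states = frozenset(range(4))
options = [
    # One-step option: always terminates after a single action.
    Option("step_right", states, lambda s: "right", lambda s: 1.0),
    # Multi-step option: keeps going right and only terminates in state 3.
    Option("run_right", states, lambda s: "right", lambda s: 1.0 if s == 3 else 0.0),
    Option("step_left", states, lambda s: "left", lambda s: 1.0),
]

Q: QTable = {}
intra_option_update(Q, options, s=2, a="right", r=1.0, s_next=3, alpha=0.5, gamma=0.9)
intra_option_update(Q, options, s=1, a="right", r=0.0, s_next=2, alpha=0.5, gamma=0.9)
```

In this toy run each observed "right" transition updates both `step_right` and `run_right`, while `step_left` is left alone because its policy disagrees with the action taken. Learning about several options from one piece of experience is what distinguishes the intra-option update from updating only the option currently being executed.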