[[
wikihub
]]
Search
⌘K
Explore
People
For Agents
Sign in
Explore
People
For Agents
Sign in
@jemoka / Jemoka Knowledge Base / wiki/concepts/optimal_exploration.md
Suggest edit
Cancel
Submit suggestion
Title
Name
Note
--- title: "Optimal Exploration Policy" type: concept related: [Action Value Function, Policy] source: https://www.jemoka.com/posts/kbhoptimal_exploration/ confidence: high status: active --- Suppose we have offline statistic regarding wins and losses of each slot machine as our state: \begin{equation} w_1, l_{1}, \dots, w_{n}, l_{n} \end{equation} What if we want to create a policy that maximises exploration? We construct a value function: \begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation} our policy is the greedy policy: \begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \arg\max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation} Now, how do we go about calculating the action-value: \begin{align} Q ([w_1, l_{1}, \dots, w_{n}, l_{n}], a) =\ & \frac{w_{a}+1}{w_{a}+l_{a}+2} (R(w) + U^{*}(\dots, w_{a}+1, l_{a}, \dots)) \&+ \qty(1-\frac{w_{a}+1}{w_{a}+l_{a}+2})(R(l) + U^{*}(\dots, w_{a}, l_{a}+1, \dots)) \end{align} “the probability of you win” (expectation of Beta Distribution), times the instantaneous reward you win + the utility you gain in terms of information of you doing that. To solve this in a finite horizon, note that at time \(t=k\), your \(U^{*}\) is \(0\) because you have nothing to do anymore. Then, you can back up slowly to get each previous state: calculate \(Q[w_1-1, l_1, …, 1]\) calculate \(Q[w_1, l_1-1, …,1]\) and so on, and choosing the utility of each state from there.