Suggest edit — Optimal Exploration Policy

Title

Name

Note

---
title: "Optimal Exploration Policy"
type: concept
related: [Action Value Function, Policy]
source: https://www.jemoka.com/posts/kbhoptimal_exploration/
confidence: high
status: active
---

Suppose we have offline statistic regarding wins and losses of each slot machine as our state:
\begin{equation} w_1, l_{1}, \dots, w_{n}, l_{n} \end{equation}
What if we want to create a policy that maximises exploration?
We construct a value function:
\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}
our policy is the greedy policy:
\begin{equation} U^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}]) = \arg\max_{a} Q^{*}([w_1, l_{1}, \dots, w_{n}, l_{n}], a) \end{equation}
Now, how do we go about calculating the action-value:
\begin{align} Q ([w_1, l_{1}, \dots, w_{n}, l_{n}], a) =\ &amp; \frac{w_{a}+1}{w_{a}+l_{a}+2} (R(w) + U^{*}(\dots, w_{a}+1, l_{a}, \dots)) \&amp;+ \qty(1-\frac{w_{a}+1}{w_{a}+l_{a}+2})(R(l) + U^{*}(\dots, w_{a}, l_{a}+1, \dots)) \end{align}
&ldquo;the probability of you win&rdquo; (expectation of Beta Distribution), times the instantaneous reward you win + the utility you gain in terms of information of you doing that.
To solve this in a finite horizon, note that at time \(t=k\), your \(U^{*}\) is \(0\) because you have nothing to do anymore.
Then, you can back up slowly to get each previous state:
calculate \(Q[w_1-1, l_1, &hellip;, 1]\) calculate \(Q[w_1, l_1-1, &hellip;,1]\) and so on, and choosing the utility of each state from there.