Regret and Convergence Bounds for Continuum-Armed Bandits
By Eric Cope
This talk will consider immediate-reward reinforcement learning problems with
continuous action sets, also known as “continuum-armed bandit problems.” I
derive both lower bounds on the growth rate of the regret and upper bounds on
the rate at which the control values converge to the optimum. I
explicitly characterize the dependence of these convergence rates on the minimal
rate of variation of the mean reward function in a neighborhood of the optimal
control. These bounds can be used to demonstrate the asymptotic optimality of
the Kiefer-Wolfowitz method of stochastic approximation over a large class of
possible mean reward functions.
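To fix ideas, the Kiefer-Wolfowitz scheme mentioned above can be sketched on a toy one-dimensional problem. The quadratic mean reward, the noise level, and the particular gain sequences below are illustrative assumptions, not details taken from the talk; they merely satisfy the classical conditions (step sizes summing to infinity, with the squared ratios of step size to perturbation width summable).

```python
import numpy as np

def kiefer_wolfowitz(n_steps=2000, x0=0.0, noise_std=0.1, seed=0):
    """Maximize a noisy reward over [0, 1] by finite-difference ascent.

    A minimal sketch: the mean reward f(x) = -(x - 0.5)**2 is a stand-in
    for the unknown reward function of a continuum-armed bandit.
    """
    rng = np.random.default_rng(seed)
    mean_reward = lambda x: -(x - 0.5) ** 2  # unknown to the learner

    x = x0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n             # step sizes: sum a_n diverges
        c_n = 0.5 * n ** -0.25    # widths: sum (a_n / c_n)**2 converges
        # Two noisy pulls at perturbed actions yield a gradient estimate.
        r_plus = mean_reward(x + c_n) + noise_std * rng.standard_normal()
        r_minus = mean_reward(x - c_n) + noise_std * rng.standard_normal()
        grad_est = (r_plus - r_minus) / (2.0 * c_n)
        x = min(1.0, max(0.0, x + a_n * grad_est))  # project onto [0, 1]
    return x

x_final = kiefer_wolfowitz()
print(x_final)  # close to the optimal action 0.5
```

Each iteration spends two pulls to form a central-difference estimate of the reward gradient; shrinking the perturbation width drives the finite-difference bias to zero while the decaying step size averages out the noise.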