Regret and Convergence Bounds for Continuum-Armed Bandits

By Eric Cope

This talk will consider immediate-reward reinforcement learning problems with continuous action sets, also known as “continuum-armed bandit problems.” I derive lower bounds on the growth rate of the regret and upper bounds on the rates at which the control values converge to the optimum. I explicitly characterize how these convergence rates depend on the minimal rate of variation of the mean reward function in a neighborhood of the optimal control. The bounds can be used to demonstrate the asymptotic optimality of the Kiefer-Wolfowitz method of stochastic approximation with respect to a large class of possible mean reward functions.
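
For readers unfamiliar with the Kiefer-Wolfowitz method mentioned above, the following is a minimal sketch of the classical finite-difference recursion on a toy problem. It is not taken from the talk: the function names, the quadratic reward, the noise level, the interval [0, 1], and the step-size schedules a_n = a/n and c_n = c/n^(1/4) are all illustrative assumptions, the schedules being one common choice that satisfies the classical convergence conditions.

```python
import random


def kiefer_wolfowitz(sample_reward, x0, n_steps, a=1.0, c=0.5,
                     lo=0.0, hi=1.0, seed=0):
    """Finite-difference stochastic approximation for maximizing a mean reward.

    sample_reward(x, rng) returns one noisy reward for action x.
    All parameter names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        a_n = a / n            # step size: sum a_n diverges
        c_n = c / n ** 0.25    # probe width: shrinks more slowly than a_n
        # Two noisy rewards, one on each side of the current control.
        # Assumption: the toy reward can be sampled slightly outside [lo, hi].
        y_plus = sample_reward(x + c_n, rng)
        y_minus = sample_reward(x - c_n, rng)
        # Central finite-difference estimate of the gradient of the mean reward.
        grad_estimate = (y_plus - y_minus) / (2.0 * c_n)
        # Ascent step, then projection of the control back onto [lo, hi].
        x = min(hi, max(lo, x + a_n * grad_estimate))
    return x


def noisy_quadratic(x, rng):
    """Toy reward: mean -(x - 0.3)^2, maximized at x = 0.3, plus Gaussian noise."""
    return -(x - 0.3) ** 2 + rng.gauss(0.0, 0.1)


x_hat = kiefer_wolfowitz(noisy_quadratic, x0=0.8, n_steps=5000)
print(f"estimated optimal control: {x_hat:.3f}")
```

In this toy case the mean reward is quadratic around its maximizer, so the finite-difference estimate is unbiased and the iterates settle near 0.3. How quickly they do so for other reward shapes is the kind of question the bounds in the talk address.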
