Towards Q-learning the Whittle index for restless bandits

Jing Fu, Yoni Nazarathy, Sarat Moka, Peter G. Taylor

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

10 Citations (Scopus)

Abstract

We consider the restless multi-armed bandit problem (RMABP) with an infinite horizon average cost objective. Each arm of the RMABP is associated with a Markov process that operates in two modes: active and passive. At each time slot a controller needs to designate a subset of the arms to be active; the processes of the active arms then evolve differently from those of the passive arms. Treated as an optimal control problem, the RMABP is known to be computationally intractable to solve optimally. In many cases, the Whittle index policy achieves near-optimal performance and can be tractably found. Nevertheless, computation of the Whittle indices requires knowledge of the transition matrices of the underlying processes, which are sometimes hidden from decision makers. In this paper, we take first steps towards a tractable and efficient reinforcement learning algorithm for controlling such a system. We set up parallel Q-learning recursions, with each recursion mapping to an individual possible value of the Whittle index. We then update these recursions as we control the system, learning an approximation of the Whittle index as time evolves. Tested on several examples, our control outperforms naive priority allocations and nears the performance of the fully-informed Whittle index policy.
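The idea of running one Q-learning recursion per candidate index value can be illustrated with a minimal sketch for a single arm. This is not the authors' algorithm: it assumes a discounted cost criterion rather than the average cost objective of the paper, uniformly random exploration, and a fixed grid of candidate values. The function name `learn_whittle_indices` and all of its parameters are hypothetical. The transition matrices are used only to simulate the arm; the learner sees nothing but the sampled transitions.

```python
import random


def learn_whittle_indices(P_passive, P_active, cost, grid,
                          steps=20000, gamma=0.9, seed=0):
    """Estimate a Whittle-style index per state for one arm (illustrative only).

    P_passive, P_active: row-stochastic matrices, used only as a simulator.
    cost: per-state cost incurred at every step, whatever the action.
    grid: candidate index values; one Q-table is maintained per value.
    """
    rng = random.Random(seed)
    n = len(cost)
    # Q[k][s][a] belongs to candidate grid[k]; a = 0 is passive, a = 1 is active.
    Q = [[[0.0, 0.0] for _ in range(n)] for _ in grid]
    visits = [[0, 0] for _ in range(n)]
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)  # assumption: uniformly random exploration
        P = P_active if a == 1 else P_passive
        s_next = rng.choices(range(n), weights=P[s])[0]
        visits[s][a] += 1
        alpha = 1.0 / visits[s][a] ** 0.6  # decaying step size
        # The same observed transition updates every parallel recursion.
        for k, lam in enumerate(grid):
            # Being active is charged an extra lam under candidate k.
            c = cost[s] + (lam if a == 1 else 0.0)
            target = c + gamma * min(Q[k][s_next])
            Q[k][s][a] += alpha * (target - Q[k][s][a])
        s = s_next
    # For each state, pick the candidate at which the learned active and
    # passive Q-values are closest, i.e. the two actions look equally good.
    indices = []
    for state in range(n):
        k_best = min(range(len(grid)),
                     key=lambda k: abs(Q[k][state][1] - Q[k][state][0]))
        indices.append(grid[k_best])
    return indices


if __name__ == "__main__":
    # Hypothetical two-state arm: state 1 is costly, and being active
    # makes a move to the cheap state 0 more likely.
    P0 = [[0.9, 0.1], [0.1, 0.9]]
    P1 = [[0.9, 0.1], [0.6, 0.4]]
    print(learn_whittle_indices(P0, P1, [0.0, 1.0], [-1.0, 0.0, 1.0, 2.0, 3.0]))
```

In a multi-arm setting, a controller would then activate the arms whose current states carry the largest estimated indices. The resolution of the estimate is limited by the spacing of the grid.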

Original language: English
Title of host publication: 2019 Australian and New Zealand Control Conference (ANZCC)
Place of Publication: Piscataway, NJ
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 249-254
Number of pages: 6
ISBN (Electronic): 9781728117867, 9781728117850
ISBN (Print): 9781728117874
Publication status: Published - 2019
Externally published: Yes
Event: 2019 Australian and New Zealand Control Conference, ANZCC 2019 - Auckland, New Zealand
Duration: 27 Nov 2019 to 29 Nov 2019

Conference

Conference: 2019 Australian and New Zealand Control Conference, ANZCC 2019
Country/Territory: New Zealand
City: Auckland
Period: 27/11/19 to 29/11/19
