Exploration-Exploitation in Multi-Agent Learning

Stefanos Leonardos (King's College)

Exploration-exploitation is a powerful and practical tool in multi-agent learning; however, its effects on the performance of learning algorithms are far from understood. To make progress on this problem, we study smooth Q-learning (SQL), a continuous-time variant of the standard Q-learning algorithm. We start by showing (1) that SQL has bounded regret in arbitrary games for a cost model that explicitly captures the trade-off between utility maximization and exploration costs, and (2) that it always converges to the set of quantal-response equilibria (QRE), the standard solution concept for games under bounded rationality, in weighted potential (i.e., coordination) games with heterogeneous learning agents. We then turn to our main task: measuring the effect of exploration on collective system performance. We characterize the geometry of the QRE surface in low-dimensional systems and link our findings to catastrophe (bifurcation) theory. We show that, over time, the system undergoes phase transitions in which the number and stability of equilibria can change dramatically under an infinitesimal change to the exploration hyperparameter. Based on this, we give a formal treatment of how exploration can provably lead to equilibrium selection, with both positive and negative (and potentially unbounded) effects on system performance.
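
To make the setup concrete, here is a minimal sketch (not the authors' code) of smooth Q-learning in its commonly used form, the entropy-regularised replicator (Boltzmann Q-learning) dynamics, simulated in a 2x2 coordination game. The Stag Hunt payoffs, step size, initial condition and the two exploration rates T are illustrative assumptions; the point is only that the limit point of the dynamics, a QRE of the game, shifts from a near-pure equilibrium to a mixed strategy as the exploration rate grows.

```python
import numpy as np

A = np.array([[4.0, 0.0],
              [3.0, 3.0]])   # illustrative Stag Hunt payoffs for the row player (symmetric game)

def sql_step(x, y, T, dt=0.01):
    """One Euler step of the smooth Q-learning dynamics for both agents."""
    def dyn(p, rewards):
        avg = p @ rewards                         # expected payoff under p
        entropy = p @ np.log(p) - np.log(p)       # sum_j p_j ln p_j - ln p_k
        return p * (rewards - avg + T * entropy)  # replicator term + exploration term
    rx = A @ y            # row player's per-action payoffs against y
    ry = A @ x            # column player's per-action payoffs against x (symmetric game)
    x_new, y_new = x + dt * dyn(x, rx), y + dt * dyn(y, ry)
    return x_new / x_new.sum(), y_new / y_new.sum()   # renormalise against numerical drift

for T in (0.1, 2.0):                  # low vs. high exploration rate
    x = y = np.array([0.45, 0.55])    # identical (illustrative) initial condition
    for _ in range(100_000):
        x, y = sql_step(x, y, T)
    print(f"T={T}:  x -> {np.round(x, 3)}")   # low T: near a pure equilibrium; high T: mixed QRE
```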

We then turn to competitive settings, modelled as weighted zero-sum polymatrix games, and show that SQL always converges to the unique QRE when all agents use positive exploration rates. Convergence of SQL to the QRE is fast regardless of the number of agents and requires no hyperparameter fine-tuning. As our experiments showcase, these results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.
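
The same kind of sketch illustrates the competitive case. Matching Pennies is used here as an illustrative zero-sum game (not necessarily the talk's setting), with two different positive exploration rates: under the entropy-regularised replicator form of smooth Q-learning assumed above, both strategies spiral into the unique QRE, which for this game is the uniform strategy.

```python
import numpy as np

A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])   # Matching Pennies payoffs for the row player; the column player receives -A

def sql_step(x, y, Tx, Ty, dt=0.01):
    """One Euler step of smooth Q-learning with heterogeneous exploration rates."""
    def dyn(p, rewards, T):
        avg = p @ rewards
        return p * (rewards - avg + T * (p @ np.log(p) - np.log(p)))
    rx = A @ y             # row's per-action payoffs against y
    ry = -A.T @ x          # column's per-action payoffs against x (zero-sum)
    x_new, y_new = x + dt * dyn(x, rx, Tx), y + dt * dyn(y, ry, Ty)
    return x_new / x_new.sum(), y_new / y_new.sum()

x, y = np.array([0.9, 0.1]), np.array([0.2, 0.8])   # arbitrary starting strategies
for _ in range(100_000):
    x, y = sql_step(x, y, Tx=0.5, Ty=1.5)           # different positive exploration rates
print(np.round(x, 3), np.round(y, 3))               # both should approach the uniform QRE (0.5, 0.5)
```

With the exploration terms switched off (Tx = Ty = 0), the same replicator part would merely cycle around the interior equilibrium, which is what makes the convergence guarantee under positive exploration notable.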
