Average-Reward Off-Policy Policy Evaluation with Function Approximation

Shangtong Zhang, Yi Wan, Richard S Sutton and Shimon Whiteson

Abstract

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.

Book Title

Proceedings of the 38th International Conference on Machine Learning

Editor

Meila, Marina and Zhang, Tong

Month

18–24 Jul

Pages

12578–12588

Publisher

PMLR

Series

Proceedings of Machine Learning Research

Volume

139

Year

2021
