Doubly Robust Off-policy Evaluation for Reinforcement Learning

Dr. Lihong Li (Microsoft Research)

We study the problem of evaluating a policy that is different from the one that generated the data. This problem, known as off-policy evaluation in reinforcement learning (RL), arises whenever one wants to estimate the value of a new solution from historical data before deploying it in the real system, a critical step in applying RL to most real-world applications. Despite the fundamental importance of the problem, existing general methods either have uncontrolled bias or suffer from high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, obtaining the best of both worlds: the estimator is guaranteed to be unbiased and has low variance, and as a point estimator it outperforms the most popular importance-sampling estimator and its variants in most cases. We also provide theoretical results on the hardness of the problem, and show that our estimator can approach an asymptotic lower bound when no prior knowledge is available about the underlying problem.
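To make the idea concrete, below is a minimal sketch of a doubly robust off-policy value estimator for episodic RL. It is not the talk's code: the function and argument names (`pi_e`, `pi_b`, `q_hat`, `actions`) are assumptions chosen for illustration, and it assumes a finite action set and logged trajectories of (state, action, reward) tuples. The estimator combines an approximate action-value model (the "model" part) with importance-weighted corrections (the "importance sampling" part), so it stays unbiased when the behavior policy is known while the model reduces variance.

```python
import numpy as np

def doubly_robust_value(trajectories, pi_e, pi_b, q_hat, actions, gamma=1.0):
    """Doubly robust (DR) off-policy estimate of an evaluation policy's value.

    trajectories: list of episodes, each a list of (state, action, reward) tuples
                  logged under the behavior policy.
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation / behavior policy.
    q_hat(s, a): approximate action-value model, used as a control variate.
    actions: finite action set, used to form the model-based state value.
    """
    estimates = []
    for episode in trajectories:
        v_dr = 0.0
        # DR recursion, applied backwards from the end of the episode:
        # V_DR = V_hat(s) + rho * (r + gamma * V_DR_next - Q_hat(s, a))
        for state, action, reward in reversed(episode):
            v_hat = sum(pi_e(a, state) * q_hat(state, a) for a in actions)
            rho = pi_e(action, state) / pi_b(action, state)  # importance ratio
            v_dr = v_hat + rho * (reward + gamma * v_dr - q_hat(state, action))
        estimates.append(v_dr)
    # Average the per-episode estimates over the logged dataset.
    return float(np.mean(estimates))
```

If `q_hat` is set identically to zero, the recursion reduces to the per-decision importance-sampling estimator; the better `q_hat` approximates the true action-value function, the smaller the variance of the correction terms.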