Fidelity‚ Soundness‚ and Efficiency of Interleaved Comparison Methods
Katja Hofmann‚ Shimon Whiteson and Maarten de Rijke
Ranker evaluation is central to the research into search engines, be it to compare rankers or to provide feedback for learning to rank. Traditional evaluation approaches do not scale well because they require explicit relevance judgments of document-query pairs, which are expensive to obtain. A promising alternative is the use of interleaved comparison methods, which compare rankers using click data obtained when interleaving their rankings. In this article, we propose a framework for analyzing interleaved comparison methods. An interleaved comparison method has fidelity if the expected outcome of ranker comparisons properly corresponds to the true relevance of the ranked documents. It is sound if its estimates of that expected outcome are unbiased and consistent. It is efficient if those estimates are accurate with only little data. We analyze existing interleaved comparison methods and find that, while sound, none meet our criteria for fidelity. We propose a probabilistic interleave method, which is sound and has fidelity. We show empirically that, by marginalizing out variables that are known, it is more efficient than existing interleaved comparison methods. Using importance sampling we derive a sound extension that is able to reuse historical data collected in previous comparisons of other ranker pairs.