Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies
Overview
Overall Novelty Assessment
The paper introduces Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR) estimators for off-policy evaluation of ranking policies under deterministic logging. According to the taxonomy, it occupies the 'Click-Based IPS and Doubly Robust Estimators' leaf, which currently contains only this paper. This leaf sits within the broader 'Click-Based Importance Weighting for Ranking' branch, which also includes a single sibling leaf on position-based methods. The taxonomy reveals a relatively sparse research direction, with only nine papers across the entire taxonomy.
The taxonomy positions this work within a specialized niche that bridges two broader research areas. The sibling branch 'General Deterministic Policy OPE Methods' addresses deterministic policies across diverse action spaces using kernel-based and doubly robust techniques for continuous actions, while 'Domain-Specific and Application-Driven OPE' tackles concrete applications like personalized pricing and counterfactual learning-to-rank. The paper's focus on exploiting click stochasticity distinguishes it from general-purpose deterministic policy frameworks and from domain-specific methods that do not leverage user interaction randomness as an importance weighting mechanism.
Among the fifteen candidates examined, none was found to refute any of the three core contributions. The CIPS estimator itself was not compared against any candidates, so the absence of a refutation for that contribution reflects a lack of comparison rather than evidence of novelty. The theoretical analysis of CIPS bias and variance was checked against five candidates with no refutations, and the CDR estimator extension was checked against ten candidates, again with no overlapping prior work identified. Within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of click-based importance weighting and deterministic logging therefore appears relatively unexplored, though the small candidate pool means the analysis cannot claim exhaustive coverage.
The limited search scope and sparse taxonomy structure indicate that this research direction is emerging rather than saturated. The absence of other papers in the same leaf and the absence of refuting candidates across all contributions suggest the work occupies a distinct position, though a broader literature search might reveal additional related efforts in adjacent communities or application domains not captured by the semantic search strategy employed here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce CIPS, a new off-policy evaluation estimator that uses click probability as a form of importance weighting instead of relying on logging policy stochasticity. This enables low-bias OPE even under deterministic logging policies where existing methods fail.
The authors establish formal theoretical guarantees showing that CIPS achieves unbiasedness under click-wise common support and independence of potential rewards conditions, which are less restrictive than conditions required by existing methods. They also characterize the variance of CIPS.
The authors extend CIPS to CDR by incorporating a regression model for expected potential rewards. This extension achieves the same bias as CIPS while reducing variance when the reward model is accurate.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Click-based Inverse Propensity Score (CIPS) estimator
The authors introduce CIPS, a new off-policy evaluation estimator that uses click probability as a form of importance weighting instead of relying on logging policy stochasticity. This enables low-bias OPE even under deterministic logging policies where existing methods fail.
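To make the failure mode of existing methods concrete, the following minimal Python sketch applies the standard IPS estimator to a simplified single-action setting rather than a full ranking. It is not the paper's CIPS estimator, whose formula is not reproduced in this assessment. All names (`ips_estimate`, `expected_reward`) and numbers are illustrative assumptions. Under a deterministic logging policy only the logged action ever receives a nonzero weight, so the rewards of the other actions that the target policy would choose never enter the estimate.

```python
import numpy as np

def ips_estimate(target_probs, logging_probs, actions, rewards):
    """Standard IPS: mean of (pi(a|x) / pi_0(a|x)) * r over the logged samples."""
    idx = np.arange(len(actions))
    weights = target_probs[idx, actions] / logging_probs[idx, actions]
    return float(np.mean(weights * rewards))

rng = np.random.default_rng(0)
n, n_actions = 10_000, 3

# Hypothetical mean reward of each action (assumption for illustration only).
expected_reward = np.array([0.2, 0.5, 0.8])

# Deterministic logging policy: always plays action 0.
logging_probs = np.zeros((n, n_actions))
logging_probs[:, 0] = 1.0
actions = np.zeros(n, dtype=int)
rewards = rng.binomial(1, expected_reward[actions])

# Target policy: uniform over all three actions.
target_probs = np.full((n, n_actions), 1.0 / n_actions)

true_value = float(target_probs[0] @ expected_reward)
estimate = ips_estimate(target_probs, logging_probs, actions, rewards)

print(f"true value:   {true_value:.3f}")
print(f"IPS estimate: {estimate:.3f}")
```

In this toy setup the estimate is roughly one third of the logged action's mean reward, far below the true value of the uniform target policy, because actions 1 and 2 are never observed. This lack of support is the gap that a click-based weighting scheme is claimed to address.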
Theoretical analysis of CIPS bias and variance properties
The authors establish formal theoretical guarantees showing that CIPS achieves unbiasedness under click-wise common support and independence of potential rewards conditions, which are less restrictive than conditions required by existing methods. They also characterize the variance of CIPS.
[10] Uncertainty calibration for counterfactual propensity estimation in recommendation
[11] Bilateral Self-unbiased Learning from Biased Implicit Feedback
[12] Offline Evaluation of Ranked Lists using Parametric Estimation of Propensities
[13] Model-based Unbiased Learning to Rank
[14] Non-Clicks Mean Irrelevant? Propensity Ratio Scoring As a Correction
Click-based Doubly Robust (CDR) estimator extension
The authors extend CIPS to CDR by incorporating a regression model for expected potential rewards. This extension achieves the same bias as CIPS while reducing variance when the reward model is accurate.
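The claim that a reward model can reduce variance without changing bias follows the general doubly robust pattern. The sketch below illustrates that pattern with the standard DR estimator in a single-action setting under a stochastic logging policy. It is not the paper's CDR estimator, which uses click-based weights that are not specified in this assessment. The policies, the reward means, and the assumption that the reward model `q_hat` is exactly accurate are all hypothetical choices made for illustration.

```python
import numpy as np

def ips_terms(target_probs, logging_probs, actions, rewards):
    """Per-sample IPS terms: w * r."""
    idx = np.arange(len(actions))
    weights = target_probs[idx, actions] / logging_probs[idx, actions]
    return weights * rewards

def dr_terms(target_probs, logging_probs, actions, rewards, q_hat):
    """Per-sample DR terms: model baseline plus weighted residual correction."""
    idx = np.arange(len(actions))
    weights = target_probs[idx, actions] / logging_probs[idx, actions]
    baseline = np.sum(target_probs * q_hat, axis=1)          # E_pi[q_hat(x, a)]
    correction = weights * (rewards - q_hat[idx, actions])   # fixes model error
    return baseline + correction

rng = np.random.default_rng(0)
n, n_actions = 20_000, 3
expected_reward = np.array([0.2, 0.5, 0.8])  # hypothetical reward means

logging_p = np.array([0.6, 0.3, 0.1])        # stochastic logging policy (assumed)
target_p = np.array([0.1, 0.3, 0.6])         # target policy (assumed)
logging_probs = np.tile(logging_p, (n, 1))
target_probs = np.tile(target_p, (n, 1))

actions = rng.choice(n_actions, size=n, p=logging_p)
rewards = rng.binomial(1, expected_reward[actions]).astype(float)

# Assumption: the regression model recovers the true expected rewards.
q_hat = np.tile(expected_reward, (n, 1))

ips = ips_terms(target_probs, logging_probs, actions, rewards)
dr = dr_terms(target_probs, logging_probs, actions, rewards, q_hat)
true_value = float(target_p @ expected_reward)

print(f"true value: {true_value:.3f}")
print(f"IPS: mean={ips.mean():.3f}, per-sample var={ips.var():.3f}")
print(f"DR:  mean={dr.mean():.3f}, per-sample var={dr.var():.3f}")
```

Both estimators target the same value, but with an accurate reward model the DR correction term only carries the reward noise, so its per-sample variance is smaller. With a poor reward model the variance advantage can shrink or disappear.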