Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: off-policy evaluation; ranking; common support; deterministic logging
Abstract:

Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data-collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR), which exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, particularly in settings with completely deterministic logging policies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR) estimators for off-policy evaluation of ranking policies under deterministic logging. According to the taxonomy, it occupies the 'Click-Based IPS and Doubly Robust Estimators' leaf, which currently contains no paper other than this one. This leaf sits within the broader 'Click-Based Importance Weighting for Ranking' branch, which also includes a single sibling leaf on position-based methods. The taxonomy reveals a relatively sparse research direction, with only nine papers across the entire field structure.

The taxonomy positions this work within a specialized niche that bridges two broader research areas. The sibling branch 'General Deterministic Policy OPE Methods' addresses deterministic policies across diverse action spaces using kernel-based and doubly robust techniques for continuous actions, while 'Domain-Specific and Application-Driven OPE' tackles concrete applications like personalized pricing and counterfactual learning-to-rank. The paper's focus on exploiting click stochasticity distinguishes it from general-purpose deterministic policy frameworks and from domain-specific methods that do not leverage user interaction randomness as an importance weighting mechanism.

Among the fifteen candidates examined, none refuted the three core contributions. No candidates were retrieved for the CIPS estimator itself; the theoretical analysis of CIPS bias and variance was checked against five candidates with no refutations, and the CDR estimator extension against ten, again with no overlapping prior work identified. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of click-based importance weighting and deterministic logging is relatively unexplored, though the small candidate pool means the analysis cannot claim exhaustive coverage.

The limited search scope and sparse taxonomy structure indicate that this research direction is emerging rather than saturated. The absence of sibling papers in the same leaf and the small number of refutable candidates across all contributions suggest the work occupies a distinct position, though a broader literature search might reveal additional related efforts in adjacent communities or application domains not captured by the semantic search strategy employed here.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: off-policy evaluation for ranking policies under deterministic logging. The field addresses the challenge of estimating the performance of new ranking strategies when historical data comes from a fixed, deterministic logging policy that always selects the same items for each context.

The taxonomy reveals four main branches. Click-Based Importance Weighting for Ranking develops specialized inverse propensity scoring and doubly robust estimators that exploit user click feedback to handle position bias and deterministic exposure. General Deterministic Policy OPE Methods provide broader frameworks applicable beyond ranking, tackling the zero-probability problem inherent in deterministic logs. Domain-Specific and Application-Driven OPE tailors evaluation techniques to concrete settings such as advertising, personalized pricing, and recommendation systems. Actor-Critic Methods for Average Reward, while somewhat distinct, contribute policy gradient approaches that can inform evaluation under long-term average-reward criteria.

A particularly active line of work centers on click-based importance weighting, where methods like Inverse Propensity Ranking[6] and Doubly Robust Deterministic[4] refine variance reduction and bias correction for ranking contexts. These approaches contrast with more general deterministic-policy frameworks such as General Logging Policies[8], which address arbitrary action spaces without relying on click signals. Meanwhile, application-driven studies like Counterfactual Learning Ads[5] and Balanced Personalized Pricing[3] demonstrate how domain constraints shape estimator design.

Ranking Deterministic Logging[0] sits squarely within the click-based importance weighting cluster, sharing the focus on position-aware propensity scores and doubly robust corrections seen in Doubly Robust Deterministic[4] and Inverse Propensity Ranking[6]. Its emphasis on deterministic logging distinguishes it from stochastic-logging baselines, highlighting the unique identifiability and variance challenges that arise when the logging policy has no exploration randomness.
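The zero-probability problem these general frameworks tackle can be made concrete with a small simulation. The sketch below is a toy example with made-up numbers (not taken from any of the cited papers): when a deterministic logger always shows the same item, vanilla IPS never observes rewards for the other item and converges to a badly biased value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two items; context-free toy example (all values are illustrative).
true_reward = np.array([0.2, 0.6])   # expected reward of item 0 and item 1

# Deterministic logging policy: always shows item 0.
logged_item = np.zeros(n, dtype=int)
reward = rng.binomial(1, true_reward[logged_item])

# Target policy: uniform over both items; its true value is 0.4.
pi_e = np.array([0.5, 0.5])
pi_0 = np.array([1.0, 0.0])          # deterministic logger

# Vanilla IPS weight pi_e(a)/pi_0(a) on the logged actions.  Item 1 has
# zero logging propensity, so common support fails: its reward never
# enters the estimate, which converges to 0.5 * 0.2 = 0.1 instead of 0.4.
w = pi_e[logged_item] / pi_0[logged_item]
ips_estimate = np.mean(w * reward)

print(f"true value of target policy: {pi_e @ true_reward:.3f}")
print(f"IPS estimate under deterministic logging: {ips_estimate:.3f}")
```

The bias here is structural, not statistical: no amount of additional logged data recovers item 1's reward, which is exactly the identifiability challenge the taxonomy's deterministic-logging branch addresses.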

Claimed Contributions

Click-based Inverse Propensity Score (CIPS) estimator

The authors introduce CIPS, a new off-policy evaluation estimator that uses click probability as a form of importance weighting instead of relying on logging policy stochasticity. This enables low-bias OPE even under deterministic logging policies where existing methods fail.

Retrieved candidate papers: 0

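The report does not reproduce the estimator's exact form, but the mechanism described above (weighting by click probabilities rather than policy propensities) can be sketched under a standard position-based click model. Everything below, including the click model, the weight form, and the numbers, is an illustrative assumption, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Position-based click model (assumed for illustration):
# P(click item i) = examination[position of i] * relevance[i].
examination = np.array([0.9, 0.4])   # positions 0 (top) and 1
relevance = np.array([0.3, 0.7])     # items 0 and 1

pos_logged = np.array([0, 1])        # deterministic logger: item 0 on top
pos_target = np.array([1, 0])        # target policy swaps the two items

p_click_logged = examination[pos_logged] * relevance
p_click_target = examination[pos_target] * relevance

# Simulate clicks under the deterministic logging ranking.
clicks = rng.binomial(1, p_click_logged, size=(n, 2))

# Click-based importance weight (hypothetical form): ratio of click
# probabilities under the target vs. the logging ranking.  The weight is
# well defined even though the logging *policy* is deterministic, because
# the click process itself is stochastic.
w = p_click_target / p_click_logged
estimate = (clicks * w).mean(axis=0).sum()

print(f"true expected clicks under target ranking: {p_click_target.sum():.3f}")
print(f"click-weighted estimate from logged data:  {estimate:.3f}")
```

In expectation the weighted sum equals the target ranking's click rate, since each item's logging click probability cancels against the denominator of its weight.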
Theoretical analysis of CIPS bias and variance properties

The authors establish formal theoretical guarantees showing that CIPS achieves unbiasedness under click-wise common support and independence of potential rewards conditions, which are less restrictive than conditions required by existing methods. They also characterize the variance of CIPS.

Retrieved candidate papers: 5

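The unbiasedness claim presumably instantiates the standard change-of-measure argument, with click probabilities playing the role of propensities. As a generic template (not the paper's proof), for weighting probabilities $q_0$ under logging and $q_e$ under the target:

```latex
\mathbb{E}_{a \sim q_0}\!\left[\frac{q_e(a)}{q_0(a)}\, r(a)\right]
  = \sum_{a:\, q_0(a) > 0} q_0(a)\, \frac{q_e(a)}{q_0(a)}\, r(a)
  = \sum_{a:\, q_0(a) > 0} q_e(a)\, r(a)
  = \mathbb{E}_{a \sim q_e}\!\left[r(a)\right]
```

The final equality needs the common-support condition $q_e(a) > 0 \Rightarrow q_0(a) > 0$. The 'click-wise' version of this condition replaces policy propensities with click probabilities, which can remain positive even when the logging policy itself is deterministic.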
Click-based Doubly Robust (CDR) estimator extension

The authors extend CIPS to CDR by incorporating a regression model for expected potential rewards. This extension achieves the same bias as CIPS while reducing variance when the reward model is accurate.

Retrieved candidate papers: 10
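The bias/variance trade-off claimed for CDR follows the classical doubly robust pattern. The sketch below demonstrates that pattern in a plain contextual-bandit setting with a stochastic logger (illustrative only; the paper's CDR instead builds on click probabilities): with an accurate reward model, the DR estimator keeps the importance-weighted estimator's low bias while cutting its variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Classical doubly robust template (all numbers illustrative).
mu = np.array([0.2, 0.8])     # true expected reward per action
pi_0 = np.array([0.9, 0.1])   # stochastic behavior policy
pi_e = np.array([0.1, 0.9])   # target policy; true value = pi_e @ mu = 0.74

def one_run(n=400):
    a = rng.choice(2, size=n, p=pi_0)
    r = mu[a] + rng.normal(0, 0.1, size=n)
    w = pi_e[a] / pi_0[a]
    ips = np.mean(w * r)
    # DR: direct-model term plus importance-weighted residual.  Here the
    # reward model q_hat = mu is exact, so the residual term is pure
    # noise and its variance is small.
    dr = pi_e @ mu + np.mean(w * (r - mu[a]))
    return ips, dr

runs = np.array([one_run() for _ in range(300)])
ips_sd, dr_sd = runs.std(axis=0)
print(f"true value: {pi_e @ mu:.3f}")
print(f"IPS std over runs: {ips_sd:.3f}, DR std: {dr_sd:.3f}")
```

Both estimators are unbiased here, but the DR spread across runs is roughly an order of magnitude smaller, which is the variance-reduction behavior the CDR contribution claims when its reward model is accurate.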

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf containing no other papers. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Click-based Inverse Propensity Score (CIPS) estimator. CIPS uses click probability as a form of importance weighting instead of relying on logging-policy stochasticity, enabling low-bias OPE even under deterministic logging policies where existing methods fail. No candidate papers were retrieved for comparison.

Contribution 2: Theoretical analysis of CIPS bias and variance properties. The authors establish that CIPS is unbiased under click-wise common support and independence of potential rewards, conditions less restrictive than those required by existing methods, and they characterize its variance. Five retrieved candidates were compared; none refuted the claim.

Contribution 3: Click-based Doubly Robust (CDR) estimator extension. CDR extends CIPS with a regression model for expected potential rewards, matching the bias of CIPS while reducing variance when the reward model is accurate. Ten retrieved candidates were compared; none refuted the claim.
