Off-Policy Evaluation for Ranking Policies under Deterministic Logging Policies

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: off-policy evaluation; ranking; common support; deterministic logging
Abstract:

Off-Policy Evaluation (OPE) is an important practical problem in algorithmic ranking systems, where the goal is to estimate the expected performance of a new ranking policy using only offline logged data collected under a different logging policy. Existing estimators, such as the ranking-wise and position-wise inverse propensity score (IPS) estimators, require the data-collection policy to be sufficiently stochastic and suffer from severe bias when the logging policy is deterministic. In this paper, we propose novel estimators, Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR), which exploit the intrinsic stochasticity of user click behavior to address this challenge. Unlike existing methods that rely on the stochasticity of the logging policy, our approach uses click probability as a new form of importance weighting, enabling low-bias OPE even under deterministic logging policies where existing methods incur substantial bias. We provide theoretical analyses of the bias and variance properties of the proposed estimators and show, through synthetic and real-world experiments, that our estimators achieve significantly lower bias compared to strong baselines, particularly in settings with completely deterministic logging policies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Click-based Inverse Propensity Score (CIPS) and Click-based Doubly Robust (CDR) estimators for off-policy evaluation of ranking policies under deterministic logging. According to the taxonomy, it occupies the 'Click-Based IPS and Doubly Robust Estimators' leaf, which currently contains no paper other than this one. This leaf sits within the broader 'Click-Based Importance Weighting for Ranking' branch, which also includes a single sibling leaf on position-based methods. The taxonomy reveals a relatively sparse research direction, with only nine papers across the entire field structure.

The taxonomy positions this work within a specialized niche that bridges two broader research areas. The sibling branch 'General Deterministic Policy OPE Methods' addresses deterministic policies across diverse action spaces using kernel-based and doubly robust techniques for continuous actions, while 'Domain-Specific and Application-Driven OPE' tackles concrete applications like personalized pricing and counterfactual learning-to-rank. The paper's focus on exploiting click stochasticity distinguishes it from general-purpose deterministic policy frameworks and from domain-specific methods that do not leverage user interaction randomness as an importance weighting mechanism.

Among the fifteen candidates examined, none refuted the three core contributions. No candidates were retrieved for the CIPS estimator itself; the theoretical analysis of CIPS bias and variance was checked against five candidates with no refutations, and the CDR estimator extension against ten, again with no overlapping prior work identified. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of click-based importance weighting and deterministic logging is relatively unexplored, though the small candidate pool means the analysis cannot claim exhaustive coverage.

The limited search scope and sparse taxonomy structure indicate that this research direction is emerging rather than saturated. The absence of sibling papers in the same leaf and the small number of refutable candidates across all contributions suggest the work occupies a distinct position, though a broader literature search might reveal additional related efforts in adjacent communities or application domains not captured by the semantic search strategy employed here.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: off-policy evaluation for ranking policies under deterministic logging. The field addresses the challenge of estimating the performance of new ranking strategies when historical data comes from a fixed, deterministic logging policy that always selects the same items for each context.

The taxonomy reveals four main branches. Click-Based Importance Weighting for Ranking develops specialized inverse propensity scoring and doubly robust estimators that exploit user click feedback to handle position bias and deterministic exposure. General Deterministic Policy OPE Methods provide broader frameworks applicable beyond ranking, tackling the zero-probability problem inherent in deterministic logs. Domain-Specific and Application-Driven OPE tailors evaluation techniques to concrete settings such as advertising, personalized pricing, and recommendation systems. Actor-Critic Methods for Average Reward, while somewhat distinct, contribute policy gradient approaches that can inform evaluation under long-term average-reward criteria.

A particularly active line of work centers on click-based importance weighting, where methods like Inverse Propensity Ranking[6] and Doubly Robust Deterministic[4] refine variance reduction and bias correction for ranking contexts. These approaches contrast with more general deterministic-policy frameworks such as General Logging Policies[8], which address arbitrary action spaces without relying on click signals. Meanwhile, application-driven studies like Counterfactual Learning Ads[5] and Balanced Personalized Pricing[3] demonstrate how domain constraints shape estimator design.

Ranking Deterministic Logging[0] sits squarely within the click-based importance weighting cluster, sharing the focus on position-aware propensity scores and doubly robust corrections seen in Doubly Robust Deterministic[4] and Inverse Propensity Ranking[6]. Its emphasis on deterministic logging distinguishes it from stochastic-logging baselines, highlighting the unique identifiability and variance challenges that arise when the logging policy has no exploration randomness.
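The zero-probability problem these general frameworks tackle can be made concrete with a small simulation. The sketch below is a toy example with made-up numbers (not taken from any of the cited papers): when a deterministic logger always shows the same item, vanilla IPS never observes rewards for the other item and converges to a badly biased value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two items; context-free toy example (all values are illustrative).
true_reward = np.array([0.2, 0.6])   # expected reward of item 0 and item 1

# Deterministic logging policy: always shows item 0.
logged_item = np.zeros(n, dtype=int)
reward = rng.binomial(1, true_reward[logged_item])

# Target policy: uniform over both items; its true value is 0.4.
pi_e = np.array([0.5, 0.5])
pi_0 = np.array([1.0, 0.0])          # deterministic logger

# Vanilla IPS weight pi_e(a)/pi_0(a) on the logged actions.  Item 1 has
# zero logging propensity, so common support fails: its reward never
# enters the estimate, which converges to 0.5 * 0.2 = 0.1 instead of 0.4.
w = pi_e[logged_item] / pi_0[logged_item]
ips_estimate = np.mean(w * reward)

print(f"true value of target policy: {pi_e @ true_reward:.3f}")
print(f"IPS estimate under deterministic logging: {ips_estimate:.3f}")
```

The bias here is structural, not statistical: no amount of additional logged data recovers item 1's reward, which is exactly the identifiability challenge the taxonomy's deterministic-logging branch addresses.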

Claimed Contributions

Click-based Inverse Propensity Score (CIPS) estimator

The authors introduce CIPS, a new off-policy evaluation estimator that uses click probability as a form of importance weighting instead of relying on logging policy stochasticity. This enables low-bias OPE even under deterministic logging policies where existing methods fail.

Retrieved candidate papers: 0

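The report does not reproduce the estimator's exact form, but the mechanism described above (weighting by click probabilities rather than policy propensities) can be sketched under a standard position-based click model. Everything below, including the click model, the weight form, and the numbers, is an illustrative assumption, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Position-based click model (assumed for illustration):
# P(click item i) = examination[position of i] * relevance[i].
examination = np.array([0.9, 0.4])   # positions 0 (top) and 1
relevance = np.array([0.3, 0.7])     # items 0 and 1

pos_logged = np.array([0, 1])        # deterministic logger: item 0 on top
pos_target = np.array([1, 0])        # target policy swaps the two items

p_click_logged = examination[pos_logged] * relevance
p_click_target = examination[pos_target] * relevance

# Simulate clicks under the deterministic logging ranking.
clicks = rng.binomial(1, p_click_logged, size=(n, 2))

# Click-based importance weight (hypothetical form): ratio of click
# probabilities under the target vs. the logging ranking.  The weight is
# well defined even though the logging *policy* is deterministic, because
# the click process itself is stochastic.
w = p_click_target / p_click_logged
estimate = (clicks * w).mean(axis=0).sum()

print(f"true expected clicks under target ranking: {p_click_target.sum():.3f}")
print(f"click-weighted estimate from logged data:  {estimate:.3f}")
```

In expectation the weighted sum equals the target ranking's click rate, since each item's logging click probability cancels against the denominator of its weight.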
Theoretical analysis of CIPS bias and variance properties

The authors establish formal theoretical guarantees showing that CIPS achieves unbiasedness under click-wise common support and independence of potential rewards conditions, which are less restrictive than conditions required by existing methods. They also characterize the variance of CIPS.

Retrieved candidate papers: 5

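The unbiasedness claim presumably instantiates the standard change-of-measure argument, with click probabilities playing the role of propensities. As a generic template (not the paper's proof), for weighting probabilities $q_0$ under logging and $q_e$ under the target:

```latex
\mathbb{E}_{a \sim q_0}\!\left[\frac{q_e(a)}{q_0(a)}\, r(a)\right]
  = \sum_{a:\, q_0(a) > 0} q_0(a)\, \frac{q_e(a)}{q_0(a)}\, r(a)
  = \sum_{a:\, q_0(a) > 0} q_e(a)\, r(a)
  = \mathbb{E}_{a \sim q_e}\!\left[r(a)\right]
```

The final equality needs the common-support condition $q_e(a) > 0 \Rightarrow q_0(a) > 0$. The 'click-wise' version of this condition replaces policy propensities with click probabilities, which can remain positive even when the logging policy itself is deterministic.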
Click-based Doubly Robust (CDR) estimator extension

The authors extend CIPS to CDR by incorporating a regression model for expected potential rewards. This extension achieves the same bias as CIPS while reducing variance when the reward model is accurate.

Retrieved candidate papers: 10
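The bias/variance trade-off claimed for CDR follows the classical doubly robust pattern. The sketch below demonstrates that pattern in a plain contextual-bandit setting with a stochastic logger (illustrative only; the paper's CDR instead builds on click probabilities): with an accurate reward model, the DR estimator keeps the importance-weighted estimator's low bias while cutting its variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Classical doubly robust template (all numbers illustrative).
mu = np.array([0.2, 0.8])     # true expected reward per action
pi_0 = np.array([0.9, 0.1])   # stochastic behavior policy
pi_e = np.array([0.1, 0.9])   # target policy; true value = pi_e @ mu = 0.74

def one_run(n=400):
    a = rng.choice(2, size=n, p=pi_0)
    r = mu[a] + rng.normal(0, 0.1, size=n)
    w = pi_e[a] / pi_0[a]
    ips = np.mean(w * r)
    # DR: direct-model term plus importance-weighted residual.  Here the
    # reward model q_hat = mu is exact, so the residual term is pure
    # noise and its variance is small.
    dr = pi_e @ mu + np.mean(w * (r - mu[a]))
    return ips, dr

runs = np.array([one_run() for _ in range(300)])
ips_sd, dr_sd = runs.std(axis=0)
print(f"true value: {pi_e @ mu:.3f}")
print(f"IPS std over runs: {ips_sd:.3f}, DR std: {dr_sd:.3f}")
```

Both estimators are unbiased here, but the DR spread across runs is roughly an order of magnitude smaller, which is the variance-reduction behavior the CDR contribution claims when its reward model is accurate.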

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf containing no other papers. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Click-based Inverse Propensity Score (CIPS) estimator. CIPS uses click probability as a form of importance weighting instead of relying on logging-policy stochasticity, enabling low-bias OPE even under deterministic logging policies where existing methods fail. No candidate papers were retrieved for comparison.

Contribution 2: Theoretical analysis of CIPS bias and variance properties. The authors establish that CIPS is unbiased under click-wise common support and independence of potential rewards, conditions less restrictive than those required by existing methods, and they characterize its variance. Five retrieved candidates were compared; none refuted the claim.

Contribution 3: Click-based Doubly Robust (CDR) estimator extension. CDR extends CIPS with a regression model for expected potential rewards, matching the bias of CIPS while reducing variance when the reward model is accurate. Ten retrieved candidates were compared; none refuted the claim.
