Cost-of-Pass: An Economic Framework for Evaluating Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: economic evaluation framework, language-model evaluation, cost-performance trade-off, inference-time techniques
Abstract:

The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and cost. Building on production theory, we develop an economically grounded framework for evaluating language models that combines accuracy and inference cost. We formalize cost-of-pass, the expected monetary cost of generating a correct solution. We then define the frontier cost-of-pass as the minimum cost-of-pass achievable across the available models or a human expert, using the approximate cost of hiring one. Our analysis yields distinct economic insights. First, lightweight models are the most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite their higher per-token costs. Second, tracking the frontier cost-of-pass over the past year reveals substantial progress, particularly on complex quantitative tasks, where the cost has roughly halved every few months. Third, to trace the key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency in the absence of specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier on basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost reductions afforded by common inference-time techniques (majority voting and self-refinement) and by a budget-aware technique (TALE-EP). We find that performance-oriented methods with marginal gains rarely justify their costs, while TALE-EP shows some promise. Overall, our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces cost-of-pass, a metric combining accuracy and inference cost to evaluate language models economically. It resides in the 'Cost-Accuracy Tradeoff Metrics and Theoretical Frameworks' leaf, which contains only two papers total. This sparse population suggests the research direction—formal economic frameworks for LLM evaluation—remains relatively underdeveloped compared to optimization-heavy branches like Model Optimization and Compression Techniques or Inference Acceleration Methods, which collectively house over twenty papers. The sibling paper examines training-time scaling laws, whereas this work focuses on deployment-stage cost-effectiveness.

The taxonomy reveals neighboring leaves addressing related but distinct concerns. 'Empirical Model Selection and Routing Systems' contains six papers on dynamic query routing, emphasizing operational deployment rather than theoretical metrics. 'Structural Pruning and Parameter Reduction' and 'Quantization and Low-Rank Approximation' focus on model compression without explicit cost-accuracy formalization. The scope note for the parent branch clarifies that technical optimization methods lacking formal tradeoff frameworks belong elsewhere, positioning this work as foundational theory rather than applied technique. The frontier cost-of-pass concept bridges economic theory and practical model comparison.

Among the twenty-one candidates examined, none clearly refutes the three core contributions. For the cost-of-pass metric, ten candidates were compared with zero refutations; for the frontier framework with a human-expert baseline, no overlapping prior work was found across ten candidates; and the counterfactual frontier analysis was compared against one candidate, also without refutation. The limited search scope (top-K semantic matches plus citation expansion) means the contributions appear novel only within the examined sample. Still, the sparse sibling count and the absence of refutations across all contributions indicate that the work occupies relatively unexplored conceptual territory, though an exhaustive literature review would strengthen this assessment.

Given the constrained search and the sparse taxonomy leaf, the work appears to introduce genuinely new evaluation constructs. Its economic grounding and frontier-based analysis distinguish it from empirical routing systems and compression benchmarks. However, the analysis covers only twenty-one candidates drawn from semantic search, leaving open whether broader surveys or domain-specific venues contain related frameworks. The taxonomy structure itself, which shows minimal prior work on formal cost-accuracy metrics, corroborates the impression of novelty within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating language models by combining accuracy and inference cost.

The field has organized itself around several complementary perspectives. Economic and Cost-Efficiency Frameworks establish theoretical foundations and metrics for balancing quality against computational expense, as seen in works like Cost-of-Pass[0] and Beyond Chinchilla-Optimal[32]. Model Optimization and Compression Techniques focus on reducing model size and memory footprint through pruning, quantization, and parameter sharing (e.g., Efficient Expert Pruning[5], Joint Pruning Parameter Sharing[6]). Inference Acceleration Methods target runtime speedups via caching, speculative decoding, and adaptive compute strategies (e.g., Reward-Guided Speculative Decoding[18], Adaptive Compute Inference[48]). Task-Specific Evaluation branches examine domain-tailored quality measures, while Multi-Model Collaboration explores routing queries across heterogeneous models (e.g., Hybrid LLM Query Routing[10]), and Specialized Optimization addresses niche settings such as edge deployment or domain-specific constraints.

Recent work highlights a tension between static compression and dynamic resource allocation. Some studies pursue aggressive model shrinking to minimize per-token costs, whereas others advocate adaptive inference pipelines that adjust compute on the fly based on query difficulty. Cost-of-Pass[0] sits squarely within the Economic and Cost-Efficiency branch, proposing a unified metric that captures both correctness and the cumulative inference expense required to achieve it. This contrasts with Beyond Chinchilla-Optimal[32], which examines training-time scaling laws but shares a similar philosophy of optimizing total resource budgets rather than isolated accuracy targets.
Together, these directions underscore an emerging consensus: evaluating language models demands joint consideration of what they achieve and what they consume, prompting richer frameworks that go beyond traditional benchmark leaderboards.

Claimed Contributions

Cost-of-pass metric for evaluating language models

The authors propose a new metric called cost-of-pass that quantifies the expected monetary cost to achieve a successful output for a given problem. This metric integrates both model performance (probability of correctness) and inference cost into a single economically interpretable measure, adapting concepts from production theory to language model evaluation.

10 retrieved papers
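The metric described above can be sketched in a few lines. This is a minimal illustration, assuming cost-of-pass is the expected per-attempt inference cost divided by the probability of a correct answer (the function name and the dollar figures are illustrative, not the paper's implementation):

```python
def cost_of_pass(attempt_cost: float, success_rate: float) -> float:
    """Expected monetary cost of obtaining one correct solution.

    With independent attempts, the expected number of tries until the
    first success is 1 / success_rate, so the expected spend is
    attempt_cost / success_rate.
    """
    if success_rate <= 0.0:
        return float("inf")  # the model never solves the task
    return attempt_cost / success_rate


# Illustrative numbers (not from the paper): a model that charges
# $0.002 per attempt and solves the task 40% of the time.
print(cost_of_pass(0.002, 0.40))  # ~$0.005 expected per correct solution
```

Note how the measure penalizes cheap-but-unreliable models: halving the per-attempt price helps only if the success rate does not fall by more than half.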
Frontier cost-of-pass framework with human-expert baseline

The authors develop a framework that defines the frontier cost-of-pass as the minimum achievable cost across all available language models and a human-expert baseline. This provides an economically grounded reference point for evaluating whether AI systems offer cost advantages over human labor.

10 retrieved papers
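The frontier construction can be sketched as a minimum over candidate "producers". This is a hedged sketch, assuming the frontier is the smallest cost-of-pass across the available models plus a human-expert baseline that always succeeds; the model names, prices, success rates, and the $50 expert figure are all hypothetical:

```python
def cost_of_pass(attempt_cost, success_rate):
    # Expected spend per correct solution (infinite if the model never succeeds).
    return attempt_cost / success_rate if success_rate > 0 else float("inf")


def frontier_cost_of_pass(models, expert_cost):
    """Minimum cost-of-pass across models and the human-expert baseline.

    `models` maps a model name to (per-attempt cost, success rate);
    `expert_cost` approximates the cost of hiring an expert, treated as
    a baseline that solves the task with certainty.
    """
    per_producer = {name: cost_of_pass(c, r) for name, (c, r) in models.items()}
    per_producer["human-expert"] = expert_cost
    best = min(per_producer, key=per_producer.get)
    return best, per_producer[best]


# Hypothetical candidates: (per-attempt cost in $, success rate).
models = {
    "lightweight": (0.001, 0.20),   # cost-of-pass $0.005
    "large":       (0.020, 0.80),   # cost-of-pass $0.025
    "reasoning":   (0.100, 0.95),   # cost-of-pass ~$0.105
}
print(frontier_cost_of_pass(models, expert_cost=50.0))
# The lightweight model defines the frontier in this toy example.
```

The expert baseline matters when no model is reliable: if every model's success rate approaches zero, the frontier falls back to the human cost rather than diverging to infinity.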
Counterfactual frontier analysis for model family contributions

The authors introduce a counterfactual analysis method that quantifies the essential contribution of different model families (lightweight, large, and reasoning models) to cost-efficiency progress. This reveals which model innovations have been most impactful for different task categories by estimating what the frontier would be without each family.

1 retrieved paper
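The counterfactual analysis can be sketched by recomputing the frontier with one model family removed; the gap between the counterfactual and actual frontiers is that family's essential contribution. All families, prices, and success rates below are hypothetical, and family membership is encoded in the name purely for illustration:

```python
def cost_of_pass(attempt_cost, success_rate):
    return attempt_cost / success_rate if success_rate > 0 else float("inf")


def frontier(models):
    # Minimum cost-of-pass over the given pool (assumed to include the
    # human-expert baseline as an entry with success rate 1.0).
    return min(cost_of_pass(c, r) for c, r in models.values())


def counterfactual_gain(models, family_prefix):
    """Frontier increase if every model in the named family were removed.

    A large gain means the family is essential to current cost-efficiency;
    zero means other producers already match its best cost-of-pass.
    """
    kept = {n: cr for n, cr in models.items() if not n.startswith(family_prefix)}
    return frontier(kept) - frontier(models)


# Hypothetical pool; names encode the family each model belongs to.
models = {
    "lightweight-a": (0.001, 0.25),  # cost-of-pass $0.004
    "large-a":       (0.020, 0.80),  # cost-of-pass $0.025
    "reasoning-a":   (0.100, 0.95),  # cost-of-pass ~$0.105
    "human-expert":  (50.0, 1.00),   # baseline: $50 per solved task
}
print(counterfactual_gain(models, "lightweight"))
# Without lightweight models the frontier rises from $0.004 to $0.025,
# so their essential contribution on this toy task is $0.021.
```

Repeating this per task category is what lets the analysis attribute progress on basic quantitative, knowledge-intensive, and complex quantitative tasks to different families.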

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cost-of-pass metric for evaluating language models

The authors propose a new metric called cost-of-pass that quantifies the expected monetary cost to achieve a successful output for a given problem. This metric integrates both model performance (probability of correctness) and inference cost into a single economically interpretable measure, adapting concepts from production theory to language model evaluation.

Contribution

Frontier cost-of-pass framework with human-expert baseline

The authors develop a framework that defines the frontier cost-of-pass as the minimum achievable cost across all available language models and a human-expert baseline. This provides an economically grounded reference point for evaluating whether AI systems offer cost advantages over human labor.

Contribution

Counterfactual frontier analysis for model family contributions

The authors introduce a counterfactual analysis method that quantifies the essential contribution of different model families (lightweight, large, and reasoning models) to cost-efficiency progress. This reveals which model innovations have been most impactful for different task categories by estimating what the frontier would be without each family.