Cost-of-Pass: An Economic Framework for Evaluating Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: economic evaluation framework, language-model evaluation, cost-performance trade-off, inference-time techniques
Abstract:

The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and cost. Building on production theory, we develop an economically grounded framework for evaluating language models that combines accuracy and inference cost. We formalize cost-of-pass, the expected monetary cost of generating a correct solution. We then define the frontier cost-of-pass as the minimum cost-of-pass achievable across the available models or a human expert, using the approximate cost of hiring one. Our analysis yields distinct economic insights. First, lightweight models are the most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite their higher per-token costs. Second, tracking the frontier cost-of-pass over the past year reveals substantial progress, particularly on complex quantitative tasks, where the cost has roughly halved every few months. Third, to trace the key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency in the absence of specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier on basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost reductions afforded by common inference-time techniques (majority voting and self-refinement) and by a budget-aware technique (TALE-EP). We find that performance-oriented methods with marginal gains rarely justify their costs, while TALE-EP shows some promise. Overall, our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces cost-of-pass, a metric combining accuracy and inference cost to evaluate language models economically. It resides in the 'Cost-Accuracy Tradeoff Metrics and Theoretical Frameworks' leaf, which contains only two papers total. This sparse population suggests the research direction—formal economic frameworks for LLM evaluation—remains relatively underdeveloped compared to optimization-heavy branches like Model Optimization and Compression Techniques or Inference Acceleration Methods, which collectively house over twenty papers. The sibling paper examines training-time scaling laws, whereas this work focuses on deployment-stage cost-effectiveness.

The taxonomy reveals neighboring leaves addressing related but distinct concerns. 'Empirical Model Selection and Routing Systems' contains six papers on dynamic query routing, emphasizing operational deployment rather than theoretical metrics. 'Structural Pruning and Parameter Reduction' and 'Quantization and Low-Rank Approximation' focus on model compression without explicit cost-accuracy formalization. The scope note for the parent branch clarifies that technical optimization methods lacking formal tradeoff frameworks belong elsewhere, positioning this work as foundational theory rather than applied technique. The frontier cost-of-pass concept bridges economic theory and practical model comparison.

Among the twenty-one candidates examined, none clearly refutes the three core contributions. For the cost-of-pass metric, ten candidates were compared with zero refutations; for the frontier framework with a human-expert baseline, no overlapping prior work was found across ten candidates; and the counterfactual frontier analysis was compared against one candidate, also without refutation. The limited search scope (top-K semantic matches plus citation expansion) means the contributions appear novel only within the examined sample. Still, the sparse sibling count and the absence of refutations across all contributions indicate that the work occupies relatively unexplored conceptual territory, though an exhaustive literature review would strengthen this assessment.

Given the constrained search and the sparse taxonomy leaf, the work appears to introduce genuinely new evaluation constructs. Its economic grounding and frontier-based analysis distinguish it from empirical routing systems and compression benchmarks. However, the analysis covers only twenty-one candidates drawn from semantic search, leaving open whether broader surveys or domain-specific venues contain related frameworks. The taxonomy structure itself, which shows minimal prior work on formal cost-accuracy metrics, corroborates the impression of novelty within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating language models by combining accuracy and inference cost.

The field has organized itself around several complementary perspectives. Economic and Cost-Efficiency Frameworks establish theoretical foundations and metrics for balancing quality against computational expense, as seen in works like Cost-of-Pass[0] and Beyond Chinchilla-Optimal[32]. Model Optimization and Compression Techniques focus on reducing model size and memory footprint through pruning, quantization, and parameter sharing (e.g., Efficient Expert Pruning[5], Joint Pruning Parameter Sharing[6]). Inference Acceleration Methods target runtime speedups via caching, speculative decoding, and adaptive compute strategies (e.g., Reward-Guided Speculative Decoding[18], Adaptive Compute Inference[48]). Task-Specific Evaluation branches examine domain-tailored quality measures, while Multi-Model Collaboration explores routing queries across heterogeneous models (e.g., Hybrid LLM Query Routing[10]), and Specialized Optimization addresses niche settings such as edge deployment or domain-specific constraints.

Recent work highlights a tension between static compression and dynamic resource allocation. Some studies pursue aggressive model shrinking to minimize per-token costs, whereas others advocate adaptive inference pipelines that adjust compute on the fly based on query difficulty. Cost-of-Pass[0] sits squarely within the Economic and Cost-Efficiency branch, proposing a unified metric that captures both correctness and the cumulative inference expense required to achieve it. This contrasts with Beyond Chinchilla-Optimal[32], which examines training-time scaling laws but shares a similar philosophy of optimizing total resource budgets rather than isolated accuracy targets.
Together, these directions underscore an emerging consensus: evaluating language models demands joint consideration of what they achieve and what they consume, prompting richer frameworks that go beyond traditional benchmark leaderboards.

Claimed Contributions

Cost-of-pass metric for evaluating language models

The authors propose a new metric called cost-of-pass that quantifies the expected monetary cost to achieve a successful output for a given problem. This metric integrates both model performance (probability of correctness) and inference cost into a single economically interpretable measure, adapting concepts from production theory to language model evaluation.

10 retrieved papers
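The metric described above can be sketched in a few lines. This is a minimal illustration, assuming cost-of-pass is the expected per-attempt inference cost divided by the probability of a correct answer (the function name and the dollar figures are illustrative, not the paper's implementation):

```python
def cost_of_pass(attempt_cost: float, success_rate: float) -> float:
    """Expected monetary cost of obtaining one correct solution.

    With independent attempts, the expected number of tries until the
    first success is 1 / success_rate, so the expected spend is
    attempt_cost / success_rate.
    """
    if success_rate <= 0.0:
        return float("inf")  # the model never solves the task
    return attempt_cost / success_rate


# Illustrative numbers (not from the paper): a model that charges
# $0.002 per attempt and solves the task 40% of the time.
print(cost_of_pass(0.002, 0.40))  # ~$0.005 expected per correct solution
```

Note how the measure penalizes cheap-but-unreliable models: halving the per-attempt price helps only if the success rate does not fall by more than half.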
Frontier cost-of-pass framework with human-expert baseline

The authors develop a framework that defines the frontier cost-of-pass as the minimum achievable cost across all available language models and a human-expert baseline. This provides an economically grounded reference point for evaluating whether AI systems offer cost advantages over human labor.

10 retrieved papers
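The frontier construction can be sketched as a minimum over candidate "producers". This is a hedged sketch, assuming the frontier is the smallest cost-of-pass across the available models plus a human-expert baseline that always succeeds; the model names, prices, success rates, and the $50 expert figure are all hypothetical:

```python
def cost_of_pass(attempt_cost, success_rate):
    # Expected spend per correct solution (infinite if the model never succeeds).
    return attempt_cost / success_rate if success_rate > 0 else float("inf")


def frontier_cost_of_pass(models, expert_cost):
    """Minimum cost-of-pass across models and the human-expert baseline.

    `models` maps a model name to (per-attempt cost, success rate);
    `expert_cost` approximates the cost of hiring an expert, treated as
    a baseline that solves the task with certainty.
    """
    per_producer = {name: cost_of_pass(c, r) for name, (c, r) in models.items()}
    per_producer["human-expert"] = expert_cost
    best = min(per_producer, key=per_producer.get)
    return best, per_producer[best]


# Hypothetical candidates: (per-attempt cost in $, success rate).
models = {
    "lightweight": (0.001, 0.20),   # cost-of-pass $0.005
    "large":       (0.020, 0.80),   # cost-of-pass $0.025
    "reasoning":   (0.100, 0.95),   # cost-of-pass ~$0.105
}
print(frontier_cost_of_pass(models, expert_cost=50.0))
# The lightweight model defines the frontier in this toy example.
```

The expert baseline matters when no model is reliable: if every model's success rate approaches zero, the frontier falls back to the human cost rather than diverging to infinity.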
Counterfactual frontier analysis for model family contributions

The authors introduce a counterfactual analysis method that quantifies the essential contribution of different model families (lightweight, large, and reasoning models) to cost-efficiency progress. This reveals which model innovations have been most impactful for different task categories by estimating what the frontier would be without each family.

1 retrieved paper
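The counterfactual analysis can be sketched by recomputing the frontier with one model family removed; the gap between the counterfactual and actual frontiers is that family's essential contribution. All families, prices, and success rates below are hypothetical, and family membership is encoded in the name purely for illustration:

```python
def cost_of_pass(attempt_cost, success_rate):
    return attempt_cost / success_rate if success_rate > 0 else float("inf")


def frontier(models):
    # Minimum cost-of-pass over the given pool (assumed to include the
    # human-expert baseline as an entry with success rate 1.0).
    return min(cost_of_pass(c, r) for c, r in models.values())


def counterfactual_gain(models, family_prefix):
    """Frontier increase if every model in the named family were removed.

    A large gain means the family is essential to current cost-efficiency;
    zero means other producers already match its best cost-of-pass.
    """
    kept = {n: cr for n, cr in models.items() if not n.startswith(family_prefix)}
    return frontier(kept) - frontier(models)


# Hypothetical pool; names encode the family each model belongs to.
models = {
    "lightweight-a": (0.001, 0.25),  # cost-of-pass $0.004
    "large-a":       (0.020, 0.80),  # cost-of-pass $0.025
    "reasoning-a":   (0.100, 0.95),  # cost-of-pass ~$0.105
    "human-expert":  (50.0, 1.00),   # baseline: $50 per solved task
}
print(counterfactual_gain(models, "lightweight"))
# Without lightweight models the frontier rises from $0.004 to $0.025,
# so their essential contribution on this toy task is $0.021.
```

Repeating this per task category is what lets the analysis attribute progress on basic quantitative, knowledge-intensive, and complex quantitative tasks to different families.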

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Cost-of-pass metric for evaluating language models

The authors propose a new metric called cost-of-pass that quantifies the expected monetary cost to achieve a successful output for a given problem. This metric integrates both model performance (probability of correctness) and inference cost into a single economically interpretable measure, adapting concepts from production theory to language model evaluation.

Contribution

Frontier cost-of-pass framework with human-expert baseline

The authors develop a framework that defines the frontier cost-of-pass as the minimum achievable cost across all available language models and a human-expert baseline. This provides an economically grounded reference point for evaluating whether AI systems offer cost advantages over human labor.

Contribution

Counterfactual frontier analysis for model family contributions

The authors introduce a counterfactual analysis method that quantifies the essential contribution of different model families (lightweight, large, and reasoning models) to cost-efficiency progress. This reveals which model innovations have been most impactful for different task categories by estimating what the frontier would be without each family.