Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Scaling Laws, Model Architecture, Inference-Efficient
Abstract:

Scaling the number of parameters and the size of the training data has proven an effective strategy for improving large language model (LLM) performance. Yet as these models grow more powerful and more widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how three key architectural factors influence both inference cost and accuracy: hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA). We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate the approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, the optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a conditional scaling law that extends the Chinchilla framework by incorporating architectural parameters—hidden size, MLP-to-attention ratio, and grouped-query attention—to predict both accuracy and inference cost. It resides in the Architecture-Conditional Scaling Laws leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader Scaling Laws and Predictive Modeling branch, suggesting the integration of architectural factors into scaling law frameworks remains an emerging area despite the maturity of compute-optimal scaling research.

The taxonomy reveals neighboring leaves focused on Training Compute-Optimal Scaling Laws (architecture-agnostic Chinchilla-style analyses) and Inference-Aware Scaling Laws (optimizing for deployment costs without explicit architectural conditioning). The paper bridges these directions by making architectural choices explicit predictors of performance under inference constraints. Nearby branches like Model Architecture Design for Efficiency explore structural innovations (MoE, hybrid models, attention mechanisms) but typically lack unified predictive frameworks. The scope_note for Architecture-Conditional Scaling Laws explicitly excludes architecture-agnostic approaches, positioning this work as addressing a gap between theoretical scaling and practical architectural diversity.

Of the 20 candidates examined across the three contributions, the conditional scaling law itself has one refutable candidate among its 10, indicating that some prior work on architecture-aware prediction exists within the limited search scope. The search framework for identifying inference-efficient architectures had no refutable candidates among its 10, suggesting this systematic optimization approach may be less explored. The characterization of architectural factors' impact was not compared against any candidates. These statistics reflect a targeted literature search rather than exhaustive coverage, and the single refutable pair for the core contribution suggests the specific formulation may overlap with existing architecture-conditional frameworks.

Based on the limited search scope of 20 candidates, the work appears to occupy a moderately novel position by unifying architectural conditioning with inference-aware optimization in a single predictive framework. The sparse population of its taxonomy leaf and the absence of refutable candidates for the search framework component suggest potential novelty in the systematic approach, though the core scaling law formulation shows some overlap with prior architecture-aware prediction efforts. The analysis does not cover exhaustive comparison with all Chinchilla extensions or architecture search methods beyond the top-K semantic matches examined.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 1

Research Landscape Overview

Core task: architecture-aware scaling laws for inference-efficient language models. The field has evolved to address the dual challenge of predicting model performance while accounting for deployment costs and architectural choices. The taxonomy reveals eight major branches spanning the full lifecycle from design to deployment.

Scaling Laws and Predictive Modeling focuses on mathematical frameworks that relate model size, data, and compute to performance, with recent extensions to architecture-conditional settings and inference budgets (Scaling Laws Architecture[0], Inference Scaling Laws[7]). Model Architecture Design for Efficiency explores structural innovations like mixture-of-experts and hybrid architectures (MoE Scaling Laws[2], Hybrid Architecture Analysis[39]), while Inference-Time Compute Scaling examines how test-time computation can be traded for accuracy (Test-Time Compute Scaling[1], Test-Time Reasoning Scaling[3]). The Training-Time Scaling, Quantization, and System Infrastructure branches address optimization strategies, low-bit representations (Ternary Scaling Laws[49]), and practical deployment concerns (Cost Modeling LLMs[33]), with additional branches covering multimodal extensions and empirical benchmarking (LLM Efficiency Evaluation[11]).

A central tension emerges between predictive accuracy and practical deployment constraints. Works like Inference-Efficient Models[8] and Beyond Chinchilla[13] challenge traditional compute-optimal training by incorporating inference costs into scaling decisions, while Inference Economics[31] explicitly models the economic trade-offs. Scaling Laws Architecture[0] sits within the Architecture-Conditional Scaling Laws cluster, emphasizing how different architectural families—dense transformers, MoE variants, or hybrid designs—exhibit distinct scaling behaviors that must be captured for accurate performance prediction under inference budgets.
This contrasts with earlier work that treated architecture as fixed, and complements neighboring efforts like Inference-Efficient Models[8] which focus more on empirical comparisons across architectures. The original paper's emphasis on architecture-aware prediction bridges the gap between theoretical scaling frameworks and the practical reality that deployment efficiency depends critically on structural choices, a theme echoed across multiple branches but rarely integrated into unified predictive models.

Claimed Contributions

Conditional scaling law augmenting Chinchilla with architectural factors

The authors propose a conditional extension of the Chinchilla scaling laws that incorporates architectural parameters such as hidden size, mlp-to-attention ratio, and grouped-query attention. This framework enables predicting model performance while accounting for architectural design choices.

10 retrieved papers · can refute
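The report describes the conditional scaling law only at a high level, so the sketch below is a guess at its shape: a standard Chinchilla term L(N, D) = E + A/N^alpha + B/D^beta plus an architectural penalty that is quadratic in log-space, consistent with the U-shaped loss curves the report attributes to hidden size and mlp-to-attention ratio. The penalty form, the assumed optima `h_opt` and `r_opt`, and all coefficients are illustrative assumptions, not the paper's fitted parameterization.

```python
import numpy as np

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Plain Chinchilla estimate L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the published Chinchilla fits, used here as placeholders."""
    return E + A / N**alpha + B / D**beta

def conditional_loss(N, D, hidden, mlp_ratio,
                     h_opt=2048.0, r_opt=3.0, c_h=0.02, c_r=0.01):
    """Hypothetical architecture-conditional extension: the base Chinchilla
    term plus U-shaped penalties in log hidden size and log
    mlp-to-attention ratio. Illustrative only."""
    penalty = (c_h * np.log(hidden / h_opt) ** 2
               + c_r * np.log(mlp_ratio / r_opt) ** 2)
    return chinchilla_loss(N, D) + penalty

# The penalty vanishes at the (assumed) optimal architecture and grows
# symmetrically in log-space on either side of it.
at_opt = conditional_loss(1e9, 20e9, hidden=2048, mlp_ratio=3.0)
off_opt = conditional_loss(1e9, 20e9, hidden=512, mlp_ratio=3.0)
```

Under this form, fitting the law reduces to estimating the Chinchilla constants jointly with the per-factor optima and curvatures from the trained model grid.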
Search framework for inference-efficient and accurate architectures

The authors develop a systematic framework (Algorithm 1) that uses the conditional scaling law to identify model architectures optimizing both inference efficiency and accuracy under fixed parameter and token budgets, including a local search procedure for grouped-query attention.

10 retrieved papers
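Algorithm 1 itself is not reproduced in this report. Going only by the description above (score candidate architectures with the fitted conditional law under a fixed budget, with a local search over GQA), a minimal sketch might look like the following; both `predicted_loss` and `predicted_throughput` are illustrative stand-ins, not the paper's fitted models.

```python
import itertools
import math

def predicted_loss(hidden, mlp_ratio, gqa_group):
    # Stand-in for the fitted conditional scaling law at a fixed
    # parameter/token budget; smaller is better. The penalty away from a
    # moderate GQA group size is an assumption for illustration.
    return (0.02 * math.log(hidden / 2048) ** 2
            + 0.01 * math.log(mlp_ratio / 3.0) ** 2
            + 0.002 * abs(gqa_group - 8))

def predicted_throughput(hidden, mlp_ratio, gqa_group):
    # Stand-in cost model: larger GQA groups (fewer KV heads) and smaller
    # hidden sizes decode faster here. Illustrative only.
    return gqa_group * 10 + mlp_ratio * 5 + 1e5 / hidden

def search(min_throughput=100.0):
    """Sketch of an Algorithm-1-style search: grid hidden size and
    mlp-to-attention ratio, locally search the GQA group count, and keep
    the lowest predicted loss that meets the throughput constraint."""
    hiddens = [1024, 1536, 2048, 3072]
    ratios = [2.0, 3.0, 4.0, 6.0]
    best, best_loss = None, float("inf")
    for h, r in itertools.product(hiddens, ratios):
        for g in (4, 8, 16):  # local search around a default group of 8
            if predicted_throughput(h, r, g) < min_throughput:
                continue
            loss = predicted_loss(h, r, g)
            if loss < best_loss:
                best, best_loss = (h, r, g), loss
    return best
```

With the real fitted law and a measured throughput model in place of the stand-ins, the same loop yields the Pareto-style selection the contribution describes.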
Characterization of architectural factors' impact on inference and accuracy

The authors systematically study how hidden size, mlp-to-attention ratio, and GQA affect both inference throughput and model accuracy by training over 200 models ranging from 80M to 3B parameters, revealing U-shaped relationships between these factors and training loss.

0 retrieved papers
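To see why these three factors span the design space, a rough parameter-count model for a LLaMA-style decoder is useful: hidden size sets the overall scale, the mlp-to-attention ratio shifts budget between blocks, and GQA shrinks the K/V projections. The function below is hypothetical accounting (bias-free linears, norms ignored, tied embeddings), not the paper's configurations.

```python
def transformer_params(hidden, n_layers, vocab, mlp_ratio, n_heads, kv_heads):
    """Rough parameter count for a LLaMA-style decoder, to show how the
    three studied factors move the budget. Illustrative accounting only."""
    head_dim = hidden // n_heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink
    # with GQA, since only kv_heads of the n_heads carry K/V weights.
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # Gated MLP (SwiGLU-style): three projections of width mlp_ratio*hidden.
    mlp = 3 * hidden * int(mlp_ratio * hidden)
    emb = vocab * hidden  # tied input/output embeddings counted once
    return n_layers * (attn + mlp) + emb
```

Cutting `kv_heads` from 32 to 8, for example, removes three quarters of the K/V weights per layer, freeing budget that can be reallocated to the MLP; that reallocation is exactly what the mlp-to-attention ratio captures, and the reported U-shaped loss curves say both factors have an interior optimum rather than a monotone trend.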

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
