Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Scaling Laws, Model Architecture, Inference-Efficient
Abstract:

Scaling the number of parameters and the size of the training data has proven an effective strategy for improving large language model (LLM) performance. Yet as these models grow more powerful and more widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how three key architectural factors influence both inference cost and accuracy: hidden size, the allocation of parameters between MLP and attention (the mlp-to-attention ratio), and grouped-query attention (GQA). We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate the approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, the optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput than LLaMA-3.2.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a conditional scaling law that extends the Chinchilla framework by incorporating architectural parameters—hidden size, MLP-to-attention ratio, and grouped-query attention—to predict both accuracy and inference cost. It resides in the Architecture-Conditional Scaling Laws leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader Scaling Laws and Predictive Modeling branch, suggesting the integration of architectural factors into scaling law frameworks remains an emerging area despite the maturity of compute-optimal scaling research.

The taxonomy reveals neighboring leaves focused on Training Compute-Optimal Scaling Laws (architecture-agnostic Chinchilla-style analyses) and Inference-Aware Scaling Laws (optimizing for deployment costs without explicit architectural conditioning). The paper bridges these directions by making architectural choices explicit predictors of performance under inference constraints. Nearby branches like Model Architecture Design for Efficiency explore structural innovations (MoE, hybrid models, attention mechanisms) but typically lack unified predictive frameworks. The scope_note for Architecture-Conditional Scaling Laws explicitly excludes architecture-agnostic approaches, positioning this work as addressing a gap between theoretical scaling and practical architectural diversity.

Of the 20 candidates examined across the three contributions, the conditional scaling law itself has one refutable candidate among its 10, indicating that some prior work on architecture-aware prediction exists within the limited search scope. The search framework for identifying inference-efficient architectures had no refutable candidates among its 10, suggesting this systematic optimization approach may be less explored. The characterization of architectural factors' impact was not compared against any candidates. These statistics reflect a targeted literature search rather than exhaustive coverage, and the single refutable pair for the core contribution suggests the specific formulation may overlap with existing architecture-conditional frameworks.

Based on the limited search scope of 20 candidates, the work appears to occupy a moderately novel position by unifying architectural conditioning with inference-aware optimization in a single predictive framework. The sparse population of its taxonomy leaf and the absence of refutable candidates for the search framework component suggest potential novelty in the systematic approach, though the core scaling law formulation shows some overlap with prior architecture-aware prediction efforts. The analysis does not cover exhaustive comparison with all Chinchilla extensions or architecture search methods beyond the top-K semantic matches examined.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 1

Research Landscape Overview

Core task: architecture-aware scaling laws for inference-efficient language models. The field has evolved to address the dual challenge of predicting model performance while accounting for deployment costs and architectural choices. The taxonomy reveals eight major branches spanning the full lifecycle from design to deployment.

Scaling Laws and Predictive Modeling focuses on mathematical frameworks that relate model size, data, and compute to performance, with recent extensions to architecture-conditional settings and inference budgets (Scaling Laws Architecture[0], Inference Scaling Laws[7]). Model Architecture Design for Efficiency explores structural innovations like mixture-of-experts and hybrid architectures (MoE Scaling Laws[2], Hybrid Architecture Analysis[39]), while Inference-Time Compute Scaling examines how test-time computation can be traded for accuracy (Test-Time Compute Scaling[1], Test-Time Reasoning Scaling[3]). The Training-Time Scaling, Quantization, and System Infrastructure branches address optimization strategies, low-bit representations (Ternary Scaling Laws[49]), and practical deployment concerns (Cost Modeling LLMs[33]), with additional branches covering multimodal extensions and empirical benchmarking (LLM Efficiency Evaluation[11]).

A central tension emerges between predictive accuracy and practical deployment constraints. Works like Inference-Efficient Models[8] and Beyond Chinchilla[13] challenge traditional compute-optimal training by incorporating inference costs into scaling decisions, while Inference Economics[31] explicitly models the economic trade-offs. Scaling Laws Architecture[0] sits within the Architecture-Conditional Scaling Laws cluster, emphasizing how different architectural families—dense transformers, MoE variants, or hybrid designs—exhibit distinct scaling behaviors that must be captured for accurate performance prediction under inference budgets.
This contrasts with earlier work that treated architecture as fixed, and complements neighboring efforts like Inference-Efficient Models[8] which focus more on empirical comparisons across architectures. The original paper's emphasis on architecture-aware prediction bridges the gap between theoretical scaling frameworks and the practical reality that deployment efficiency depends critically on structural choices, a theme echoed across multiple branches but rarely integrated into unified predictive models.

Claimed Contributions

Conditional scaling law augmenting Chinchilla with architectural factors

The authors propose a conditional extension of the Chinchilla scaling laws that incorporates architectural parameters such as hidden size, mlp-to-attention ratio, and grouped-query attention. This framework enables predicting model performance while accounting for architectural design choices.

10 retrieved papers · can refute
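The report describes the conditional scaling law only at a high level, so the sketch below is a guess at its shape: a standard Chinchilla term L(N, D) = E + A/N^alpha + B/D^beta plus an architectural penalty that is quadratic in log-space, consistent with the U-shaped loss curves the report attributes to hidden size and mlp-to-attention ratio. The penalty form, the assumed optima `h_opt` and `r_opt`, and all coefficients are illustrative assumptions, not the paper's fitted parameterization.

```python
import numpy as np

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Plain Chinchilla estimate L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the published Chinchilla fits, used here as placeholders."""
    return E + A / N**alpha + B / D**beta

def conditional_loss(N, D, hidden, mlp_ratio,
                     h_opt=2048.0, r_opt=3.0, c_h=0.02, c_r=0.01):
    """Hypothetical architecture-conditional extension: the base Chinchilla
    term plus U-shaped penalties in log hidden size and log
    mlp-to-attention ratio. Illustrative only."""
    penalty = (c_h * np.log(hidden / h_opt) ** 2
               + c_r * np.log(mlp_ratio / r_opt) ** 2)
    return chinchilla_loss(N, D) + penalty

# The penalty vanishes at the (assumed) optimal architecture and grows
# symmetrically in log-space on either side of it.
at_opt = conditional_loss(1e9, 20e9, hidden=2048, mlp_ratio=3.0)
off_opt = conditional_loss(1e9, 20e9, hidden=512, mlp_ratio=3.0)
```

Under this form, fitting the law reduces to estimating the Chinchilla constants jointly with the per-factor optima and curvatures from the trained model grid.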
Search framework for inference-efficient and accurate architectures

The authors develop a systematic framework (Algorithm 1) that uses the conditional scaling law to identify model architectures optimizing both inference efficiency and accuracy under fixed parameter and token budgets, including a local search procedure for grouped-query attention.

10 retrieved papers
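Algorithm 1 itself is not reproduced in this report. Going only by the description above (score candidate architectures with the fitted conditional law under a fixed budget, with a local search over GQA), a minimal sketch might look like the following; both `predicted_loss` and `predicted_throughput` are illustrative stand-ins, not the paper's fitted models.

```python
import itertools
import math

def predicted_loss(hidden, mlp_ratio, gqa_group):
    # Stand-in for the fitted conditional scaling law at a fixed
    # parameter/token budget; smaller is better. The penalty away from a
    # moderate GQA group size is an assumption for illustration.
    return (0.02 * math.log(hidden / 2048) ** 2
            + 0.01 * math.log(mlp_ratio / 3.0) ** 2
            + 0.002 * abs(gqa_group - 8))

def predicted_throughput(hidden, mlp_ratio, gqa_group):
    # Stand-in cost model: larger GQA groups (fewer KV heads) and smaller
    # hidden sizes decode faster here. Illustrative only.
    return gqa_group * 10 + mlp_ratio * 5 + 1e5 / hidden

def search(min_throughput=100.0):
    """Sketch of an Algorithm-1-style search: grid hidden size and
    mlp-to-attention ratio, locally search the GQA group count, and keep
    the lowest predicted loss that meets the throughput constraint."""
    hiddens = [1024, 1536, 2048, 3072]
    ratios = [2.0, 3.0, 4.0, 6.0]
    best, best_loss = None, float("inf")
    for h, r in itertools.product(hiddens, ratios):
        for g in (4, 8, 16):  # local search around a default group of 8
            if predicted_throughput(h, r, g) < min_throughput:
                continue
            loss = predicted_loss(h, r, g)
            if loss < best_loss:
                best, best_loss = (h, r, g), loss
    return best
```

With the real fitted law and a measured throughput model in place of the stand-ins, the same loop yields the Pareto-style selection the contribution describes.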
Characterization of architectural factors' impact on inference and accuracy

The authors systematically study how hidden size, mlp-to-attention ratio, and GQA affect both inference throughput and model accuracy by training over 200 models ranging from 80M to 3B parameters, revealing U-shaped relationships between these factors and training loss.

0 retrieved papers
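To see why these three factors span the design space, a rough parameter-count model for a LLaMA-style decoder is useful: hidden size sets the overall scale, the mlp-to-attention ratio shifts budget between blocks, and GQA shrinks the K/V projections. The function below is hypothetical accounting (bias-free linears, norms ignored, tied embeddings), not the paper's configurations.

```python
def transformer_params(hidden, n_layers, vocab, mlp_ratio, n_heads, kv_heads):
    """Rough parameter count for a LLaMA-style decoder, to show how the
    three studied factors move the budget. Illustrative accounting only."""
    head_dim = hidden // n_heads
    # Attention: Q and O projections are hidden x hidden; K and V shrink
    # with GQA, since only kv_heads of the n_heads carry K/V weights.
    attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
    # Gated MLP (SwiGLU-style): three projections of width mlp_ratio*hidden.
    mlp = 3 * hidden * int(mlp_ratio * hidden)
    emb = vocab * hidden  # tied input/output embeddings counted once
    return n_layers * (attn + mlp) + emb
```

Cutting `kv_heads` from 32 to 8, for example, removes three quarters of the K/V weights per layer, freeing budget that can be reallocated to the MLP; that reallocation is exactly what the mlp-to-attention ratio captures, and the reported U-shaped loss curves say both factors have an interior optimum rather than a monotone trend.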

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
