Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
Overview
Overall Novelty Assessment
The paper proposes a conditional scaling law that extends the Chinchilla framework by incorporating architectural parameters—hidden size, MLP-to-attention ratio, and grouped-query attention—to predict both accuracy and inference cost. It resides in the Architecture-Conditional Scaling Laws leaf, which contains only two papers including this one. This represents a relatively sparse research direction within the broader Scaling Laws and Predictive Modeling branch, suggesting the integration of architectural factors into scaling law frameworks remains an emerging area despite the maturity of compute-optimal scaling research.
The taxonomy reveals neighboring leaves focused on Training Compute-Optimal Scaling Laws (architecture-agnostic Chinchilla-style analyses) and Inference-Aware Scaling Laws (optimizing for deployment costs without explicit architectural conditioning). The paper bridges these directions by making architectural choices explicit predictors of performance under inference constraints. Nearby branches like Model Architecture Design for Efficiency explore structural innovations (MoE, hybrid models, attention mechanisms) but typically lack unified predictive frameworks. The scope_note for Architecture-Conditional Scaling Laws explicitly excludes architecture-agnostic approaches, positioning this work as addressing a gap between theoretical scaling and practical architectural diversity.
Across the three contributions, 20 candidates were examined in total. For the conditional scaling law itself, one of the 10 candidates examined was refutable, indicating that some prior work on architecture-aware prediction exists within the limited search scope. For the search framework that identifies inference-efficient architectures, none of the 10 candidates examined was refutable, suggesting this systematic optimization approach may be less explored. The characterization of architectural factors' impact was not examined against candidates. These statistics reflect a targeted rather than exhaustive literature search, and the single refutable pair for the core contribution suggests its specific formulation may overlap with existing architecture-conditional frameworks.
Based on the limited search scope of 20 candidates, the work appears to occupy a moderately novel position by unifying architectural conditioning with inference-aware optimization in a single predictive framework. The sparse population of its taxonomy leaf and the absence of refutable candidates for the search framework component suggest potential novelty in the systematic approach, though the core scaling law formulation shows some overlap with prior architecture-aware prediction efforts. The analysis does not cover exhaustive comparison with all Chinchilla extensions or architecture search methods beyond the top-K semantic matches examined.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a conditional extension of the Chinchilla scaling laws that incorporates architectural parameters such as hidden size, MLP-to-attention ratio, and grouped-query attention. This framework enables predicting model performance while accounting for architectural design choices.
The authors develop a systematic framework (Algorithm 1) that uses the conditional scaling law to identify model architectures optimizing both inference efficiency and accuracy under fixed parameter and token budgets, including a local search procedure for grouped-query attention.
The authors systematically study how hidden size, MLP-to-attention ratio, and GQA affect both inference throughput and model accuracy by training over 200 models ranging from 80M to 3B parameters, revealing U-shaped relationships between these factors and training loss.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] Scaling Inference-Efficient Language Models
Contribution Analysis
Detailed comparisons for each claimed contribution
Conditional scaling law augmenting Chinchilla with architectural factors
The authors propose a conditional extension of the Chinchilla scaling laws that incorporates architectural parameters such as hidden size, MLP-to-attention ratio, and grouped-query attention. This framework enables predicting model performance while accounting for architectural design choices.
[8] Scaling Inference-Efficient Language Models
[51] Observational scaling laws and the predictability of language model performance
[52] Physics of language models: Part 3.3, knowledge capacity scaling laws
[53] CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
[54] Slamming: Training a Speech Language Model on One GPU in a Day
[55] AI and Memory Wall
[56] Scaling laws with vocabulary: Larger models deserve larger vocabularies
[57] Farseer: A Refined Scaling Law in Large Language Models
[58] Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
[59] Collaborative performance prediction for large language models
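To make the shape of such a conditional law concrete, here is a minimal illustrative sketch. Everything in it is an assumption for illustration: the Chinchilla-style constants are ballpark values rather than this paper's fitted coefficients, and the architecture-dependent terms (quadratic, i.e. U-shaped, in log hidden size and log MLP-to-attention ratio, plus a mild GQA penalty) only mimic the qualitative trends the paper reports, not its exact parameterization.

```python
import math

# Illustrative Chinchilla-style constants (NOT this paper's fitted values).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Architecture-agnostic Chinchilla loss estimate L(N, D)."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def conditional_loss(n_params: float, n_tokens: float,
                     hidden_size: int, mlp_attn_ratio: float,
                     gqa_groups: int) -> float:
    """Hypothetical conditional extension L(N, D | architecture).

    Adds quadratic (U-shaped) penalties in log(hidden_size) and
    log(mlp_attn_ratio) around assumed optima, plus a small penalty for
    aggressive KV-head sharing. All coefficients here are made up."""
    h_pen = 0.010 * math.log(hidden_size / 2048) ** 2
    r_pen = 0.020 * math.log(mlp_attn_ratio / 4.0) ** 2
    g_pen = 0.005 * math.log(gqa_groups)  # trades a little loss for speed
    return chinchilla_loss(n_params, n_tokens) + h_pen + r_pen + g_pen
```

The additive structure makes the conditional law collapse to the plain Chinchilla estimate at the assumed architectural optimum, which is the property that lets it rank architectures at a fixed (N, D) budget.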
Search framework for inference-efficient and accurate architectures
The authors develop a systematic framework (Algorithm 1) that uses the conditional scaling law to identify model architectures optimizing both inference efficiency and accuracy under fixed parameter and token budgets, including a local search procedure for grouped-query attention.
[60] Efficient neural architecture search via parameters sharing
[61] MARCO: Hardware-Aware Neural Architecture Search for Edge Devices with Multi-Agent Reinforcement Learning and Conformal Prediction Filtering
[62] Neural architecture search for resource constrained hardware devices: A survey
[63] Sparse: Sparse architecture search for cnns on resource-constrained microcontrollers
[64] Mixed precision neural architecture search for energy efficient deep learning
[65] Semi-supervised neural architecture search
[66] Automated UAV Object Detector Design Using Large Language Model-Guided Architecture Search
[67] Neural architecture search as multiobjective optimization benchmarks: Problem formulation and performance assessment
[68] Hao: Hardware-aware neural architecture optimization for efficient inference
[69] Faststereonet: A fast neural architecture search for improving the inference of disparity estimation on resource-limited platforms
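In the same spirit, the search framework can be sketched as a small enumeration with a local search over GQA. Everything below is a hypothetical stand-in: `predicted_loss` and `predicted_throughput` are toy functions, not the paper's fitted scaling law or cost model, and Algorithm 1's exact acceptance criteria are not reproduced here; only the overall structure (enumerate candidates under fixed budgets, locally adjust GQA, keep the fastest architecture meeting an accuracy target) is illustrated.

```python
import math
from itertools import product

def predicted_loss(hidden: int, ratio: float, gqa: int) -> float:
    """Stand-in for the fitted conditional scaling law (toy quadratic)."""
    return (2.0
            + 0.010 * math.log(hidden / 2048) ** 2
            + 0.020 * math.log(ratio / 4.0) ** 2
            + 0.005 * math.log(gqa))

def predicted_throughput(hidden: int, ratio: float, gqa: int) -> float:
    """Toy inference-cost model: more KV-head sharing shrinks the cache."""
    return gqa * 1000.0 / (hidden * (1 + ratio))

def search(hidden_grid, ratio_grid, gqa_options, max_loss):
    """Sketch of the framework's spirit: enumerate (hidden, ratio) pairs,
    locally search GQA group counts, and keep the highest-throughput
    architecture whose predicted loss stays under the target."""
    best = None
    for hidden, ratio in product(hidden_grid, ratio_grid):
        # Local search over GQA: take the largest group count (fastest
        # inference) that still meets the accuracy constraint.
        for gqa in sorted(gqa_options, reverse=True):
            if predicted_loss(hidden, ratio, gqa) <= max_loss:
                cand = (predicted_throughput(hidden, ratio, gqa),
                        hidden, ratio, gqa)
                if best is None or cand > best:
                    best = cand
                break
    return best
```

Because the scaling law is evaluated analytically, the whole sweep costs microseconds per candidate, which is what makes this kind of search practical compared with training-based architecture search.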
Characterization of architectural factors' impact on inference and accuracy
The authors systematically study how hidden size, MLP-to-attention ratio, and GQA affect both inference throughput and model accuracy by training over 200 models ranging from 80M to 3B parameters, revealing U-shaped relationships between these factors and training loss.
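A U-shaped relationship of this kind lends itself to a simple diagnostic: fit a quadratic in log space and read off the loss-minimizing setting. The measurements below are synthetic placeholders, not the paper's data; only the procedure is the point.

```python
import numpy as np

# Synthetic (hidden size, training loss) points exhibiting a U-shape in
# log space -- placeholder values, NOT measurements from the paper.
hidden_sizes = np.array([512, 1024, 2048, 4096, 8192])
train_loss = np.array([2.45, 2.31, 2.28, 2.33, 2.49])

# Fit loss ~ a*log(h)^2 + b*log(h) + c and recover the parabola's vertex.
a, b, c = np.polyfit(np.log(hidden_sizes), train_loss, 2)
h_opt = float(np.exp(-b / (2 * a)))  # loss-minimizing hidden size
```

A positive leading coefficient `a` confirms the U-shape, and the vertex gives the architecturally optimal hidden size at that parameter budget, the same quantity the conditional scaling law is meant to predict without training the full sweep.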