Harnessing Optimization Dynamics for Curvature-Informed Model Merging

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models (LLMs), Model Merging, Pruning, Post-training
Abstract:

Model merging is an effective strategy for composing capabilities in large language models without costly joint retraining. We study this process in the supervised fine-tuning (SFT) stage, consolidating multiple checkpoints specialized for distinct capabilities (e.g., math, coding, and precise instruction following) into a single model. First, we introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware method for mitigating task interference: it uses optimizer second-moment statistics as a diagonal curvature proxy to first prune each task vector with our Fast Fisher Grafting (FFG) technique and then reweight the pruned vector. When merging diverse, capability-based checkpoints, OTA improves the merged model's performance over strong baseline methods, as evaluated on unseen capability-based benchmarks. Second, we conduct a comprehensive, theoretically inspired empirical analysis to explain the effectiveness of OTA. Surprisingly, this analysis reveals that FFG implicitly induces a layer- and role-aware pruning mechanism that maintains fine-tuning performance at much more aggressive pruning ratios than magnitude pruning, and that exhibits interpretable task-localization properties. Third, an extensive comparison of our curvature proxy across capability checkpoints shows that experts converge to a basin with substantial curvature similarity, offering a novel lens on why simple linear merging can be effective in practice; this result also corroborates our ablation study, which shows that FFG is critical for merging performance. Finally, we develop a memory-light variant of OTA that efficiently compresses the second moments, mitigating the additional storage requirements of our method and improving scalability.
We make all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints accessible through an anonymized repository at \url{https://github.com/anon123ota-dotcom/ota-ffg}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Optimization Trajectory Aware (OTA) Merging, which uses optimizer second-moment statistics as a curvature proxy to prune and reweight task vectors when merging capability-specialized checkpoints. Within the taxonomy, it resides in the 'Optimization Trajectory-Aware Merging' leaf under 'Fisher Information-Based Merging'. This leaf contains only one sibling paper, indicating a relatively sparse research direction. The broader 'Fisher Information-Based Merging' branch includes three leaves (Fisher-Weighted Averaging, Alignment-Preserving, and Trajectory-Aware), suggesting moderate activity in curvature-aware merging but limited exploration of trajectory-based approaches specifically.

The paper's closest neighbors are Fisher-Weighted Averaging methods that use static Fisher matrices and Alignment-Preserving approaches that maintain safety constraints during merging. The taxonomy shows that trajectory-aware methods diverge from static Fisher approaches by incorporating optimization dynamics rather than post-hoc curvature estimates. The 'Geometry-Aware Regularization' branch addresses training-time constraints, while 'Geometry-Agnostic' methods avoid curvature modeling entirely. OTA bridges trajectory information and curvature approximation, positioning itself between static Fisher methods and pure momentum-based approaches, though the field remains relatively underpopulated in this specific intersection.

Among the 23 candidate papers examined in total, the core OTA framework with Fast Fisher Grafting has one refutable candidate out of the three retrieved for it, suggesting some prior-work overlap in the core methodology. The empirical analysis of FFG's layer-wise pruning mechanism was compared against ten candidates with zero refutations, indicating that this contribution appears more novel within the limited search scope. The memory-efficient rank-one compression variant was also compared against ten candidates, with one refutable match, suggesting moderate prior exploration of compression techniques. These statistics indicate that while the core framework has some precedent, the mechanistic analysis and the specific pruning insights may represent less-explored territory among the papers reviewed.

Based on the top-23 semantic matches examined, the work appears to occupy a moderately novel position, particularly in its empirical analysis of pruning mechanisms. The limited sibling papers in the trajectory-aware leaf and the sparse refutation rate for the mechanistic contributions suggest the approach extends existing curvature-aware merging in relatively unexplored directions. However, the analysis does not cover the full breadth of model merging literature, and the presence of refutable candidates for the core framework indicates that related trajectory-based or second-moment methods exist in the broader field.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 2

Research Landscape Overview

Core task: Curvature-aware merging of fine-tuned language model checkpoints. The field addresses how to combine multiple fine-tuned models into a single checkpoint that preserves or enhances performance across tasks.

The taxonomy reveals four main branches. Curvature-Aware Parameter Merging Methods leverage geometric information, such as Fisher information or optimization trajectories, to weight parameters during merging, exemplified by Fisher Weighted Averaging[3] and approaches that track momentum or curvature along training paths. Geometry-Aware Regularization and Stability focuses on maintaining desirable loss landscape properties during training or merging, often through manifold-based constraints as seen in Sequential Manifold Regularization[2]. Geometry-Agnostic Merging and Composition includes simpler averaging schemes like Geometric Median Merging[9] or latent-space methods such as Latent Merging[10], which do not explicitly model curvature. Specialized Merging Applications targets domain-specific scenarios, including agent-based systems like Agent Dice[5] or alignment-focused methods such as AlignMerge[6], where merging serves particular downstream goals beyond general multi-task performance.

A particularly active line of work explores how to incorporate second-order geometry into merging strategies, balancing computational cost against the fidelity of the merged model. Fisher information-based approaches like Fisher Weighted Averaging[3] provide principled weighting but can be expensive to compute at scale, while trajectory-aware methods such as Momentum Aware Optimization[8] attempt to capture optimization dynamics more efficiently. Curvature Informed Merging[0] sits within this Fisher and trajectory-aware cluster, emphasizing how curvature along the optimization path can guide parameter interpolation.
Compared to Fisher Weighted Averaging[3], which relies on static Fisher matrices, and Momentum Aware Optimization[8], which tracks first-order momentum, Curvature Informed Merging[0] aims to integrate richer geometric signals from the loss landscape. Open questions remain around scalability, the trade-off between geometric fidelity and computational overhead, and how these curvature-based techniques generalize across diverse fine-tuning regimes and model architectures.
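As a point of reference for the Fisher-based branch discussed above, here is a minimal sketch of diagonal Fisher-weighted averaging. The function name and the use of per-parameter diagonal Fisher estimates as merging weights are illustrative assumptions, not the cited paper's exact implementation:

```python
import numpy as np

def fisher_weighted_average(params, fishers, eps=1e-8):
    """Diagonal Fisher-weighted averaging (sketch): merge expert checkpoints
    with a per-parameter weighted mean, where each expert's weight is its
    diagonal Fisher estimate for that parameter."""
    num = sum(f * p for f, p in zip(fishers, params))
    return num / (sum(fishers) + eps)
```

Parameters where one expert has much higher Fisher mass (i.e., where its loss is locally more curved) dominate the merged value, which is the basic intuition the static-Fisher methods rely on.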

Claimed Contributions

Optimization Trajectory Aware (OTA) Merging framework with Fast Fisher Grafting (FFG)

The authors propose a two-stage model merging framework that leverages Adam optimizer second-moment statistics as a curvature proxy. FFG first identifies and reverts noisy parameter updates using saliency-based pruning, then OTA aggregates the denoised experts via curvature-aware weighting to mitigate task interference when merging specialized checkpoints.

3 retrieved papers · Can Refute
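A minimal sketch of how the two stages described above could fit together, assuming the second moments v and task vectors (fine-tuned minus base weights) are available per parameter tensor. The function names, the saliency form sqrt(v) * |delta|, and the curvature-weighted averaging rule are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def ffg_prune(task_vector, second_moment, keep_ratio=0.1):
    """Fast Fisher Grafting (sketch): keep only the task-vector entries with
    the highest curvature-weighted saliency and revert (zero out) the rest.
    The saliency form sqrt(v) * |delta| is an assumed stand-in."""
    saliency = np.sqrt(second_moment) * np.abs(task_vector)
    k = max(1, int(keep_ratio * saliency.size))
    threshold = np.partition(saliency.ravel(), -k)[-k]  # k-th largest saliency
    return np.where(saliency >= threshold, task_vector, 0.0)

def ota_merge(base, task_vectors, second_moments, keep_ratio=0.1, eps=1e-8):
    """OTA merging (sketch): curvature-weighted average of the pruned task
    vectors, added back onto the shared base checkpoint."""
    num = sum(v * ffg_prune(t, v, keep_ratio)
              for v, t in zip(second_moments, task_vectors))
    den = sum(second_moments) + eps
    return base + num / den
```

In this sketch the second moments play both roles named in the contribution: they drive the pruning threshold inside `ffg_prune` and supply the per-parameter weights in the final aggregation.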
Empirical analysis revealing FFG's implicit layer-wise and role-wise pruning mechanism

The authors conduct a comprehensive empirical study showing that FFG induces structured sparsity patterns with layer-depth and weight-type awareness. This mechanism aggressively prunes query and key layers while preserving value and output projections, maintaining performance at higher sparsity than magnitude pruning and revealing interpretable task localization.

10 retrieved papers
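The role-wise sparsity statistics described above can be computed from FFG keep-masks with a simple grouping pass. The sketch below assumes Hugging Face-style parameter names containing `q_proj`, `k_proj`, `v_proj`, and `o_proj`; the naming convention and mask layout are assumptions:

```python
import re
from collections import defaultdict

def sparsity_by_role(masks):
    """Group boolean keep-masks by attention-projection role and report the
    pruned fraction per role. `masks` maps parameter names (e.g.
    'model.layers.3.self_attn.q_proj.weight') to boolean NumPy arrays where
    True means the task-vector entry is kept."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for name, mask in masks.items():
        m = re.search(r"(q_proj|k_proj|v_proj|o_proj)", name)
        if m is None:
            continue  # skip MLP, embedding, and norm parameters
        kept[m.group(1)] += int(mask.sum())
        total[m.group(1)] += mask.size
    return {role: 1.0 - kept[role] / total[role] for role in total}
```

Under the paper's finding, such a tally would show markedly higher pruned fractions for `q_proj` and `k_proj` than for `v_proj` and `o_proj`.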
Memory-efficient variant using rank-one compression of second moments

The authors introduce an AdaFactor-inspired compression technique that stores only row-wise and column-wise sums of second-moment tensors, reconstructing a rank-one approximation at runtime. This reduces storage overhead from model-size scale to minimal requirements while maintaining merging performance.

10 retrieved papers · Can Refute
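The AdaFactor-inspired scheme described above stores only marginal sums and reconstructs a rank-one approximation at merge time. A sketch, where the normalization by the total sum follows the AdaFactor factorization; the reconstruction is exact whenever the second-moment matrix is itself rank-one with nonnegative entries, and approximate otherwise:

```python
import numpy as np

def compress_second_moment(v):
    """AdaFactor-style factored storage: keep only the row-wise and
    column-wise sums of the (nonnegative) second-moment matrix v."""
    return v.sum(axis=1), v.sum(axis=0)

def reconstruct_second_moment(row_sums, col_sums, eps=1e-30):
    """Rank-one reconstruction: V_hat = outer(row_sums, col_sums) / total.
    Storage drops from O(m*n) per matrix to O(m + n)."""
    return np.outer(row_sums, col_sums) / max(row_sums.sum(), eps)
```

For an m-by-n weight matrix this replaces m*n stored statistics with m + n, which is the source of the memory savings claimed for the variant.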

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Optimization Trajectory Aware (OTA) Merging framework with Fast Fisher Grafting (FFG)


Contribution

Empirical analysis revealing FFG's implicit layer-wise and role-wise pruning mechanism


Contribution

Memory-efficient variant using rank-one compression of second moments
