Harnessing Optimization Dynamics for Curvature-Informed Model Merging

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models (LLMs), Model Merging, Pruning, Post-training
Abstract:

Model merging is an effective strategy for composing capabilities in large language models without costly joint retraining. We study this process in the supervised fine-tuning (SFT) stage, consolidating multiple checkpoints specialized for distinct capabilities (e.g., math, coding, and precise instruction following) into a single model. First, we introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware method for mitigating task interference: it uses optimizer second-moment statistics as a diagonal curvature proxy to first prune each task vector with our Fast Fisher Grafting (FFG) technique and then reweight the pruned vector. When merging diverse, capability-based checkpoints, OTA improves the merged model's performance over strong baseline methods, as evaluated on unseen capability-based benchmarks. Second, we conduct a comprehensive, theoretically inspired empirical analysis to explain the effectiveness of OTA. Surprisingly, this analysis reveals that FFG implicitly induces a layer- and role-aware pruning mechanism that maintains fine-tuning performance at much more aggressive pruning ratios than magnitude pruning, and that exhibits interpretable task-localization properties. Third, an extensive comparison of our curvature proxy across capability checkpoints shows that experts converge to a basin with substantial curvature similarity, offering a novel lens on why simple linear merging can be effective in practice; this result also corroborates our ablation study, which shows that FFG is critical for merging performance. Finally, we develop a memory-light variant of OTA that efficiently compresses the second moments, mitigating the additional storage requirements of our method and improving scalability.
We make all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints accessible through an anonymized repository at \url{https://github.com/anon123ota-dotcom/ota-ffg}.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Optimization Trajectory Aware (OTA) Merging, which uses optimizer second-moment statistics as a curvature proxy to prune and reweight task vectors when merging capability-specialized checkpoints. Within the taxonomy, it resides in the 'Optimization Trajectory-Aware Merging' leaf under 'Fisher Information-Based Merging'. This leaf contains only one sibling paper, indicating a relatively sparse research direction. The broader 'Fisher Information-Based Merging' branch includes three leaves (Fisher-Weighted Averaging, Alignment-Preserving, and Trajectory-Aware), suggesting moderate activity in curvature-aware merging but limited exploration of trajectory-based approaches specifically.

The paper's closest neighbors are Fisher-Weighted Averaging methods that use static Fisher matrices and Alignment-Preserving approaches that maintain safety constraints during merging. The taxonomy shows that trajectory-aware methods diverge from static Fisher approaches by incorporating optimization dynamics rather than post-hoc curvature estimates. The 'Geometry-Aware Regularization' branch addresses training-time constraints, while 'Geometry-Agnostic' methods avoid curvature modeling entirely. OTA bridges trajectory information and curvature approximation, positioning itself between static Fisher methods and pure momentum-based approaches, though the field remains relatively underpopulated in this specific intersection.

Among the 23 candidate papers examined in total, the core OTA framework with Fast Fisher Grafting has one refutable candidate out of the three retrieved for it, suggesting some prior-work overlap in the core methodology. The empirical analysis of FFG's layer-wise pruning mechanism was compared against ten candidates with zero refutations, indicating that this contribution appears more novel within the limited search scope. The memory-efficient rank-one compression variant was also compared against ten candidates, with one refutable match, suggesting moderate prior exploration of compression techniques. These statistics indicate that while the core framework has some precedent, the mechanistic analysis and the specific pruning insights may represent less-explored territory among the papers reviewed.

Based on the top-23 semantic matches examined, the work appears to occupy a moderately novel position, particularly in its empirical analysis of pruning mechanisms. The limited sibling papers in the trajectory-aware leaf and the sparse refutation rate for the mechanistic contributions suggest the approach extends existing curvature-aware merging in relatively unexplored directions. However, the analysis does not cover the full breadth of model merging literature, and the presence of refutable candidates for the core framework indicates that related trajectory-based or second-moment methods exist in the broader field.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 2

Research Landscape Overview

Core task: Curvature-aware merging of fine-tuned language model checkpoints. The field addresses how to combine multiple fine-tuned models into a single checkpoint that preserves or enhances performance across tasks.

The taxonomy reveals four main branches. Curvature-Aware Parameter Merging Methods leverage geometric information, such as Fisher information or optimization trajectories, to weight parameters during merging, exemplified by Fisher Weighted Averaging[3] and approaches that track momentum or curvature along training paths. Geometry-Aware Regularization and Stability focuses on maintaining desirable loss landscape properties during training or merging, often through manifold-based constraints as seen in Sequential Manifold Regularization[2]. Geometry-Agnostic Merging and Composition includes simpler averaging schemes like Geometric Median Merging[9] or latent-space methods such as Latent Merging[10], which do not explicitly model curvature. Specialized Merging Applications targets domain-specific scenarios, including agent-based systems like Agent Dice[5] or alignment-focused methods such as AlignMerge[6], where merging serves particular downstream goals beyond general multi-task performance.

A particularly active line of work explores how to incorporate second-order geometry into merging strategies, balancing computational cost against the fidelity of the merged model. Fisher information-based approaches like Fisher Weighted Averaging[3] provide principled weighting but can be expensive to compute at scale, while trajectory-aware methods such as Momentum Aware Optimization[8] attempt to capture optimization dynamics more efficiently. Curvature Informed Merging[0] sits within this Fisher and trajectory-aware cluster, emphasizing how curvature along the optimization path can guide parameter interpolation.
Compared to Fisher Weighted Averaging[3], which relies on static Fisher matrices, and Momentum Aware Optimization[8], which tracks first-order momentum, Curvature Informed Merging[0] aims to integrate richer geometric signals from the loss landscape. Open questions remain around scalability, the trade-off between geometric fidelity and computational overhead, and how these curvature-based techniques generalize across diverse fine-tuning regimes and model architectures.
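As a point of reference for the Fisher-based branch discussed above, here is a minimal sketch of diagonal Fisher-weighted averaging. The function name and the use of per-parameter diagonal Fisher estimates as merging weights are illustrative assumptions, not the cited paper's exact implementation:

```python
import numpy as np

def fisher_weighted_average(params, fishers, eps=1e-8):
    """Diagonal Fisher-weighted averaging (sketch): merge expert checkpoints
    with a per-parameter weighted mean, where each expert's weight is its
    diagonal Fisher estimate for that parameter."""
    num = sum(f * p for f, p in zip(fishers, params))
    return num / (sum(fishers) + eps)
```

Parameters where one expert has much higher Fisher mass (i.e., where its loss is locally more curved) dominate the merged value, which is the basic intuition the static-Fisher methods rely on.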

Claimed Contributions

Optimization Trajectory Aware (OTA) Merging framework with Fast Fisher Grafting (FFG)

The authors propose a two-stage model merging framework that leverages Adam optimizer second-moment statistics as a curvature proxy. FFG first identifies and reverts noisy parameter updates using saliency-based pruning, then OTA aggregates the denoised experts via curvature-aware weighting to mitigate task interference when merging specialized checkpoints.

3 retrieved papers · Can Refute
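A minimal sketch of how the two stages described above could fit together, assuming the second moments v and task vectors (fine-tuned minus base weights) are available per parameter tensor. The function names, the saliency form sqrt(v) * |delta|, and the curvature-weighted averaging rule are illustrative assumptions, not the authors' exact algorithm:

```python
import numpy as np

def ffg_prune(task_vector, second_moment, keep_ratio=0.1):
    """Fast Fisher Grafting (sketch): keep only the task-vector entries with
    the highest curvature-weighted saliency and revert (zero out) the rest.
    The saliency form sqrt(v) * |delta| is an assumed stand-in."""
    saliency = np.sqrt(second_moment) * np.abs(task_vector)
    k = max(1, int(keep_ratio * saliency.size))
    threshold = np.partition(saliency.ravel(), -k)[-k]  # k-th largest saliency
    return np.where(saliency >= threshold, task_vector, 0.0)

def ota_merge(base, task_vectors, second_moments, keep_ratio=0.1, eps=1e-8):
    """OTA merging (sketch): curvature-weighted average of the pruned task
    vectors, added back onto the shared base checkpoint."""
    num = sum(v * ffg_prune(t, v, keep_ratio)
              for v, t in zip(second_moments, task_vectors))
    den = sum(second_moments) + eps
    return base + num / den
```

In this sketch the second moments play both roles named in the contribution: they drive the pruning threshold inside `ffg_prune` and supply the per-parameter weights in the final aggregation.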
Empirical analysis revealing FFG's implicit layer-wise and role-wise pruning mechanism

The authors conduct a comprehensive empirical study showing that FFG induces structured sparsity patterns with layer-depth and weight-type awareness. This mechanism aggressively prunes query and key layers while preserving value and output projections, maintaining performance at higher sparsity than magnitude pruning and revealing interpretable task localization.

10 retrieved papers
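The role-wise sparsity statistics described above can be computed from FFG keep-masks with a simple grouping pass. The sketch below assumes Hugging Face-style parameter names containing `q_proj`, `k_proj`, `v_proj`, and `o_proj`; the naming convention and mask layout are assumptions:

```python
import re
from collections import defaultdict

def sparsity_by_role(masks):
    """Group boolean keep-masks by attention-projection role and report the
    pruned fraction per role. `masks` maps parameter names (e.g.
    'model.layers.3.self_attn.q_proj.weight') to boolean NumPy arrays where
    True means the task-vector entry is kept."""
    kept = defaultdict(int)
    total = defaultdict(int)
    for name, mask in masks.items():
        m = re.search(r"(q_proj|k_proj|v_proj|o_proj)", name)
        if m is None:
            continue  # skip MLP, embedding, and norm parameters
        kept[m.group(1)] += int(mask.sum())
        total[m.group(1)] += mask.size
    return {role: 1.0 - kept[role] / total[role] for role in total}
```

Under the paper's finding, such a tally would show markedly higher pruned fractions for `q_proj` and `k_proj` than for `v_proj` and `o_proj`.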
Memory-efficient variant using rank-one compression of second moments

The authors introduce an AdaFactor-inspired compression technique that stores only row-wise and column-wise sums of second-moment tensors, reconstructing a rank-one approximation at runtime. This reduces storage overhead from model-size scale to minimal requirements while maintaining merging performance.

10 retrieved papers · Can Refute
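The AdaFactor-inspired scheme described above stores only marginal sums and reconstructs a rank-one approximation at merge time. A sketch, where the normalization by the total sum follows the AdaFactor factorization; the reconstruction is exact whenever the second-moment matrix is itself rank-one with nonnegative entries, and approximate otherwise:

```python
import numpy as np

def compress_second_moment(v):
    """AdaFactor-style factored storage: keep only the row-wise and
    column-wise sums of the (nonnegative) second-moment matrix v."""
    return v.sum(axis=1), v.sum(axis=0)

def reconstruct_second_moment(row_sums, col_sums, eps=1e-30):
    """Rank-one reconstruction: V_hat = outer(row_sums, col_sums) / total.
    Storage drops from O(m*n) per matrix to O(m + n)."""
    return np.outer(row_sums, col_sums) / max(row_sums.sum(), eps)
```

For an m-by-n weight matrix this replaces m*n stored statistics with m + n, which is the source of the memory savings claimed for the variant.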

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Optimization Trajectory Aware (OTA) Merging framework with Fast Fisher Grafting (FFG)


Contribution

Empirical analysis revealing FFG's implicit layer-wise and role-wise pruning mechanism


Contribution

Memory-efficient variant using rank-one compression of second moments
