HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Low-rank Adaptation, Multi-head Self-attention, Mixture of Experts, Hypernetworks
Abstract:

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique that adapts large pre-trained models by representing their weight updates as products of low-rank matrices. When fine-tuning multi-head self-attention (MHA), however, LoRA is applied to each attention head separately, thereby overlooking potential synergies across different heads. To mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA) method, which utilizes joint hypernetworks to generate the low-rank matrices across attention heads. By coupling head-wise adaptation through a shared generator, HoRA encourages cross-head information sharing and thus directly addresses the aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens of hierarchical mixture of experts, our theoretical findings reveal that HoRA achieves superior sample efficiency. Furthermore, through extensive experiments across diverse language and vision benchmarks, we demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring only a marginal increase in the number of trainable parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes HoRA, a cross-head low-rank adaptation method that uses joint hypernetworks to generate low-rank matrices across attention heads, enabling information sharing during fine-tuning. According to the taxonomy, this work occupies the 'Cross-Head Low-Rank Adaptation' leaf under 'Low-Rank Adaptation Methods', where it currently appears as the sole paper. This positioning suggests the paper addresses a relatively sparse research direction within the broader low-rank adaptation landscape, which contains multiple active leaves including standard per-head methods, orthogonality-constrained approaches, and multimodal adaptations.

The taxonomy reveals that HoRA sits adjacent to several related but distinct approaches. Its immediate neighbors include 'Standard Low-Rank Adaptation' methods that apply decomposition independently per head (e.g., Vision Transformer adaptations, Serial decomposition), 'Orthogonality-Constrained' methods that enforce structural properties, and 'Semantic-Guided' approaches that incorporate input semantics. The taxonomy's scope notes clarify that cross-head coupling distinguishes this work from standard per-head methods, while its hypernetwork-based sharing mechanism differentiates it from multi-task adapter routing strategies found in the 'Adapter-Based Methods' branch.

Among 16 candidates examined across three contributions, the analysis found limited prior-work overlap. For the core HoRA method (Contribution 1), 1 candidate was examined with no refutations, suggesting novelty in the specific hypernetwork-based cross-head coupling mechanism. For the theoretical connection to hierarchical mixture of experts (Contribution 2), 9 candidates were examined with 1 refutable match, indicating that some existing theoretical frameworks may overlap. For the sample-efficiency claim (Contribution 3), 6 candidates were examined without refutations. These statistics reflect a focused semantic-search scope rather than exhaustive coverage, and the sparse 'Cross-Head' leaf suggests this direction has received limited prior attention.

Given the limited search scope of 16 candidates and the paper's placement in a currently unpopulated taxonomy leaf, the work appears to explore a relatively underexplored mechanism for cross-head information sharing in low-rank adaptation. However, the analysis cannot rule out relevant prior work outside the examined candidate set, particularly in adjacent areas like multi-task adapter sharing or cross-layer smoothness exploitation that employ related coupling principles through different architectural choices.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 1

Research Landscape Overview

Core task: parameter-efficient fine-tuning of multi-head self-attention mechanisms. The field has organized itself around several complementary strategies for adapting large pre-trained models without full retraining. Low-Rank Adaptation Methods decompose weight updates into compact factorizations, enabling efficient parameter updates across attention layers; works like Orthogonal Fine-tuning[3] and Householder Transformation[5] explore structured low-rank constraints. Prompt-Based Tuning Methods inject learnable tokens or attention-level prompts (e.g., Attention Prompt Tuning[11]) to steer model behavior with minimal added parameters. Adapter-Based Methods insert small bottleneck modules between transformer blocks, as seen in AdaptFormer[9] and AdaViT[8], while Attention Mechanism Modification directly restructures attention computations, for instance Alternating Attention[17] and Isomorphic Attention[20]. Dynamic and Adaptive Methods adjust tuning strategies on the fly (Dynamic Tuning[41], Adaptive Layer Selection[38]), Specialized Application Methods target domain-specific tasks like medical imaging (LiteMedSAM[31]) or action recognition (Action Recognition[28]), and Efficiency-Focused Optimization Methods prioritize inference speed and memory footprint (Economical Inference[4], Time-Memory Efficient[30]).

A particularly active line of research centers on low-rank factorizations that exploit cross-head or cross-layer structure, balancing expressiveness with compactness. HoRA[0] exemplifies this direction by introducing cross-head low-rank adaptation, sharing decomposition factors across multiple attention heads to reduce redundancy. This approach contrasts with methods like Trainable Self-Attention[1], which modifies attention weights more directly, and PARA[2], which applies parameter-efficient updates in a different structural regime.

Meanwhile, orthogonal constraints (Orthogonal Fine-tuning[3]) and transformation-based parameterizations (Householder Transformation[5]) offer alternative ways to maintain model stability and generalization during adaptation. The interplay between rank selection, head-wise sharing, and layer-specific tuning remains an open question, with HoRA[0] positioned among works that seek to exploit redundancy in multi-head architectures while preserving the expressive power needed for diverse downstream tasks.

Claimed Contributions

HoRA method with joint hypernetworks for cross-head information sharing

The authors introduce HoRA, a parameter-efficient fine-tuning technique that uses shared hypernetworks to generate low-rank adaptation matrices across multiple attention heads. This design encourages cross-head information sharing and addresses the limitation of LoRA, which adapts each attention head independently without coordination.

1 retrieved paper
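
The shared-generator mechanism described above can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: all sizes (`d_model`, `n_heads`, `r`, `d_emb`) and the single-linear-layer form of the hypernetwork are assumptions. The key point it shows is that every head's low-rank factors come out of the same generator weights, so the heads' adaptations are coupled rather than independent.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, r = 64, 4, 2      # toy sizes (assumed, not from the paper)
d_head = d_model // n_heads         # 16
d_emb = 8                           # per-head embedding size (assumed)

# Trainable pieces: one small embedding per head, plus a SHARED linear
# hypernetwork (W_A, W_B) that maps an embedding to that head's factors.
head_emb = rng.normal(size=(n_heads, d_emb))
W_A = rng.normal(size=(d_emb, r * d_model)) * 0.02
W_B = rng.normal(size=(d_emb, d_head * r)) * 0.02

def generate_updates():
    """Generate each head's low-rank update Delta W_h = B_h @ A_h
    from the shared generator; gradients w.r.t. W_A/W_B would mix
    information from all heads during fine-tuning."""
    updates = []
    for e in head_emb:
        A = (e @ W_A).reshape(r, d_model)   # r x d_model
        B = (e @ W_B).reshape(d_head, r)    # d_head x r
        updates.append(B @ A)               # d_head x d_model, rank <= r
    return updates

deltas = generate_updates()
```

In a per-head LoRA baseline each `A_h`, `B_h` would be a free parameter; here they are functions of the shared `W_A`, `W_B`, which is what enables the cross-head information sharing the contribution claims.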
Theoretical connection between multi-head LoRA and hierarchical mixture of experts

The authors formalize a theoretical relationship showing that applying LoRA to multi-head self-attention can be reinterpreted as a Hierarchical Mixture-of-Experts model. This perspective provides a principled foundation for understanding and improving parameter-efficient fine-tuning in multi-head attention.

9 retrieved papers
Can Refute
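
One way to make the claimed reinterpretation concrete (the notation below is assumed for illustration, not taken from the paper) is to note that a single attention head already has mixture-of-experts form, and the stack of heads adds a second level of mixing:

```latex
% Single head h at query position i: a softmax-gated mixture over positions j,
% where each "expert" is the (low-rank-adapted) value projection of token x_j.
\mathrm{head}_h(x_i)
  = \sum_{j} \underbrace{\mathrm{softmax}_j\!\big(q_{h,i}^{\top} k_{h,j}/\sqrt{d}\big)}_{\text{gate } g_{h,j}(x)}
    \underbrace{\big(W^V_h + B_h A_h\big)\, x_j}_{\text{expert output}}

% Combining the h heads through the output projection adds a second,
% head-level mixing stage, i.e. a two-level (hierarchical) mixture:
\mathrm{MHA}(x_i) = \sum_{h} W^O_h \,\mathrm{head}_h(x_i)
```

Under this reading, per-head LoRA adapts each head's experts independently via its own $B_h A_h$, whereas HoRA ties the experts together by generating all $B_h A_h$ from one shared hypernetwork.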
Sample efficiency improvement from exponential to polynomial rate

The authors prove that HoRA's shared structure across attention heads improves the sample complexity of estimating low-rank matrices from exponential order to polynomial order. This theoretical result demonstrates that parameter sharing yields superior generalization guarantees compared to independent adaptation.

6 retrieved papers
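
The exponential-to-polynomial claim can be read, in hedged illustrative notation (the constants and exponents below are placeholders, not the paper's theorem statement), as a statement about how many samples $n$ are needed to estimate the low-rank factors to accuracy $\epsilon$:

```latex
% Illustrative reading only; eps_n is the estimation error of the low-rank
% factors after n samples, and c, c' > 0 are unspecified constants.
\text{independent per-head LoRA:}\quad
  \epsilon_n = O\!\big((\log n)^{-c}\big)
  \;\Longleftrightarrow\;
  n = \exp\!\big(\Omega(\epsilon^{-1/c})\big),
\qquad
\text{shared-generator HoRA:}\quad
  \epsilon_n = O\!\big(n^{-c'}\big)
  \;\Longleftrightarrow\;
  n = O\!\big(\epsilon^{-1/c'}\big).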

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

HoRA method with joint hypernetworks for cross-head information sharing

The authors introduce HoRA, a parameter-efficient fine-tuning technique that uses shared hypernetworks to generate low-rank adaptation matrices across multiple attention heads. This design encourages cross-head information sharing and addresses the limitation of LoRA, which adapts each attention head independently without coordination.

Contribution

Theoretical connection between multi-head LoRA and hierarchical mixture of experts

The authors formalize a theoretical relationship showing that applying LoRA to multi-head self-attention can be reinterpreted as a Hierarchical Mixture-of-Experts model. This perspective provides a principled foundation for understanding and improving parameter-efficient fine-tuning in multi-head attention.

Contribution

Sample efficiency improvement from exponential to polynomial rate

The authors prove that HoRA's shared structure across attention heads improves the sample complexity of estimating low-rank matrices from exponential order to polynomial order. This theoretical result demonstrates that parameter sharing yields superior generalization guarantees compared to independent adaptation.