LinearSR: Unlocking Linear Attention for Stable and Efficient Image Super-Resolution

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Image Super-Resolution, Linear Attention, Training Stability
Abstract:

Generative models for Image Super-Resolution (SR) are increasingly powerful, yet their reliance on self-attention's quadratic complexity (O(N^2)) creates a major computational bottleneck. Linear Attention offers an O(N) solution, but its promise for photorealistic SR has remained largely untapped, historically hindered by a cascade of interrelated and previously unsolved challenges. This paper introduces LinearSR, a holistic framework that, for the first time, systematically overcomes these critical hurdles. Specifically, we resolve a fundamental training instability that causes catastrophic model divergence using our novel "knee point"-based Early-Stopping Guided Fine-tuning (ESGF) strategy. Furthermore, we mitigate the classic perception-distortion trade-off with a dedicated SNR-based Mixture of Experts (MoE) architecture. Finally, we establish an effective and lightweight guidance paradigm, TAG, derived from our "precision-over-volume" principle. The resulting LinearSR model delivers state-of-the-art perceptual quality with exceptional efficiency: its core diffusion forward pass (1-NFE) achieves SOTA-level speed, while its overall multi-step inference time remains highly competitive. This work provides the first robust methodology for applying Linear Attention in the photorealistic SR domain, establishing a foundational paradigm for future research in efficient generative super-resolution.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

LinearSR introduces a holistic framework combining three components: Early-Stopping Guided Fine-tuning (ESGF) to stabilize training, an SNR-based Mixture of Experts architecture to balance perception and distortion, and a TAG guidance paradigm for efficient inference. The paper resides in the Pure Linear Attention Transformers leaf, which contains four papers total including LinearSR itself. This represents a relatively sparse research direction within the broader taxonomy of thirty papers, suggesting that pure linear attention approaches for super-resolution remain an emerging area compared to hybrid or alternative efficient attention mechanisms.

The taxonomy reveals that LinearSR's leaf sits within the larger Linear Attention Architectures for Super-Resolution branch, which also includes Hybrid Linear Attention and CNN Architectures (three papers) and Multi-Scale Linear Attention Networks (two papers). Neighboring branches explore fundamentally different efficiency strategies: Alternative Efficient Attention Mechanisms uses window-based or approximation methods (nine papers total), while State Space Models and Alternative Architectures employs Mamba-based or recurrent designs (four papers). The scope note for LinearSR's leaf explicitly excludes hybrid CNN-Transformer models and softmax attention variants, positioning this work as committed to pure linear attention throughout the architecture.

Among nineteen candidates examined across three contributions, no refutable prior work was identified. The ESGF strategy examined ten candidates with zero refutations, the SNR-based MoE examined seven candidates with zero refutations, and TAG examined two candidates with zero refutations. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of training stabilization, expert routing, and guidance mechanisms appears distinct from examined prior work. However, the relatively small candidate pools per contribution (two to ten papers) mean the analysis covers a focused subset of potentially relevant literature rather than an exhaustive survey.

Based on the limited search scope of nineteen candidates, LinearSR's contributions appear to occupy a relatively unexplored intersection of training stability, perception-distortion balancing, and guidance design within pure linear attention super-resolution. The sparse population of its taxonomy leaf and absence of refutable overlaps among examined candidates suggest novelty, though the analysis does not cover the full landscape of diffusion-based super-resolution or broader generative modeling literature that might contain related techniques.

Taxonomy

Core-task Taxonomy Papers: 30
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: Efficient image super-resolution using linear attention mechanisms. The field of efficient super-resolution has evolved into several distinct branches that balance reconstruction quality with computational cost. Linear Attention Architectures for Super-Resolution represents the primary thrust, exploring pure linear attention transformers and hybrid designs that replace quadratic self-attention with linear-complexity variants to enable processing of high-resolution images. Alternative Efficient Attention Mechanisms encompasses works like Directional Variance Attention[1] and Dual Compression Transformer[2] that reduce attention overhead through spatial compression or selective computation rather than full linearization. State Space Models and Alternative Architectures introduces fundamentally different sequence modeling paradigms, while Domain-Specific Efficient Super-Resolution targets specialized applications such as remote sensing or medical imaging. Continuous and Implicit Representation Methods, exemplified by HIIF[6] and Super-Resolution Neural Operator[4], shift toward coordinate-based neural fields, and Classical Linear Mapping Approaches like Multiple Linear Mappings[21] provide foundational regression-based baselines.

Within the linear attention landscape, a central tension exists between pure linear formulations that achieve strict linear complexity and hybrid approaches that selectively retain some quadratic components for critical features. LinearSR[0] sits squarely in the Pure Linear Attention Transformers cluster, emphasizing complete replacement of softmax attention with linear kernels to maintain efficiency across all layers. This contrasts with nearby works such as LCFormer[13] and Linear Adaptive Dimensions[15], which explore adaptive mechanisms that modulate linear attention based on local image statistics or learnable dimension adjustments. Meanwhile, Rank Enhanced Attention[12] investigates low-rank decompositions to further compress attention matrices.

The key open question across these branches remains whether pure linear attention can match the representational power of quadratic attention for fine texture recovery, or whether carefully designed hybrid or adaptive schemes offer a better efficiency-quality trade-off for practical super-resolution deployment.
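To make the complexity contrast concrete, the following sketch compares quadratic softmax attention with a kernelized linear variant. The feature map and shapes are illustrative assumptions for exposition, not LinearSR's actual architecture:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Quadratic attention: materializes the full N x N score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Linear attention: replace exp(q.k) with phi(q).phi(k) for a positive
    # feature map phi, then exploit associativity:
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V), which costs O(N d^2).
    phi = lambda x: np.maximum(x, 0) + eps           # illustrative positive map
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                                    # (d, d) summary, independent of N
    z = Qp @ Kp.sum(axis=0)                          # per-query normalizer, shape (N,)
    return (Qp @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

The key point is that the (d, d) summary `kv` never grows with sequence length, so doubling the number of pixels doubles (rather than quadruples) the attention cost.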

Claimed Contributions

Early-Stopping Guided Fine-tuning (ESGF) strategy

A training methodology that identifies a critical knee-point checkpoint in the loss landscape where the model achieves optimal generalization before performance degrades. Fine-tuning is initialized from this stable checkpoint to prevent catastrophic training collapse when applying linear attention to super-resolution.

10 retrieved papers
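As a rough illustration of knee-point checkpoint selection (the paper's exact ESGF criterion is not reproduced here), one could pick the step with the largest gap below the chord joining the first and last validation losses, in the spirit of the Kneedle method; the synthetic loss curve below is a hypothetical stand-in:

```python
import numpy as np

def knee_point(losses):
    """Locate the knee of a loss curve: the step with the largest vertical
    gap below the chord joining the first and last points. A simple
    stand-in for how an ESGF-style strategy might pick a stable
    checkpoint before divergence sets in."""
    losses = np.asarray(losses, dtype=float)
    x = np.arange(len(losses))
    # Chord from (0, losses[0]) to (n-1, losses[-1]).
    chord = losses[0] + (losses[-1] - losses[0]) * x / (len(losses) - 1)
    return int(np.argmax(chord - losses))

# Synthetic curve: fast improvement, plateau, then late divergence.
steps = np.arange(50)
curve = np.exp(-steps / 5.0) + 0.002 * np.maximum(steps - 30, 0) ** 2
ckpt = knee_point(curve)
print(ckpt)  # an index in the plateau, well before the divergence
```

Fine-tuning would then resume from the checkpoint saved at `ckpt` rather than from the final (diverged) weights.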
SNR-based Mixture of Experts (MoE) architecture

A specialized expert architecture that partitions the generative trajectory using hierarchical log-SNR bisection, assigning different experts to handle structure generation at high-noise stages and detail refinement at low-noise stages, thereby addressing the perception-distortion trade-off in super-resolution.

7 retrieved papers
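A minimal sketch of routing by log-SNR bisection, assuming a symmetric log-SNR range and uniform recursive splits (both illustrative assumptions; the paper's actual boundaries and gating may differ):

```python
def logsnr_expert_router(logsnr_min, logsnr_max, depth):
    """Partition the log-SNR axis by recursive bisection and map each
    interval to an expert index. depth=1 gives 2 experts (structure vs.
    detail), depth=2 gives 4, and so on. Illustrative only."""
    n = 2 ** depth
    step = (logsnr_max - logsnr_min) / n
    boundaries = [logsnr_min + i * step for i in range(1, n)]

    def route(logsnr):
        # High noise (low log-SNR) -> low expert index (global structure);
        # low noise (high log-SNR) -> high index (detail refinement).
        return sum(logsnr >= b for b in boundaries)

    return route, boundaries

route, bounds = logsnr_expert_router(-10.0, 10.0, depth=2)  # 4 experts
print([route(s) for s in (-9.0, -3.0, 3.0, 9.0)])  # -> [0, 1, 2, 3]
```

Each diffusion timestep is thus handled by exactly one expert, so the high-noise experts can specialize in coarse structure while the low-noise experts focus on texture, which is how an SNR-partitioned MoE addresses the perception-distortion trade-off.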
TAG guidance paradigm based on precision-over-volume principle

A guidance approach using concise object-level tags rather than verbose text descriptions or raw visual features. This principle demonstrates that a smaller, targeted guidance signal extracted from the low-resolution image itself is more effective for super-resolution than external semantic information.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Early-Stopping Guided Fine-tuning (ESGF) strategy


Contribution

SNR-based Mixture of Experts (MoE) architecture


Contribution

TAG guidance paradigm based on precision-over-volume principle
