AnyUp: Universal Feature Upsampling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: feature upsampling, representation learning
Abstract:

We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AnyUp, an inference-time feature upsampling method designed to work with any vision encoder at any resolution without encoder-specific training. Within the taxonomy, it resides in the 'Inference-Time Universal Upsamplers' leaf, which contains only two papers, including this one. This is a sparse research direction within the broader 'Universal Feature Upsampling Methods' branch, suggesting the work addresses a relatively underexplored problem space compared to the encoder-specific and task-specific upsampling approaches that dominate the other branches.

The taxonomy reveals that most upsampling research concentrates on encoder-specific methods (six papers across vision foundation models and transformer architectures) or task-specific approaches (fifteen papers spanning super-resolution, dense prediction, and domain-specific applications). The 'Trainable Universal Upsamplers' sibling branch contains three papers that require training on diverse features, whereas AnyUp diverges by eliminating encoder-specific training entirely: it is trained once and then applied to new feature types at inference time. This positioning suggests the work bridges the gap between the flexibility of universal methods and the practicality of zero-shot deployment.

Among twenty candidates examined across three contributions, none were identified as clearly refuting the core claims. The main contribution 'AnyUp: feature-agnostic upsampling model' examined ten candidates with zero refutable matches, as did the 'Feature-agnostic layer' contribution. The 'Window attention architecture with crop-based training' was not evaluated against any candidates. Given this limited search scope of twenty papers from semantic search and citation expansion, the analysis suggests no immediate prior work overlap within the examined set, though the small candidate pool means substantial related work may exist beyond this sample.

Based on the limited literature search covering twenty candidates, the work appears to occupy a novel position within a sparse research direction. The taxonomy structure indicates that while universal upsampling is an established goal, inference-time approaches without training remain rare. However, the small search scope and the presence of only one sibling paper limit confidence in assessing broader field coverage or potential overlaps with work outside the examined candidates.

Taxonomy

Core-task taxonomy papers: 24
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 0

Research Landscape Overview

Core task: universal feature upsampling across vision encoders and resolutions. The field addresses the challenge of recovering high-resolution spatial detail from coarse feature maps produced by diverse vision encoders, which is essential for dense prediction tasks such as segmentation and depth estimation.

The taxonomy organizes approaches into three main branches. Universal Feature Upsampling Methods aim to develop encoder-agnostic techniques that generalize across different backbone architectures and input resolutions, often leveraging learned upsampling modules or implicit representations. Encoder-Specific Feature Upsampling tailors solutions to particular network families, exploiting architectural priors or training-time adaptations to achieve tighter integration with specific encoders. Task-Specific Resolution Enhancement focuses on domain-driven strategies, where upsampling is optimized for particular applications such as medical imaging, remote sensing, or video analysis, often incorporating task-relevant inductive biases.

Recent work has explored trade-offs between generality and performance. Universal methods like FeatUp[18] and Upsample Anything[20] pursue broad applicability by training upsampling networks that can handle features from multiple encoders without retraining, while AnyUp[0] extends this paradigm by proposing an inference-time universal upsampler that adapts on-the-fly to unseen encoders and resolutions. This contrasts with encoder-specific approaches such as Cross Resolution Attention[1] or task-driven techniques like MGD-SAM2[5], which sacrifice some generality for tighter coupling to particular architectures or domains. A key open question is whether universal upsamplers can match the fidelity of specialized methods while maintaining their flexibility.
AnyUp[0] sits within the inference-time universal branch alongside Upsample Anything[20], emphasizing zero-shot adaptability, whereas FeatUp[18] represents an earlier training-based universal approach that requires pre-training on a fixed set of encoders.

Claimed Contributions

AnyUp: feature-agnostic upsampling model

AnyUp is a universal feature upsampling method that can be trained once and then applied to features from any vision encoder at any resolution without requiring encoder-specific retraining, unlike existing methods that must be retrained for each feature extractor.

10 retrieved papers
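The train-once, apply-to-any-encoder claim implies an upsampling interface that is agnostic to the channel dimension of its input. The sketch below illustrates such an interface with plain bilinear interpolation standing in for the learned upsampler; the function name and the use of NumPy are illustrative assumptions, not AnyUp's implementation.

```python
import numpy as np

def upsample_features(features, out_h, out_w):
    """Resize a (C, h, w) feature map to (C, out_h, out_w) by bilinear
    interpolation. Stand-in for a learned upsampler: the point is the
    feature-agnostic interface, which accepts any channel count C."""
    C, h, w = features.shape
    ys = np.linspace(0.0, h - 1.0, out_h)
    xs = np.linspace(0.0, w - 1.0, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]   # vertical interpolation weights
    wx = (xs - x0)[None, None, :]   # horizontal interpolation weights
    top = features[:, y0][:, :, x0] * (1 - wx) + features[:, y0][:, :, x1] * wx
    bot = features[:, y1][:, :, x0] * (1 - wx) + features[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

# The same call handles DINO-sized (384-d) and CLIP-sized (512-d) features;
# an encoder-specific upsampler would need retraining for each.
print(upsample_features(np.random.randn(384, 16, 16), 64, 64).shape)  # (384, 64, 64)
print(upsample_features(np.random.randn(512, 24, 24), 96, 96).shape)  # (512, 96, 96)
```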
Feature-agnostic layer

A convolutional layer design that processes input channels independently using a learned kernel basis and aggregates contributions across channels, enabling the model to handle features of arbitrary dimensionality while capturing structural information.

10 retrieved papers
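The described mechanism, channel-independent filtering with a shared kernel basis followed by cross-channel aggregation, can be sketched as follows. The mean aggregation and the 3x3 basis size are assumptions for illustration; the report does not specify AnyUp's exact formulation.

```python
import numpy as np

def feature_agnostic_layer(features, basis):
    """Convolve every input channel independently with each of K shared
    basis kernels, then aggregate over channels (here: mean), so the
    output dimension K is independent of the input channel count C.

    features: (C, h, w), basis: (K, 3, 3) -> output (K, h, w)."""
    C, h, w = features.shape
    K = basis.shape[0]
    padded = np.pad(features, ((0, 0), (1, 1), (1, 1)))  # zero-pad spatially
    out = np.empty((K, h, w))
    for k in range(K):
        resp = np.zeros((C, h, w))
        for dy in range(3):          # explicit 3x3 convolution, per channel
            for dx in range(3):
                resp += basis[k, dy, dx] * padded[:, dy:dy + h, dx:dx + w]
        out[k] = resp.mean(axis=0)   # aggregate contributions across channels
    return out

# Arbitrary input dimensionality, fixed output dimensionality:
basis = np.random.randn(8, 3, 3)
print(feature_agnostic_layer(np.random.randn(384, 16, 16), basis).shape)  # (8, 16, 16)
print(feature_agnostic_layer(np.random.randn(512, 16, 16), basis).shape)  # (8, 16, 16)
```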
Window attention architecture with crop-based training

An upsampling architecture that restricts attention computation to local windows and employs a training strategy using randomly sampled image crops as supervision, combined with consistency regularization to preserve the original feature space.

0 retrieved papers
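Restricting attention to local windows reduces the cost from quadratic in the total number of tokens to quadratic only in the window size, which is what makes attention-based upsampling affordable at high output resolutions. Below is a minimal sketch of windowed self-attention (non-overlapping windows, single head); AnyUp's exact variant and the crop-based training loop are not specified in this report.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(q, k, v, window=4):
    """Self-attention computed independently inside each non-overlapping
    window x window patch. q, k, v: (H, W, d) with H, W divisible by
    `window`. Cost is O(H * W * window**2 * d) instead of O((H * W)**2 * d)."""
    H, W, d = q.shape
    out = np.empty_like(v)
    for y in range(0, H, window):
        for x in range(0, W, window):
            Q = q[y:y + window, x:x + window].reshape(-1, d)
            K = k[y:y + window, x:x + window].reshape(-1, d)
            V = v[y:y + window, x:x + window].reshape(-1, d)
            A = softmax(Q @ K.T / np.sqrt(d))  # (window^2, window^2)
            out[y:y + window, x:x + window] = (A @ V).reshape(window, window, d)
    return out

q, k = np.random.randn(2, 8, 8, 16)
print(window_attention(q, k, np.ones((8, 8, 16))).shape)  # (8, 8, 16)
```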

Core Task Comparisons

Comparisons with papers in the same taxonomy category found no refuting matches within the examined candidate set.

Contribution Analysis

Detailed comparisons were run for each claimed contribution, as described above under 'Claimed Contributions'. For 'AnyUp: feature-agnostic upsampling model' and 'Feature-agnostic layer', ten candidates each were compared with zero refutable matches; 'Window attention architecture with crop-based training' retrieved no candidates and was not evaluated.