Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Representation Learning, Few-Shot Anomaly Detection, Applications of Foundation Models
Abstract:

Few-shot anomaly detection streamlines industrial safety inspection. However, limited samples make it difficult to accurately distinguish normal from abnormal features, all the more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the amount of anomaly in an image correlates directly with distance in the learned embedding space, and we exploit this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for characterizing and identifying out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance compared to other approaches, while surpassing them in model size and inference efficiency. Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection. Our code will be made public.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FoundAD, a few-shot anomaly detector that learns a nonlinear projection operator onto the natural image manifold using foundation visual encoders. It sits within the Self-Supervised Vision Encoders leaf of the taxonomy, alongside two sibling papers: Anomalydino and MAEDAY. This leaf is part of the Vision-Only Foundation Models branch, which contains three leaves and represents a moderately populated research direction. The taxonomy shows that self-supervised vision encoders form a distinct approach compared to supervised pre-trained models or multi-encoder fusion strategies, indicating a focused but not overcrowded niche.

The taxonomy reveals that Vision-Only Foundation Models is one of several major branches, with neighboring directions including Vision-Language Model Adaptation (prompt-based learning, feature alignment) and Multimodal Large Language Models (VQA, reasoning). The scope note for Self-Supervised Vision Encoders explicitly excludes supervised ImageNet models and multi-encoder fusion, clarifying that FoundAD's reliance on self-supervised features (including DINOv2) distinguishes it from methods combining multiple encoders or leveraging language modality. The broader Vision-Only branch emphasizes feature extraction without textual supervision, contrasting with the prompt engineering and cross-modal alignment strategies prevalent in adjacent branches.

Among the three contributions analyzed, none were clearly refuted by the 30 candidates examined. Contribution A (correlation between anomaly amount and embedding distance) examined 10 candidates with zero refutable matches. Contribution B (FoundAD manifold projection) and Contribution C (text-free multi-class framework) each examined 10 candidates, also with zero refutable matches. This suggests that within the limited search scope, the specific combination of manifold projection and multi-class detection using foundation encoders appears relatively underexplored. However, the sibling papers Anomalydino and MAEDAY likely share overlapping feature extraction strategies, indicating that the core novelty may reside in the projection operator design rather than the encoder choice itself.

Based on the limited literature search of 30 candidates, the work appears to occupy a moderately novel position within self-supervised vision-only anomaly detection. The absence of refutable prior work across all contributions suggests that the specific manifold projection approach and multi-class framework may not have direct precedents in the examined set. However, the analysis does not cover the full breadth of vision-language or generative methods, and the sibling papers indicate that foundation encoder usage for anomaly detection is an active area. The novelty likely hinges on the projection operator's design and efficiency claims rather than the foundational encoder concept.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: few-shot anomaly detection using foundation visual encoders. The field has evolved around leveraging pretrained visual representations to identify defects or outliers with minimal labeled examples.

The taxonomy reveals several major branches. Vision-Language Model Adaptation methods harness CLIP-like architectures to align textual and visual cues for anomaly scoring, often through prompt engineering or fine-tuning strategies (e.g., Winclip[2], Promptad[6]). Vision-Only Foundation Models rely on self-supervised encoders such as DINO or masked autoencoders to extract rich features without language supervision (e.g., Anomalydino[11], MAEDAY[20]). Multimodal Large Language Models integrate reasoning capabilities to interpret visual anomalies in context (e.g., LLMs Visual Anomalies[23], Light MLLMAD[40]). Generative Model-Based Approaches synthesize or reconstruct normal patterns to highlight deviations, while Unified and Generalist Frameworks aim for cross-domain applicability (e.g., UniVAD[14], Generalist InContext Residual[7]). Domain-Specific Applications target sectors like medical imaging, fabric inspection, or industrial quality control, and Auxiliary Techniques provide methodological foundations such as outlier synthesis or curriculum learning.

Recent work explores trade-offs between adaptation complexity and generalization: vision-language methods offer semantic interpretability but may require careful prompt design, whereas vision-only encoders like those in Anomalydino[11] or MAEDAY[20] emphasize robustness through self-supervised pretraining.

Foundation Visual Encoders[0] sits within the Vision-Only Foundation Models branch, specifically among Self-Supervised Vision Encoders, sharing conceptual ground with Anomalydino[11] and MAEDAY[20]. While these neighbors leverage DINO-based or masked autoencoder features, Foundation Visual Encoders[0] likely emphasizes a distinct encoder architecture or training regime to capture anomaly-relevant patterns in few-shot settings. Open questions persist around optimal feature extraction strategies, the role of domain-specific fine-tuning versus zero-shot transfer, and how to balance computational efficiency with detection accuracy across diverse anomaly types.

Claimed Contributions

Correlation between anomaly amount and embedding distance in foundation encoders

The authors reveal that, for foundation visual encoders, the anomalous pixel area in an image correlates directly with distances in the learned embedding space. This observation forms the basis for their anomaly detection approach.

10 retrieved papers
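The claimed correlation can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the authors' implementation: the "encoder" is a fixed random linear map, and anomalies are simulated as patches with shifted statistics. The point is only that an image-level score built from embedding distances to a few-shot normal support set grows with the amount of anomalous content.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(patches):
    # Hypothetical stand-in for a frozen foundation encoder:
    # a fixed random linear map, just to keep the sketch runnable.
    W = np.random.default_rng(42).normal(size=(patches.shape[1], 64))
    return patches @ W

# Few-shot support set: embeddings of clean (normal) patches.
support = embed(rng.normal(size=(32, 16)))

def image_score(patches):
    # Image-level score: mean distance from each patch embedding
    # to its nearest normal support embedding.
    feats = embed(patches)
    d = np.linalg.norm(feats[:, None, :] - support[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Images with a growing number of anomalous patches (out of 32).
scores = []
for k in [0, 4, 8, 16]:
    patches = rng.normal(size=(32, 16))
    patches[:k] += 3.0  # a simple simulated defect: shifted statistics
    scores.append(image_score(patches))
# scores increases with k: more anomalous area, larger embedding distance
```

In this toy setting the score is monotone in the number of anomalous patches, mirroring the correlation the contribution describes.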
FoundAD: Few-shot anomaly detector using manifold projection

The authors introduce FoundAD, a few-shot anomaly detection method that learns a lightweight nonlinear projection operator to map feature embeddings onto the natural image manifold learned by foundation models. The projector enables effective anomaly detection with minimal training samples.

10 retrieved papers
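The general idea of a learned projection onto the normal manifold can be sketched as follows. This is a hypothetical illustration, not the authors' architecture: the "manifold" is a random 8-dimensional linear subspace, and the projector is a one-hidden-layer tanh network trained by plain gradient descent to reproduce normal features; patches are then scored by their projection residual.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 32, 8  # feature dim and bottleneck dim (invented sizes)

# Normal features lie near a low-dimensional manifold; here we fake
# this with a random linear embedding of an 8-d latent space.
A = rng.normal(size=(H, D))
normal_feats = rng.normal(size=(256, H)) @ A

# Lightweight nonlinear projector (one tanh hidden layer), trained to
# reproduce normal features, i.e. to project onto the normal manifold.
W1 = rng.normal(size=(D, H)) * 0.05
W2 = rng.normal(size=(H, D)) * 0.05

def project(x):
    return np.tanh(x @ W1) @ W2

lr = 0.01
for _ in range(500):
    h = np.tanh(normal_feats @ W1)
    err = h @ W2 - normal_feats          # reconstruction residual
    gW2 = h.T @ err / len(normal_feats)
    gW1 = normal_feats.T @ (err @ W2.T * (1 - h**2)) / len(normal_feats)
    W1 -= lr * gW1
    W2 -= lr * gW2

def anomaly_map(feats):
    # Per-patch score: distance between a feature and its projection.
    return np.linalg.norm(feats - project(feats), axis=-1)

on_manifold = rng.normal(size=(64, H)) @ A                 # normal-like
off_manifold = on_manifold + rng.normal(size=(64, D)) * 2  # "defective"
# anomaly_map(off_manifold) is larger on average than on-manifold scores
```

Off-manifold features cannot be reproduced by the projector, so their residuals are large; this residual map plays the role of the anomaly score in the contribution's description.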
Text-free multi-class anomaly detection framework

The authors demonstrate that foundation visual features alone, without textual assistance or prompts, are sufficient for effective few-shot anomaly detection. Their approach supports multi-class detection using substantially fewer parameters than prior methods.

10 retrieved papers
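The multi-class, text-free claim can be sketched like this (category names, sizes, and statistics are invented for illustration, not taken from the paper): a single bank of few-shot support features pooled across categories, scored by one class-agnostic distance function, with no prompts or per-class heads anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32  # hypothetical feature dimension

# Few-shot support features per category (stand-ins for frozen-encoder
# outputs); note there are no text prompts or class-specific modules.
support = {
    "screw": rng.normal(size=(16, D)),
    "cable": rng.normal(size=(16, D)) + 1.0,  # different class statistics
}

# One shared bank built from all categories' support features.
bank = np.concatenate(list(support.values()))

def score(feats):
    # Class-agnostic score: distance to the nearest support feature,
    # whatever category it came from.
    d = np.linalg.norm(feats[:, None, :] - bank[None, :, :], axis=-1)
    return d.min(axis=1)

normal_screws = rng.normal(size=(8, D))   # in-distribution samples
defective = rng.normal(size=(8, D)) + 4.0  # out-of-distribution samples
# score(defective) is larger on average than score(normal_screws)
```

Because the bank and the scoring rule are shared across categories, adding a new class only means appending its few support features, which is the sense in which such a framework is multi-class without textual assistance.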

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

