Foundation Visual Encoders Are Secretly Few-Shot Anomaly Detectors

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Representation Learning, Few-Shot Anomaly Detection, Applications of Foundation Models
Abstract:

Few-shot anomaly detection streamlines industrial safety inspection. However, limited samples make it difficult to accurately distinguish normal from abnormal features, all the more so under category-agnostic conditions. Large-scale pre-training of foundation visual encoders has advanced many fields, as the enormous quantity of data helps to learn the general distribution of normal images. We observe that the amount of anomaly in an image correlates directly with distance in the learned embedding space, and we exploit this to design a few-shot anomaly detector termed FoundAD. This is done by learning a nonlinear projection operator onto the natural image manifold. The simple operator acts as an effective tool for characterizing and identifying out-of-distribution regions in an image. Extensive experiments show that our approach supports multi-class detection and achieves competitive performance compared to other approaches, while surpassing them in model size and inference efficiency. Backed by evaluations with multiple foundation encoders, including the recent DINOv3, we believe this idea broadens the perspective on foundation features and advances the field of few-shot anomaly detection. Our code will be made public.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FoundAD, a few-shot anomaly detector that learns a nonlinear projection operator onto the natural image manifold using foundation visual encoders. It sits within the Self-Supervised Vision Encoders leaf of the taxonomy, alongside two sibling papers: Anomalydino and MAEDAY. This leaf is part of the Vision-Only Foundation Models branch, which contains three leaves and represents a moderately populated research direction. The taxonomy shows that self-supervised vision encoders form a distinct approach compared to supervised pre-trained models or multi-encoder fusion strategies, indicating a focused but not overcrowded niche.

The taxonomy reveals that Vision-Only Foundation Models is one of several major branches, with neighboring directions including Vision-Language Model Adaptation (prompt-based learning, feature alignment) and Multimodal Large Language Models (VQA, reasoning). The scope note for Self-Supervised Vision Encoders explicitly excludes supervised ImageNet models and multi-encoder fusion, clarifying that FoundAD's reliance on self-supervised features (including DINOv2) distinguishes it from methods combining multiple encoders or leveraging language modality. The broader Vision-Only branch emphasizes feature extraction without textual supervision, contrasting with the prompt engineering and cross-modal alignment strategies prevalent in adjacent branches.

Among the three contributions analyzed, none were clearly refuted by the 30 candidates examined. Contribution A (correlation between anomaly amount and embedding distance) examined 10 candidates with zero refutable matches. Contribution B (FoundAD manifold projection) and Contribution C (text-free multi-class framework) each examined 10 candidates, also with zero refutable matches. This suggests that within the limited search scope, the specific combination of manifold projection and multi-class detection using foundation encoders appears relatively underexplored. However, the sibling papers Anomalydino and MAEDAY likely share overlapping feature extraction strategies, indicating that the core novelty may reside in the projection operator design rather than the encoder choice itself.

Based on the limited literature search of 30 candidates, the work appears to occupy a moderately novel position within self-supervised vision-only anomaly detection. The absence of refutable prior work across all contributions suggests that the specific manifold projection approach and multi-class framework may not have direct precedents in the examined set. However, the analysis does not cover the full breadth of vision-language or generative methods, and the sibling papers indicate that foundation encoder usage for anomaly detection is an active area. The novelty likely hinges on the projection operator's design and efficiency claims rather than the foundational encoder concept.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: few-shot anomaly detection using foundation visual encoders. The field has evolved around leveraging pretrained visual representations to identify defects or outliers with minimal labeled examples.

The taxonomy reveals several major branches. Vision-Language Model Adaptation methods harness CLIP-like architectures to align textual and visual cues for anomaly scoring, often through prompt engineering or fine-tuning strategies (e.g., Winclip[2], Promptad[6]). Vision-Only Foundation Models rely on self-supervised encoders such as DINO or masked autoencoders to extract rich features without language supervision (e.g., Anomalydino[11], MAEDAY[20]). Multimodal Large Language Models integrate reasoning capabilities to interpret visual anomalies in context (e.g., LLMs Visual Anomalies[23], Light MLLMAD[40]). Generative Model-Based Approaches synthesize or reconstruct normal patterns to highlight deviations, while Unified and Generalist Frameworks aim for cross-domain applicability (e.g., UniVAD[14], Generalist InContext Residual[7]). Domain-Specific Applications target sectors like medical imaging, fabric inspection, or industrial quality control, and Auxiliary Techniques provide methodological foundations such as outlier synthesis or curriculum learning.

Recent work explores trade-offs between adaptation complexity and generalization: vision-language methods offer semantic interpretability but may require careful prompt design, whereas vision-only encoders like those in Anomalydino[11] or MAEDAY[20] emphasize robustness through self-supervised pretraining.

Foundation Visual Encoders[0] sits within the Vision-Only Foundation Models branch, specifically among Self-Supervised Vision Encoders, sharing conceptual ground with Anomalydino[11] and MAEDAY[20]. While these neighbors leverage DINO-based or masked autoencoder features, Foundation Visual Encoders[0] likely emphasizes a distinct encoder architecture or training regime to capture anomaly-relevant patterns in few-shot settings. Open questions persist around optimal feature extraction strategies, the role of domain-specific fine-tuning versus zero-shot transfer, and how to balance computational efficiency with detection accuracy across diverse anomaly types.

Claimed Contributions

Correlation between anomaly amount and embedding distance in foundation encoders

The authors reveal that, for foundation visual encoders, the anomalous pixel area in an image correlates directly with distances in the learned embedding space. This observation forms the basis for their anomaly detection approach.

10 retrieved papers
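The claimed correlation can be illustrated with a toy sketch. Everything below is a hypothetical stand-in, not the authors' implementation: the "encoder" is a fixed random linear map, and anomalies are simulated as patches with shifted statistics. The point is only that an image-level score built from embedding distances to a few-shot normal support set grows with the amount of anomalous content.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(patches):
    # Hypothetical stand-in for a frozen foundation encoder:
    # a fixed random linear map, just to keep the sketch runnable.
    W = np.random.default_rng(42).normal(size=(patches.shape[1], 64))
    return patches @ W

# Few-shot support set: embeddings of clean (normal) patches.
support = embed(rng.normal(size=(32, 16)))

def image_score(patches):
    # Image-level score: mean distance from each patch embedding
    # to its nearest normal support embedding.
    feats = embed(patches)
    d = np.linalg.norm(feats[:, None, :] - support[None, :, :], axis=-1)
    return d.min(axis=1).mean()

# Images with a growing number of anomalous patches (out of 32).
scores = []
for k in [0, 4, 8, 16]:
    patches = rng.normal(size=(32, 16))
    patches[:k] += 3.0  # a simple simulated defect: shifted statistics
    scores.append(image_score(patches))
# scores increases with k: more anomalous area, larger embedding distance
```

In this toy setting the score is monotone in the number of anomalous patches, mirroring the correlation the contribution describes.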
FoundAD: Few-shot anomaly detector using manifold projection

The authors introduce FoundAD, a few-shot anomaly detection method that learns a lightweight nonlinear projection operator to map feature embeddings onto the natural image manifold learned by foundation models. The projector enables effective anomaly detection with minimal training samples.

10 retrieved papers
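The general idea of a learned projection onto the normal manifold can be sketched as follows. This is a hypothetical illustration, not the authors' architecture: the "manifold" is a random 8-dimensional linear subspace, and the projector is a one-hidden-layer tanh network trained by plain gradient descent to reproduce normal features; patches are then scored by their projection residual.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 32, 8  # feature dim and bottleneck dim (invented sizes)

# Normal features lie near a low-dimensional manifold; here we fake
# this with a random linear embedding of an 8-d latent space.
A = rng.normal(size=(H, D))
normal_feats = rng.normal(size=(256, H)) @ A

# Lightweight nonlinear projector (one tanh hidden layer), trained to
# reproduce normal features, i.e. to project onto the normal manifold.
W1 = rng.normal(size=(D, H)) * 0.05
W2 = rng.normal(size=(H, D)) * 0.05

def project(x):
    return np.tanh(x @ W1) @ W2

lr = 0.01
for _ in range(500):
    h = np.tanh(normal_feats @ W1)
    err = h @ W2 - normal_feats          # reconstruction residual
    gW2 = h.T @ err / len(normal_feats)
    gW1 = normal_feats.T @ (err @ W2.T * (1 - h**2)) / len(normal_feats)
    W1 -= lr * gW1
    W2 -= lr * gW2

def anomaly_map(feats):
    # Per-patch score: distance between a feature and its projection.
    return np.linalg.norm(feats - project(feats), axis=-1)

on_manifold = rng.normal(size=(64, H)) @ A                 # normal-like
off_manifold = on_manifold + rng.normal(size=(64, D)) * 2  # "defective"
# anomaly_map(off_manifold) is larger on average than on-manifold scores
```

Off-manifold features cannot be reproduced by the projector, so their residuals are large; this residual map plays the role of the anomaly score in the contribution's description.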
Text-free multi-class anomaly detection framework

The authors demonstrate that foundation visual features alone, without textual assistance or prompts, are sufficient for effective few-shot anomaly detection. Their approach supports multi-class detection using substantially fewer parameters than prior methods.

10 retrieved papers
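The multi-class, text-free claim can be sketched like this (category names, sizes, and statistics are invented for illustration, not taken from the paper): a single bank of few-shot support features pooled across categories, scored by one class-agnostic distance function, with no prompts or per-class heads anywhere.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32  # hypothetical feature dimension

# Few-shot support features per category (stand-ins for frozen-encoder
# outputs); note there are no text prompts or class-specific modules.
support = {
    "screw": rng.normal(size=(16, D)),
    "cable": rng.normal(size=(16, D)) + 1.0,  # different class statistics
}

# One shared bank built from all categories' support features.
bank = np.concatenate(list(support.values()))

def score(feats):
    # Class-agnostic score: distance to the nearest support feature,
    # whatever category it came from.
    d = np.linalg.norm(feats[:, None, :] - bank[None, :, :], axis=-1)
    return d.min(axis=1)

normal_screws = rng.normal(size=(8, D))   # in-distribution samples
defective = rng.normal(size=(8, D)) + 4.0  # out-of-distribution samples
# score(defective) is larger on average than score(normal_screws)
```

Because the bank and the scoring rule are shared across categories, adding a new class only means appending its few support features, which is the sense in which such a framework is multi-class without textual assistance.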

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

