MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Anomaly detection, Zero-shot anomaly detection, Memory retrieval, CLIP
Abstract:

Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods rely on prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose the Memory-Retrieval Anomaly Detection (MRAD) method, a unified framework that replaces parametric fitting with direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, in which feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Building on MRAD-TF, we further propose two lightweight variants: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomalous samples; (ii) MRAD-CLIP injects normal and anomalous region priors from MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. Code will be released.
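The abstract's core mechanism, scoring a test sample by similarity retrieval over stored feature-label pairs, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the choice of k, the cosine similarity measure, and the similarity-weighted label average are all assumptions made for concreteness.

```python
import numpy as np

class MemoryBank:
    """Hypothetical sketch: stores (feature, label) pairs as keys and values.
    Label 1.0 marks anomalous exemplars, 0.0 marks normal ones."""

    def __init__(self, features: np.ndarray, labels: np.ndarray):
        # L2-normalize keys so a dot product equals cosine similarity.
        self.keys = features / np.linalg.norm(features, axis=1, keepdims=True)
        self.values = labels.astype(float)

    def score(self, query: np.ndarray, k: int = 5) -> float:
        """Anomaly score: similarity-weighted mean label of the k nearest keys."""
        q = query / np.linalg.norm(query)
        sims = self.keys @ q                 # cosine similarity to every key
        top = np.argsort(sims)[-k:]          # indices of the k most similar keys
        weights = np.exp(sims[top])          # soft weighting (assumed, not from paper)
        return float((weights * self.values[top]).sum() / weights.sum())
```

A query landing near stored anomalous features retrieves mostly label-1 neighbours and scores close to 1; a query near normal features scores close to 0, with no parametric model fit in between.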

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MRAD, a memory-retrieval framework for zero-shot anomaly detection that replaces parametric fitting with direct similarity-based retrieval from two-level memory banks. It resides in the 'Pseudo-Anomaly and Correlation-Weighted Approaches' leaf under Vision-Language Model-Based Industrial Anomaly Detection, alongside two sibling papers. This leaf represents a focused research direction within the broader taxonomy of 25 papers across multiple modalities, suggesting a moderately active but not overcrowded area where CLIP-based industrial defect detection methods explore different strategies for zero-shot generalization.

The taxonomy reveals that MRAD's leaf sits within a larger branch of Vision-Language Model-Based Industrial Anomaly Detection, which also includes Multi-Scale Memory Comparison Frameworks and Additive Manufacturing Anomaly Detection. Neighboring branches address Video Anomaly Detection with Temporal Memory and Log Anomaly Detection with Retrieval Augmentation, indicating that memory-driven retrieval is a cross-cutting theme across modalities. The scope note for MRAD's leaf emphasizes pseudo-anomaly generation and correlation weighting, while explicitly excluding multi-scale memory comparison methods that appear in a sibling leaf, suggesting MRAD's single-scale retrieval approach occupies a distinct methodological niche.

Among the three contributions analyzed, the core MRAD framework examined ten candidates and found one refutable prior work, indicating some overlap in the memory-retrieval paradigm within the limited search scope. The MRAD-FT variant examined four candidates with no clear refutations, suggesting its lightweight fine-tuning approach may be more novel. The MRAD-CLIP variant examined ten candidates and found two refutable instances, implying that region-prior-guided dynamic prompts have more substantial prior exploration. These statistics reflect a search of 24 total candidates, not an exhaustive literature review, so the presence of refutable work indicates overlap within this specific sample rather than definitive lack of novelty.

Based on the limited search scope of 24 semantically similar candidates, the work appears to offer incremental refinements to memory-driven retrieval in zero-shot anomaly detection, with the MRAD-FT variant showing the least prior overlap. The taxonomy structure suggests the paper operates in a moderately explored area where CLIP-based industrial methods are actively being developed, though the specific combination of train-free retrieval and lightweight variants may differentiate it from existing pseudo-anomaly and correlation-weighted approaches.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 3

Research Landscape Overview

Core task: zero-shot anomaly detection with memory-driven retrieval. The field organizes around several major branches that reflect different data modalities and application contexts. Vision-Language Model-Based Industrial Anomaly Detection leverages large-scale pretrained models (e.g., CLIP) to identify defects in manufacturing settings without task-specific training, often employing pseudo-anomaly synthesis or correlation-weighted strategies to bridge the gap between normal reference images and unseen anomalies. Video Anomaly Detection with Temporal Memory focuses on sequential data, using memory banks to capture normal motion and appearance patterns over time. Log Anomaly Detection with Retrieval Augmentation applies retrieval-augmented generation techniques to system logs, enabling zero-shot identification of operational faults. Unsupervised Memory-Based Anomaly Detection encompasses broader memory architectures that store prototypical normal features, while Multimodal Language Model-Driven Anomaly Detection integrates textual reasoning with visual or sensor inputs. Cross-Domain and Specialized Anomaly Detection addresses niche scenarios such as spectrum analysis or domain transfer, where memory retrieval helps generalize across diverse settings.

Within Vision-Language Model-Based Industrial Anomaly Detection, a particularly active line of work explores pseudo-anomaly generation and correlation weighting to improve zero-shot performance. PA-CLIP[1] synthesizes artificial defects to guide the model's attention, while Correlation-Weighted Model[2] refines feature alignment between text prompts and visual patches. MRAD[0] sits squarely in this cluster, emphasizing memory-driven retrieval to dynamically select relevant normal exemplars and contrast them against test samples. Compared to PA-CLIP[1], which relies heavily on synthetic anomaly augmentation, MRAD[0] prioritizes retrieval mechanisms that adapt to varying defect types without explicit anomaly simulation. This approach contrasts with Correlation-Weighted Model[2], which focuses on optimizing prompt-image correlation rather than maintaining a structured memory bank.

Across branches, a recurring theme is the trade-off between leveraging large pretrained models for generalization and designing specialized memory structures to capture domain-specific normality, with open questions around scalability and interpretability of retrieved references.

Claimed Contributions

MRAD framework with memory-driven retrieval paradigm

The authors introduce MRAD, a framework that constructs a two-level memory bank (image-level and pixel-level) from auxiliary data and performs anomaly detection through direct similarity retrieval rather than parametric model fitting. This approach stores feature-label pairs explicitly and obtains anomaly scores via retrieval during inference.

10 retrieved papers
Can Refute
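The pixel-level half of the two-level memory bank described in this contribution can be sketched as per-patch retrieval that yields a segmentation-style anomaly map. This is an illustrative assumption: the grid size, neighbourhood size k, and cosine similarity are placeholders, not details taken from the paper.

```python
import numpy as np

def anomaly_map(patch_feats, mem_keys, mem_labels, grid, k=3):
    """Hypothetical pixel-level retrieval sketch.
    patch_feats: (H*W, d) test-image patch features.
    mem_keys:    (N, d) stored patch features; mem_labels: (N,) 0/1 labels.
    Returns an (H, W) map of retrieval-based anomaly scores."""
    q = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    m = mem_keys / np.linalg.norm(mem_keys, axis=1, keepdims=True)
    sims = q @ m.T                              # (H*W, N) cosine similarities
    top = np.argsort(sims, axis=1)[:, -k:]      # k nearest memory entries per patch
    scores = mem_labels[top].mean(axis=1)       # mean neighbour label per patch
    return scores.reshape(grid)
```

Because each patch is scored independently against the stored pairs, localization falls out of the same retrieval rule used for image-level classification.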
MRAD-FT variant with lightweight fine-tuning

Building on the train-free base model, the authors propose MRAD-FT which adds only two linear layers to calibrate the retrieval metric. This lightweight fine-tuning improves discriminative ability for both classification and segmentation tasks while maintaining low training cost.

4 retrieved papers
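The "two linear layers that calibrate the retrieval metric" can be pictured as learned projections applied before similarity is computed. The placement below (one layer for queries, one for memory keys) and the untrained placeholder weights are assumptions for illustration; the paper's actual architecture and training loss are not specified in this report.

```python
import numpy as np

class CalibratedRetrieval:
    """Hypothetical MRAD-FT-style sketch: two linear maps reshape the
    embedding space, then retrieval uses cosine similarity as before."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Two learnable linear layers; initialized near identity here.
        # In a real system these weights would be fine-tuned on auxiliary data.
        self.W_query = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
        self.W_key = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))

    def similarity(self, query: np.ndarray, keys: np.ndarray) -> np.ndarray:
        """Cosine similarity computed in the calibrated space."""
        q = self.W_query @ query
        k = keys @ self.W_key.T
        q = q / np.linalg.norm(q)
        k = k / np.linalg.norm(k, axis=1, keepdims=True)
        return k @ q
```

The appeal of this design is that only the two projection matrices are trained; the frozen encoder and the memory bank itself are untouched, which keeps training cost low.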
MRAD-CLIP variant with region-prior-guided dynamic prompts

The authors develop MRAD-CLIP which enhances traditional prompt learning by injecting normal and anomalous region priors from MRAD-FT into learnable CLIP text prompts as dynamic biases. This approach improves cross-modal alignment, anomaly localization, and generalization to unseen categories compared to conventional dynamic prompt methods.

10 retrieved papers
Can Refute
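The dynamic-bias idea in this contribution, injecting region priors into learnable text prompts, can be sketched as a per-image additive offset on shared context tokens. All names, shapes, and the scalar mixing weight below are illustrative assumptions; the report does not specify where in CLIP's text pipeline the bias is applied.

```python
import numpy as np

def build_dynamic_prompts(
    learnable_tokens: np.ndarray,  # (n_tokens, dim) shared learnable context
    normal_prior: np.ndarray,      # (dim,) pooled feature of regions scored normal
    anomalous_prior: np.ndarray,   # (dim,) pooled feature of regions scored anomalous
    alpha: float = 0.1,            # mixing weight (assumed hyperparameter)
):
    """Hypothetical sketch: add each region prior as a dynamic bias to the
    shared prompt tokens, yielding image-conditioned normal/anomaly prompts."""
    normal_prompt = learnable_tokens + alpha * normal_prior      # broadcast over tokens
    anomaly_prompt = learnable_tokens + alpha * anomalous_prior
    return normal_prompt, anomaly_prompt
```

Because the priors vary per test image while the context tokens are shared, the text prompts adapt to each image without retraining, which is the claimed route to better generalization on unseen categories.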

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MRAD framework with memory-driven retrieval paradigm

The authors introduce MRAD, a framework that constructs a two-level memory bank (image-level and pixel-level) from auxiliary data and performs anomaly detection through direct similarity retrieval rather than parametric model fitting. This approach stores feature-label pairs explicitly and obtains anomaly scores via retrieval during inference.

Contribution

MRAD-FT variant with lightweight fine-tuning

Building on the train-free base model, the authors propose MRAD-FT which adds only two linear layers to calibrate the retrieval metric. This lightweight fine-tuning improves discriminative ability for both classification and segmentation tasks while maintaining low training cost.

Contribution

MRAD-CLIP variant with region-prior-guided dynamic prompts

The authors develop MRAD-CLIP which enhances traditional prompt learning by injecting normal and anomalous region priors from MRAD-FT into learnable CLIP text prompts as dynamic biases. This approach improves cross-modal alignment, anomaly localization, and generalization to unseen categories compared to conventional dynamic prompt methods.