Vision Hopfield Memory Networks

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Associative Memory, Hopfield Networks, Image Classification
Abstract:

Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding–inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space–based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 20
Refutable papers: 2

Research Landscape Overview

Core task: brain-inspired vision backbone with hierarchical memory mechanisms. The field encompasses diverse approaches to integrating memory and biological principles into visual processing systems. At the broadest level, the taxonomy reveals several major branches: hierarchical memory architectures that organize multi-level storage for vision tasks, neuromorphic hardware implementations that exploit in-memory computing substrates, spiking neural networks that mimic temporal dynamics of biological neurons, attention and recurrent mechanisms that enable iterative refinement, biologically inspired feature extraction methods that mirror cortical organization, memory-augmented learning frameworks for recognition, hierarchical temporal memory (HTM) algorithms, computational models of visual cortex structure, and neuroscience-informed cognitive architectures. Some branches emphasize hardware efficiency and novel computing paradigms (Neuromorphic Visual Resistive[3], In-sensor Image Memorization[4]), while others focus on algorithmic innovations such as spiking dynamics (Hierarchical Spiking Classification[6]) or cortex-inspired hierarchies (Hierarchical Activation Backbone[10]). The interplay between these branches reflects ongoing efforts to balance biological fidelity, computational tractability, and practical performance.

Particularly active lines of work explore multi-memory integration frameworks for embodied agents, where systems must coordinate short-term sensorimotor memory with long-term episodic storage, exemplified by RoboMemory Lifelong[1] and RoboMemory Interactive[2], which address continual learning and interactive scenarios in robotics. Vision Hopfield Memory[0] sits within this cluster, emphasizing associative memory mechanisms that enable robust retrieval and pattern completion in visual backbones.
Compared to the RoboMemory works that target embodied agent workflows, Vision Hopfield Memory[0] focuses more directly on the backbone architecture itself, leveraging Hopfield-style dynamics to create hierarchical memory layers. This contrasts with approaches like Neural Brain Framework[16], which integrates broader cognitive modeling, and with hardware-centric efforts such as Hierarchical Interactive In-memory[5] that prioritize physical substrate design. The central trade-off across these directions involves the granularity of memory organization, the degree of biological inspiration, and the balance between general-purpose learning and task-specific optimization.

Claimed Contributions

Vision Hopfield Memory Network (V-HMN) architecture

The authors introduce V-HMN, a novel vision backbone that replaces conventional self-attention or convolution with hierarchical Hopfield-style associative memory modules. The architecture combines local memory for patch-level pattern completion and global memory for scene-level context, organized in a unified framework with iterative refinement.

8 retrieved papers
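To make the hierarchical design concrete, the sketch below illustrates one plausible reading of the architecture: a single modern-Hopfield (softmax) retrieval step applied first per patch against a local memory, then against a global memory over a pooled scene descriptor. The function name, the inverse temperature `beta`, the pooling choice, and the two-level composition are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hopfield_retrieve(queries, memory, beta=4.0):
    """One modern-Hopfield update: softmax(beta * Q M^T) M.

    queries: (n, d) patterns to complete; memory: (m, d) stored patterns.
    """
    scores = beta * queries @ memory.T            # (n, m) query/memory similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)       # softmax over memory slots
    return attn @ memory                          # retrieved (completed) patterns

# Hypothetical two-level composition: patch-level (local) retrieval,
# then scene-level (global) retrieval over a pooled descriptor.
rng = np.random.default_rng(0)
local_mem = rng.standard_normal((16, 8))          # local associative memory
global_mem = rng.standard_normal((4, 8))          # global/episodic memory

patches = rng.standard_normal((10, 8))            # patch embeddings
local_out = hopfield_retrieve(patches, local_mem)       # patch-level completion
scene = local_out.mean(axis=0, keepdims=True)           # pooled scene descriptor
context = hopfield_retrieve(scene, global_mem)          # contextual modulation
```

With a large `beta`, retrieval sharpens toward the single closest stored pattern, which is what gives the module its pattern-completion behavior.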
Predictive-coding–inspired iterative refinement mechanism

The authors develop a lightweight refinement update rule where representations are gradually corrected toward memory-predicted prototypes through learnable error-correction steps. This mechanism provides an interpretable, brain-inspired alternative to purely feedforward processing.

10 retrieved papers
Can Refute
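The refinement rule described above can be sketched as a small fixed-point loop: at each step, the representation is compared against its memory-predicted prototype and a fraction of the prediction error is fed back. In this sketch the step size `alpha` is a fixed constant and the prototype prediction is a nearest-neighbor lookup; in the paper's description the correction steps are learnable, so both choices are simplifying assumptions.

```python
import numpy as np

def refine(z, prototypes, alpha=0.5, steps=5):
    """Iteratively nudge a representation toward its memory-predicted prototype.

    Each step computes the prediction error e = z_hat - z against the nearest
    stored prototype and applies a fraction alpha of it (error correction).
    """
    errors = []
    for _ in range(steps):
        idx = np.argmin(((prototypes - z) ** 2).sum(axis=1))  # predicted prototype
        e = prototypes[idx] - z                               # prediction error
        errors.append(float(np.linalg.norm(e)))
        z = z + alpha * e                                     # error-correction step
    return z, errors

protos = np.array([[1.0, 0.0], [0.0, 1.0]])   # two stored prototypes
z0 = np.array([0.7, 0.4])                     # noisy representation near prototype 0
z_final, errs = refine(z0, protos)            # error shrinks geometrically
```

Because each step removes a fixed fraction of the remaining error, the representation converges geometrically toward the retrieved prototype, which is the sense in which the mechanism is an iterative, predictive-coding-style correction rather than a single feedforward pass.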
Class-balanced persistent memory banks with content-addressable retrieval

The authors design explicit memory banks that store real sample embeddings in a class-balanced manner during training and remain frozen during inference. These banks enable content-addressable retrieval where stored prototypes act as reusable priors, improving data efficiency and interpretability.

2 retrieved papers
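A minimal sketch of such a bank is shown below: per-class slots capped at a fixed size (class balance), writes disabled after freezing, and retrieval by cosine similarity over all stored embeddings (content addressing). The class name, cap parameter, and similarity choice are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from collections import defaultdict

class MemoryBank:
    """Class-balanced bank of sample embeddings with cosine-similarity retrieval.

    per_class caps storage so no class dominates; the bank is filled during
    training and then frozen (read-only) at inference.
    """
    def __init__(self, per_class=4):
        self.per_class = per_class
        self.slots = defaultdict(list)    # class label -> list of embeddings
        self.frozen = False

    def write(self, label, emb):
        if self.frozen or len(self.slots[label]) >= self.per_class:
            return                        # enforce class balance / freeze
        self.slots[label].append(np.asarray(emb, dtype=float))

    def retrieve(self, query, k=1):
        """Content-addressable lookup: labels of top-k entries by cosine sim."""
        query = np.asarray(query, dtype=float)
        items = [(lbl, e) for lbl, es in self.slots.items() for e in es]
        sims = [
            float(query @ e / (np.linalg.norm(query) * np.linalg.norm(e) + 1e-8))
            for _, e in items
        ]
        order = np.argsort(sims)[::-1][:k]
        return [items[i][0] for i in order]

bank = MemoryBank(per_class=2)
for lbl, e in [(0, [1, 0]), (0, [0.9, 0.1]), (0, [1, 1]), (1, [0, 1])]:
    bank.write(lbl, e)                    # third class-0 write is dropped (cap)
bank.frozen = True                        # read-only at inference
```

The frozen, real-embedding design is what makes retrieval interpretable: every lookup can be traced back to the specific stored samples that matched the query.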

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution


Vision Hopfield Memory Networks | Novelty Validation