Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Neuroscience, Functional Magnetic Resonance Imaging, Image reconstruction, Reconstruction
Abstract:

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli (essentially images) from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space, and then using a pre-trained generative model to reconstruct images from that space. Reconstruction quality therefore depends on how closely the latent space matches the structure of neural activity and how well the generative model produces images from it. Yet it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively.

We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text–image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute–relationship search module that automatically identifies key attributes and relationships that best align with the neural activity.

Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes PRISM, which projects fMRI signals into a structured text space to reconstruct visual stimuli. It sits within the Language-Centric Semantic Decoding leaf of the taxonomy, which contains six papers including the original work. This leaf is part of the broader Semantic and Multimodal Integration Approaches branch, indicating a moderately populated research direction. The approach differs from sibling methods by emphasizing structured compositional text representations rather than direct caption generation or unstructured semantic embeddings, positioning it at the intersection of language-based decoding and compositional modeling.

The taxonomy reveals that Language-Centric Semantic Decoding is one of three subtopics under Semantic and Multimodal Integration, alongside Vision-Language Model Integration (three papers) and Compositional and Attribute-Based Reconstruction (two papers). The paper's emphasis on structured text space connects it to both neighboring directions: it shares the language-centric philosophy with sibling papers like Seeing Through Brain and Brain Captioning, while its compositional modeling approach aligns with the attribute-based reconstruction leaf. This positioning suggests the work bridges two related but distinct research threads within the semantic integration paradigm.

Of the thirty candidates examined (ten per contribution), one refutable candidate was found for the PRISM framework contribution, while the other two contributions (text space alignment and compositional modeling) show no clear refutations. This limited refutation suggests that, while the overall framework may overlap with prior work, the specific claims about text-space superiority and compositional structure are not directly challenged within the examined literature. However, the modest search scope means these findings reflect only the top thirty semantic matches, not exhaustive coverage of the field's approximately fifty taxonomy papers.

Based on the limited search scope, the work appears to occupy a recognizable but not heavily saturated niche within language-centric fMRI decoding. The single refutable pair among thirty candidates suggests moderate novelty, though the analysis cannot rule out additional overlaps beyond the top-ranked semantic matches. The positioning between language-based and compositional approaches may offer differentiation, but definitive assessment would require broader literature coverage beyond the examined subset.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: reconstructing visual stimuli from fMRI brain signals. The field has evolved into several major branches that reflect different strategic emphases. Generative Model Architectures focus on leveraging modern diffusion and GAN-based frameworks to produce high-fidelity images directly from neural data, as seen in works like MindLDM[1] and Brain Diffuser[24]. Semantic and Multimodal Integration Approaches emphasize bridging brain signals with language or conceptual representations, often using pretrained vision-language models to guide reconstruction through semantic priors. Neural Encoding and Representation Learning investigates how best to map fMRI voxels to latent feature spaces that capture hierarchical visual information. Cross-Subject and Few-Shot Generalization tackles the challenge of training models that work across individuals or with limited data, while Dynamic and Temporal Visual Reconstruction extends methods to video or time-varying stimuli. Specialized Decoding Tasks explore applications such as emotion recognition or retinal imaging, and Methodological Advances refine optimization strategies and model architectures. Comprehensive Analyses and Benchmarking provide systematic evaluations across datasets and methods.

Within the Semantic and Multimodal Integration branch, a particularly active line of work explores language-centric decoding, where textual descriptions or captions serve as intermediate representations to constrain and enrich image generation. Seeing Through Brain[0] exemplifies this direction by integrating linguistic semantic information to improve reconstruction fidelity and interpretability. Nearby efforts such as Image Captioning[8] and Brain Captioning[49] similarly leverage natural language as a bridge between neural activity and visual content, while Visual Semantic Reconstructions[10] and MindSemantix[19] explore how semantic embeddings can guide generative models.
Compared to purely image-driven approaches like BigBiGAN Reconstruction[5] or Multimodal Brain Visual[3], these language-centric methods trade some direct pixel-level control for enhanced semantic coherence and cross-modal alignment, reflecting an ongoing tension between low-level perceptual accuracy and high-level conceptual fidelity.

Claimed Contributions

fMRI signals align better with language model text space than vision-based spaces

The authors demonstrate through empirical analysis using CKA, CCA, and Generalization Gap metrics that fMRI signals exhibit stronger alignment with pure text embeddings from language models compared to vision model representations or joint vision-language spaces, challenging the assumption that vision-based representations are essential for visual stimuli reconstruction.
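CKA, CCA, and the Generalization Gap are standard representational-similarity tools. As a minimal sketch of how such an alignment comparison might be run, the following computes linear CKA between an fMRI response matrix and two candidate embedding spaces; all data here is synthetic, and the paper's actual features and preprocessing are not specified in this report:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices whose rows correspond to the same stimuli."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2   # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")       # self-similarity of X
    norm_y = np.linalg.norm(Y.T @ Y, "fro")       # self-similarity of Y
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
fmri = rng.standard_normal((100, 500))               # 100 stimuli x 500 voxels (toy)
text_emb = fmri @ rng.standard_normal((500, 64))     # embedding correlated with fMRI
vision_emb = rng.standard_normal((100, 64))          # statistically unrelated embedding

print(linear_cka(fmri, text_emb))    # high: shared structure
print(linear_cka(fmri, vision_emb))  # low: no shared structure
```

A CCA-based score or the Generalization Gap (the train/test error difference of a probe mapping fMRI to each embedding space) would be computed analogously; in all cases, a higher similarity for the text space is what would support the authors' claim.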

10 retrieved papers
PRISM framework for fMRI-to-image reconstruction via structured text space

The authors introduce PRISM, a novel framework that maps fMRI signals into a structured text space capturing compositional visual information (objects, attributes, and relationships). The framework includes an object-centric diffusion module for compositional image generation and an attribute-relationship search module that automatically identifies brain-aligned attributes and relationships using a vision-language model.
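The "structured text space" described above decomposes a scene into objects, attributes, and relationships rather than a single holistic caption. A hypothetical sketch of such a representation follows; the class and field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str

@dataclass
class StructuredCaption:
    objects: list
    relations: list

    def to_prompt(self) -> str:
        # Flatten the structured representation into a caption that a
        # text-to-image diffusion model could consume.
        parts = [" ".join(o.attributes + [o.name]) for o in self.objects]
        rels = [f"{r.subject} {r.predicate} {r.obj}" for r in self.relations]
        return "; ".join(parts + rels)

scene = StructuredCaption(
    objects=[SceneObject("dog", ["brown", "small"]),
             SceneObject("sofa", ["red"])],
    relations=[Relation("dog", "sitting on", "sofa")],
)
print(scene.to_prompt())  # brown small dog; red sofa; dog sitting on sofa
```

Keeping objects and relations as separate fields is what would let an object-centric generator compose the image one object at a time, instead of relying on the diffusion model to parse a flat caption.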

10 retrieved papers
Can Refute
Structured text representations improve reconstruction through compositional modeling

The authors establish that adapting both the text space and generative model to explicitly represent the compositional structure of visual perception—distinguishing objects, their attributes, and inter-object relationships—leads to improved reconstruction quality compared to unified holistic representations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

fMRI signals align better with language model text space than vision-based spaces


Contribution

PRISM framework for fMRI-to-image reconstruction via structured text space


Contribution

Structured text representations improve reconstruction through compositional modeling
