Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Neuroscience, Functional Magnetic Resonance Imaging, Image reconstruction, Reconstruction
Abstract:

Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli (essentially images) from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space, and then using a pre-trained generative model to reconstruct images from that space. Reconstruction quality therefore depends on how closely the latent space matches the structure of neural activity and how well the generative model produces images from it. Yet it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively.

We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text–image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute–relationship search module that automatically identifies key attributes and relationships that best align with the neural activity.

Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to an 8% reduction in perceptual loss. These results highlight the importance of using structured text as the intermediate space to bridge fMRI signals and image reconstruction.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes PRISM, which projects fMRI signals into a structured text space to reconstruct visual stimuli. It sits within the Language-Centric Semantic Decoding leaf of the taxonomy, which contains six papers including the original work. This leaf is part of the broader Semantic and Multimodal Integration Approaches branch, indicating a moderately populated research direction. The approach differs from sibling methods by emphasizing structured compositional text representations rather than direct caption generation or unstructured semantic embeddings, positioning it at the intersection of language-based decoding and compositional modeling.

The taxonomy reveals that Language-Centric Semantic Decoding is one of three subtopics under Semantic and Multimodal Integration, alongside Vision-Language Model Integration (three papers) and Compositional and Attribute-Based Reconstruction (two papers). The paper's emphasis on structured text space connects it to both neighboring directions: it shares the language-centric philosophy with sibling papers like Seeing Through Brain and Brain Captioning, while its compositional modeling approach aligns with the attribute-based reconstruction leaf. This positioning suggests the work bridges two related but distinct research threads within the semantic integration paradigm.

Of the thirty candidates examined (ten per contribution), one refutable candidate was found for the PRISM framework contribution, while the other two contributions (text space alignment and compositional modeling) show no clear refutations. This limited refutation suggests that, while the overall framework may overlap with prior work, the specific claims about text-space superiority and compositional structure are not directly challenged within the examined literature. However, the modest search scope means these findings reflect only the top thirty semantic matches, not exhaustive coverage of the field's approximately fifty taxonomy papers.

Based on the limited search scope, the work appears to occupy a recognizable but not heavily saturated niche within language-centric fMRI decoding. The single refutable pair among thirty candidates suggests moderate novelty, though the analysis cannot rule out additional overlaps beyond the top-ranked semantic matches. The positioning between language-based and compositional approaches may offer differentiation, but definitive assessment would require broader literature coverage beyond the examined subset.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: reconstructing visual stimuli from fMRI brain signals. The field has evolved into several major branches that reflect different strategic emphases. Generative Model Architectures focus on leveraging modern diffusion and GAN-based frameworks to produce high-fidelity images directly from neural data, as seen in works like MindLDM[1] and Brain Diffuser[24]. Semantic and Multimodal Integration Approaches emphasize bridging brain signals with language or conceptual representations, often using pretrained vision-language models to guide reconstruction through semantic priors. Neural Encoding and Representation Learning investigates how best to map fMRI voxels to latent feature spaces that capture hierarchical visual information. Cross-Subject and Few-Shot Generalization tackles the challenge of training models that work across individuals or with limited data, while Dynamic and Temporal Visual Reconstruction extends methods to video or time-varying stimuli. Specialized Decoding Tasks explore applications such as emotion recognition or retinal imaging, and Methodological Advances refine optimization strategies and model architectures. Comprehensive Analyses and Benchmarking provide systematic evaluations across datasets and methods.

Within the Semantic and Multimodal Integration branch, a particularly active line of work explores language-centric decoding, where textual descriptions or captions serve as intermediate representations to constrain and enrich image generation. Seeing Through Brain[0] exemplifies this direction by integrating linguistic semantic information to improve reconstruction fidelity and interpretability. Nearby efforts such as Image Captioning[8] and Brain Captioning[49] similarly leverage natural language as a bridge between neural activity and visual content, while Visual Semantic Reconstructions[10] and MindSemantix[19] explore how semantic embeddings can guide generative models.
Compared to purely image-driven approaches like BigBiGAN Reconstruction[5] or Multimodal Brain Visual[3], these language-centric methods trade some direct pixel-level control for enhanced semantic coherence and cross-modal alignment, reflecting an ongoing tension between low-level perceptual accuracy and high-level conceptual fidelity.

Claimed Contributions

fMRI signals align better with language model text space than vision-based spaces

The authors demonstrate through empirical analysis using CKA, CCA, and Generalization Gap metrics that fMRI signals exhibit stronger alignment with pure text embeddings from language models compared to vision model representations or joint vision-language spaces, challenging the assumption that vision-based representations are essential for visual stimuli reconstruction.
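CKA, CCA, and the Generalization Gap are standard representational-similarity tools. As a minimal sketch of how such an alignment comparison might be run, the following computes linear CKA between an fMRI response matrix and two candidate embedding spaces; all data here is synthetic, and the paper's actual features and preprocessing are not specified in this report:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices whose rows correspond to the same stimuli."""
    X = X - X.mean(axis=0)  # center each feature dimension
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2   # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")       # self-similarity of X
    norm_y = np.linalg.norm(Y.T @ Y, "fro")       # self-similarity of Y
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
fmri = rng.standard_normal((100, 500))               # 100 stimuli x 500 voxels (toy)
text_emb = fmri @ rng.standard_normal((500, 64))     # embedding correlated with fMRI
vision_emb = rng.standard_normal((100, 64))          # statistically unrelated embedding

print(linear_cka(fmri, text_emb))    # high: shared structure
print(linear_cka(fmri, vision_emb))  # low: no shared structure
```

A CCA-based score or the Generalization Gap (the train/test error difference of a probe mapping fMRI to each embedding space) would be computed analogously; in all cases, a higher similarity for the text space is what would support the authors' claim.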

10 retrieved papers
PRISM framework for fMRI-to-image reconstruction via structured text space

The authors introduce PRISM, a novel framework that maps fMRI signals into a structured text space capturing compositional visual information (objects, attributes, and relationships). The framework includes an object-centric diffusion module for compositional image generation and an attribute-relationship search module that automatically identifies brain-aligned attributes and relationships using a vision-language model.
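The "structured text space" described above decomposes a scene into objects, attributes, and relationships rather than a single holistic caption. A hypothetical sketch of such a representation follows; the class and field names are illustrative, not taken from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: list = field(default_factory=list)

@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str

@dataclass
class StructuredCaption:
    objects: list
    relations: list

    def to_prompt(self) -> str:
        # Flatten the structured representation into a caption that a
        # text-to-image diffusion model could consume.
        parts = [" ".join(o.attributes + [o.name]) for o in self.objects]
        rels = [f"{r.subject} {r.predicate} {r.obj}" for r in self.relations]
        return "; ".join(parts + rels)

scene = StructuredCaption(
    objects=[SceneObject("dog", ["brown", "small"]),
             SceneObject("sofa", ["red"])],
    relations=[Relation("dog", "sitting on", "sofa")],
)
print(scene.to_prompt())  # brown small dog; red sofa; dog sitting on sofa
```

Keeping objects and relations as separate fields is what would let an object-centric generator compose the image one object at a time, instead of relying on the diffusion model to parse a flat caption.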

10 retrieved papers
Can Refute
Structured text representations improve reconstruction through compositional modeling

The authors establish that adapting both the text space and generative model to explicitly represent the compositional structure of visual perception—distinguishing objects, their attributes, and inter-object relationships—leads to improved reconstruction quality compared to unified holistic representations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

fMRI signals align better with language model text space than vision-based spaces


Contribution

PRISM framework for fMRI-to-image reconstruction via structured text space


Contribution

Structured text representations improve reconstruction through compositional modeling
