Generating metamers of human scene understanding

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: human scene understanding, generative modeling
Abstract:

Human vision combines low-resolution “gist” information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers of what humans understand after viewing a scene. Generating images from both high- and low-resolution (i.e., “foveated”) inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen-generated images with latent human scene representations, we conducted a same-different behavioral experiment in which participants judged whether a generated image and the original were the same or different. Using this paradigm, we identify scene generations that are indeed metamers of the latent scene representations formed by the viewers. MetamerGen is thus a powerful tool for studying human scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While the model can generate metamers even when conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers’ own fixated regions.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

MetamerGen introduces a latent diffusion model that generates scene metamers by combining peripheral gist information with high-resolution fixation data. The paper resides in the Early Visual System Pooling Models leaf, which contains ten papers focused on eccentricity-scaled feature averaging to simulate early visual processing. This is the most populated leaf in the Computational Models branch, indicating a crowded research direction where foundational pooling approaches have been extensively explored. The work extends classical pooling theory to complex natural scenes using modern generative architectures.

The taxonomy reveals neighboring leaves addressing ventral stream modeling (two papers on mid-to-high level features), texture statistics methods (two papers on windowed feature distributions), and neural network approaches (four papers on deep learning-based metamer generation). MetamerGen bridges early visual pooling with deep learning methods, sitting at the boundary between hand-crafted feature models and learned representations. The dual-stream DINOv2 conditioning mechanism connects to the Neural Network and Deep Learning Approaches leaf, which explores encoder-decoder and diffusion architectures, though those works typically do not emphasize foveated input structures or scene-level gist integration.

Among the eleven candidate papers examined, five were compared against the core MetamerGen contribution, and one of those could refute it, suggesting some overlap with prior diffusion-based metamer generation. The dual-stream DINOv2 conditioning was compared against two candidates with no clear refutation, indicating potential novelty in the specific architectural fusion of peripheral and foveated features. The behavioral-paradigm contribution was compared against four candidates without refutation, though this may reflect the limited search scope rather than definitive novelty. The analysis covers top-K semantic matches and does not constitute an exhaustive literature review.

Based on the limited search scope, MetamerGen appears to offer architectural innovations in fusing foveated and peripheral representations within diffusion models, though the core idea of generating metamers via learned features has precedent. The work's positioning in a crowded leaf suggests incremental refinement rather than paradigm shift, but the dual-stream conditioning and scene-level synthesis may represent meaningful extensions. A broader literature search would be needed to assess whether similar foveated diffusion architectures exist outside the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 11
Refutable Paper: 1

Research Landscape Overview

Core task: generating image metamers from foveated and peripheral visual information. The field centers on creating images that appear perceptually identical to originals despite differing physically, exploiting the human visual system's spatially varying sensitivity.

The taxonomy reveals four main branches. Computational Models of Peripheral Vision and Metamer Generation focuses on algorithmic approaches to pooling and texture synthesis that mimic early visual processing, with works like Ventral Stream Metamers[2] and Foveated Metamers[1] establishing foundational pooling models. Rendering Applications and Display Systems translates these models into practical technologies for virtual reality and holographic displays, as seen in Metameric Varifocal Holograms[4] and Real-time Ventral Metamers[3]. Perceptual Validation and Psychophysical Studies empirically tests whether generated metamers truly match human perception through controlled experiments. Biological and Neural Foundations grounds the computational work in neuroscience, examining how retinal and cortical mechanisms constrain metamer generation.

A particularly active line explores trade-offs between computational efficiency and perceptual fidelity, with Real-time Ventral Metamers[3] and Gaze-Centric Metamer Computation[13] pushing toward interactive applications while maintaining perceptual accuracy. Another thread investigates the biological plausibility of pooling models, questioning whether texture statistics alone suffice or whether deeper neural constraints matter, as in Biological Plausibility Metamers[10].

Generating Scene Metamers[0] sits within the Early Visual System Pooling Models cluster, closely aligned with foundational works like Foveated Metamers[1] and Early Visual Metamers[35] that emphasize texture-statistic pooling.
Compared to Foveated Metamers Response[11] and eLife Foveated Metamers[21], which focus on refining perceptual validation, Generating Scene Metamers[0] appears more oriented toward extending computational synthesis methods to complex natural scenes, bridging classical pooling theory with modern scene generation challenges.

Claimed Contributions

MetamerGen: a latent diffusion model for generating scene metamers

The authors propose MetamerGen, a latent diffusion model that combines peripheral gist information with fixation-based foveal information to generate image metamers aligned with human scene understanding. The model uses a dual-stream representation of foveated scenes with DINOv2 tokens to fuse detailed features from fixated areas with peripherally degraded features.

5 retrieved papers
Can Refute
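The paper's exact feature pipeline is not reproduced in this report, but the dual-stream idea can be illustrated with a minimal numpy sketch: one token stream carries full-resolution patches around the fixation point, the other carries uniformly blurred patches over the whole scene as a gist signal. The box blur, 14-pixel patch size, and circular foveal window are illustrative assumptions, and raw pixel patches stand in for DINOv2 embeddings.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_blur(img, k=9):
    """Crude stand-in for peripheral degradation; a real model might
    apply eccentricity-dependent Gaussian blur instead."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    return sliding_window_view(padded, (k, k)).mean(axis=(-2, -1))

def patch_tokens(img, p=14):
    """Flatten non-overlapping p x p patches into vectors -- a stand-in
    for DINOv2 patch tokens (DINOv2 also uses 14 x 14 pixel patches)."""
    H, W = img.shape
    t = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return t.reshape(-1, p * p)

def dual_stream_tokens(img, fixation, radius, p=14):
    """Foveal stream: full-resolution tokens whose patch centres lie
    within `radius` of the fixation. Peripheral stream: blurred tokens
    covering the whole scene (gist)."""
    H, W = img.shape
    foveal = patch_tokens(img, p)
    peripheral = patch_tokens(box_blur(img), p)
    ys, xs = np.meshgrid(np.arange(H // p) * p + p / 2,
                         np.arange(W // p) * p + p / 2, indexing="ij")
    dist = np.hypot(ys - fixation[0], xs - fixation[1]).ravel()
    return foveal[dist < radius], peripheral
```

With multiple fixations, the foveal masks would simply be unioned before indexing; the peripheral stream is fixation-independent by construction.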
Dual-stream DINOv2-based conditioning mechanism for foveated image synthesis

The authors develop a novel conditioning approach that uses separate DINOv2 feature streams for foveal (fixated) and peripheral (blurred) visual information. These streams are integrated through adapter-based cross-attention mechanisms into a pretrained Stable Diffusion model, enabling generation from variable-resolution inputs.

2 retrieved papers
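Adapter-based cross-attention conditioning of a frozen diffusion backbone can be sketched in a few lines of numpy. The decoupled design below, where each stream gets its own key/value projections and both attention outputs are added residually to the backbone's hidden states, is an assumption in the spirit of IP-Adapter-style conditioning, not the paper's verified architecture; `gamma` is a hypothetical mixing weight for the peripheral stream.

```python
import numpy as np

def cross_attention(x, cond, Wq, Wk, Wv):
    """Single-head scaled dot-product cross-attention: latent tokens x
    (N, d) query the conditioning tokens cond (M, d)."""
    q, k, v = x @ Wq, cond @ Wk, cond @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def dual_stream_adapter(x, foveal, peripheral, params, gamma=1.0):
    """Each stream has its own learned projections; both outputs are
    summed into the frozen backbone's hidden states via residuals."""
    out = x + cross_attention(x, foveal, *params["fov"])
    out = out + gamma * cross_attention(x, peripheral, *params["per"])
    return out
```

In a trained model only the adapter projections would receive gradients, which is what lets a pretrained Stable Diffusion backbone accept the new variable-resolution conditioning without being fine-tuned end to end.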
Real-time behavioral paradigm for identifying scene metamers

The authors introduce a gaze-contingent experimental paradigm in which participants view scenes for a variable number of fixations, followed by a brief presentation of either the original or a MetamerGen-generated image. Through same-different judgments, this paradigm identifies which generated scenes are true metamers for human scene understanding.

4 retrieved papers
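A standard way to score such same-different data is signal-detection sensitivity (d′): on trials where the probe is a generated image, a "different" response is a hit; on original-probe trials, it is a false alarm, and a generated scene behaves like a metamer when d′ is near zero. The sketch below uses the simple yes/no approximation with a log-linear correction; the paper's actual analysis is not specified in this report.

```python
from statistics import NormalDist

def dprime(hits, misses, false_alarms, correct_rejections):
    """d' from same-different counts. Hits/misses come from trials
    where the probe really differed (generated image); false alarms /
    correct rejections from original-probe trials. The +0.5 / +1.0
    log-linear correction avoids infinite z-scores at 0% or 100%."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)
```

Chance-level responding (equal hit and false-alarm rates) gives d′ = 0, the metamer criterion; reliably detecting the generated probe drives d′ upward.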

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MetamerGen: a latent diffusion model for generating scene metamers

The authors propose MetamerGen, a latent diffusion model that combines peripheral gist information with fixation-based foveal information to generate image metamers aligned with human scene understanding. The model uses a dual-stream representation of foveated scenes with DINOv2 tokens to fuse detailed features from fixated areas with peripherally degraded features.

Contribution

Dual-stream DINOv2-based conditioning mechanism for foveated image synthesis

The authors develop a novel conditioning approach that uses separate DINOv2 feature streams for foveal (fixated) and peripheral (blurred) visual information. These streams are integrated through adapter-based cross-attention mechanisms into a pretrained Stable Diffusion model, enabling generation from variable-resolution inputs.

Contribution

Real-time behavioral paradigm for identifying scene metamers

The authors introduce a gaze-contingent experimental paradigm in which participants view scenes for a variable number of fixations, followed by a brief presentation of either the original or a MetamerGen-generated image. Through same-different judgments, this paradigm identifies which generated scenes are true metamers for human scene understanding.