MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Image Generation, Test-Time, Latent Reasoning
Abstract:

Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MILR, a test-time method for multimodal image generation that performs joint reasoning over image and text in a unified latent vector space using policy gradient optimization. It resides in the 'Unified Multimodal Latent Reasoning' leaf, which contains five papers total including the original work. This leaf sits within the broader 'Latent-Space Reasoning Mechanisms' branch, indicating a moderately populated research direction focused on continuous latent manipulation rather than explicit intermediate generation or text-only reasoning.

The taxonomy reveals neighboring directions that contextualize MILR's positioning. Adjacent leaves include 'Modality-Specific Latent Reasoning' (two papers performing reasoning within individual modalities before fusion) and 'Emotional and Affective Latent Reasoning' (one paper on emotion-focused latent manipulation). Nearby branches explore 'Visual Thought Generation' (methods creating intermediate visual sketches) and 'Inference-Time Optimization' (test-time refinement strategies like reflection and search). MILR bridges latent reasoning and test-time optimization, distinguishing itself by operating in a unified cross-modal latent space rather than relying on modality-specific representations or intermediate visual artifacts.

Among the thirty candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution (the MILR method) was examined against ten candidates with zero refutable matches; the second (policy gradient optimization) and third (unified framework instantiation) showed the same pattern. This suggests that, within the limited search scope, no prior work directly overlaps with MILR's specific combination of test-time latent reasoning, cross-modal unification, and policy gradient guidance. However, the analysis covers only top-K semantic matches and citation expansion, not an exhaustive literature review.

Based on the limited search scope of thirty candidates, MILR appears to occupy a distinct position combining test-time adaptability with unified multimodal latent reasoning. The absence of refutable candidates among examined papers suggests novelty within the sampled literature, though the analysis does not claim exhaustive coverage of all potentially relevant prior work in latent reasoning or inference-time optimization.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Multimodal image generation via test-time latent reasoning. This emerging field explores how generative models can perform deliberate reasoning steps in latent space during inference to produce higher-quality or more controllable images. The taxonomy reveals several complementary directions: Latent-Space Reasoning Mechanisms investigates how models manipulate internal representations to refine outputs, while Visual Thought Generation and Manipulation focuses on creating intermediate visual or symbolic sketches that guide synthesis. Inference-Time Optimization and Scaling examines computational strategies for iterative refinement, and Chain-of-Thought and Contextual Reasoning adapts language-model-style step-by-step planning to visual domains. Meanwhile, Semantic Guidance and Correction addresses error detection and attribute alignment, Evaluation and Reward Modeling develops metrics to steer generation quality, Cross-Modal Information Transfer and Hierarchical Modeling handles integration across modalities, and Specialized Application Domains targets concrete use cases such as video or personalized content creation.

Representative works like Continuous Thought[6] and ImageGen CoT[5] illustrate how reasoning can be embedded at test time, while Latent Sketchpad[3] demonstrates intermediate visual planning. A particularly active line of work centers on unified multimodal latent reasoning, where models jointly process text and image signals within a shared latent space to enable flexible test-time adjustments. MILR[0] exemplifies this approach by performing reasoning directly in the latent manifold, closely aligning with neighbors such as Monet[13], which also emphasizes cross-modal latent integration, and Dynamic Multimodal Interleaving[18], which explores alternating modalities during inference. In contrast, Reasoning in Dark[19] investigates reasoning without explicit intermediate outputs, highlighting a trade-off between interpretability and efficiency.
Across these branches, open questions persist around scaling inference-time computation (as explored in Inference Time Scaling[8]), balancing semantic correctness with creative flexibility (addressed by Semantic Correction[15] and Customized Reward Models[14]), and generalizing latent reasoning to diverse application domains like video synthesis (Agentic VJ System[20]) or personalized generation (Similar Subject Generation[16]). MILR[0] sits within the core latent reasoning cluster, distinguished by its emphasis on test-time adaptability and multimodal unification, offering a middle ground between purely optimization-driven methods and explicit chain-of-thought pipelines.

Claimed Contributions

Contribution 1: MILR, a test-time latent reasoning method for multimodal image generation

The authors introduce MILR, a test-time optimization method that performs joint image-text reasoning by searching over continuous vector representations of discrete image and text tokens in a shared latent space, rather than reasoning over raw images and text explicitly.

(10 retrieved papers)

Contribution 2: Policy gradient optimization for unified latent-space reasoning

The authors implement latent reasoning using the REINFORCE policy gradient algorithm, where gradients are back-propagated only to intermediate model outputs (latent representations) without modifying model parameters, enabling test-time optimization guided by a reward model.

(10 retrieved papers)

Contribution 3: Instantiation within a unified multimodal understanding and generation framework

The authors build MILR on top of MUG models that support language reasoning before image generation, using the intermediate model outputs as the unified latent space to enable cross-modal reasoning entirely at test time.

(10 retrieved papers)
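The policy-gradient mechanism claimed in the second contribution can be illustrated with a toy sketch. This is not the authors' implementation: it shrinks the latent to a single categorical token and replaces the image-quality critic with a hypothetical scalar reward (`critic`, a stand-in name), but it shows the property the report describes, namely that REINFORCE updates flow only into the latent logits while all model parameters stay fixed.

```python
import numpy as np

# Toy sketch, NOT the authors' code: test-time REINFORCE that updates only
# a latent token distribution. A scalar reward stands in for the paper's
# image-quality critic; one categorical latent stands in for the image/text
# token latents. No model weights exist here, let alone get updated.

rng = np.random.default_rng(0)
VOCAB, STEPS, LR = 8, 300, 0.5
TARGET = 3                       # hypothetical token the stand-in critic prefers

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def critic(token):
    # Stand-in for the reward model guiding the search.
    return 1.0 if token == TARGET else 0.0

z = np.zeros(VOCAB)              # latent logits: the ONLY quantity optimized
baseline = 0.0                   # running reward baseline for variance reduction

for _ in range(STEPS):
    p = softmax(z)
    tok = rng.choice(VOCAB, p=p)           # sample a latent token
    r = critic(tok)                        # score it
    baseline = 0.9 * baseline + 0.1 * r
    grad_log_p = -p
    grad_log_p[tok] += 1.0                 # d/dz log p(tok) for a categorical
    z += LR * (r - baseline) * grad_log_p  # REINFORCE ascent on z only

final = softmax(z)                         # distribution concentrates on TARGET
```

In MILR's setting the same update would be applied to the full sequence of intermediate latent representations, with the critic scoring the decoded image rather than a single token.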

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MILR, a test-time latent reasoning method for multimodal image generation

The authors introduce MILR, a test-time optimization method that performs joint image-text reasoning by searching over continuous vector representations of discrete image and text tokens in a shared latent space, rather than reasoning over raw images and text explicitly.

Contribution 2: Policy gradient optimization for unified latent-space reasoning

The authors implement latent reasoning using the REINFORCE policy gradient algorithm, where gradients are back-propagated only to intermediate model outputs (latent representations) without modifying model parameters, enabling test-time optimization guided by a reward model.

Contribution 3: Instantiation within a unified multimodal understanding and generation framework

The authors build MILR on top of MUG models that support language reasoning before image generation, using the intermediate model outputs as the unified latent space to enable cross-modal reasoning entirely at test time.