MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Image Generation, Test-Time, Latent Reasoning
Abstract:

Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MILR, a test-time method for multimodal image generation that performs joint reasoning over image and text in a unified latent vector space using policy gradient optimization. It resides in the 'Unified Multimodal Latent Reasoning' leaf, which contains five papers total including the original work. This leaf sits within the broader 'Latent-Space Reasoning Mechanisms' branch, indicating a moderately populated research direction focused on continuous latent manipulation rather than explicit intermediate generation or text-only reasoning.

The taxonomy reveals neighboring directions that contextualize MILR's positioning. Adjacent leaves include 'Modality-Specific Latent Reasoning' (two papers performing reasoning within individual modalities before fusion) and 'Emotional and Affective Latent Reasoning' (one paper on emotion-focused latent manipulation). Nearby branches explore 'Visual Thought Generation' (methods creating intermediate visual sketches) and 'Inference-Time Optimization' (test-time refinement strategies like reflection and search). MILR bridges latent reasoning and test-time optimization, distinguishing itself by operating in a unified cross-modal latent space rather than relying on modality-specific representations or intermediate visual artifacts.

Among the thirty candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution (the MILR method) was examined against ten candidates with zero refutable matches; the second (policy gradient optimization) and third (unified framework instantiation) showed the same pattern. This suggests that, within the limited search scope, no prior work directly overlaps with MILR's specific combination of test-time latent reasoning, cross-modal unification, and policy gradient guidance. However, the analysis covers only top-K semantic matches and citation expansion, not an exhaustive literature review.

Based on the limited search scope of thirty candidates, MILR appears to occupy a distinct position combining test-time adaptability with unified multimodal latent reasoning. The absence of refutable candidates among examined papers suggests novelty within the sampled literature, though the analysis does not claim exhaustive coverage of all potentially relevant prior work in latent reasoning or inference-time optimization.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Multimodal image generation via test-time latent reasoning. This emerging field explores how generative models can perform deliberate reasoning steps in latent space during inference to produce higher-quality or more controllable images. The taxonomy reveals several complementary directions: Latent-Space Reasoning Mechanisms investigates how models manipulate internal representations to refine outputs, while Visual Thought Generation and Manipulation focuses on creating intermediate visual or symbolic sketches that guide synthesis. Inference-Time Optimization and Scaling examines computational strategies for iterative refinement, and Chain-of-Thought and Contextual Reasoning adapts language-model-style step-by-step planning to visual domains. Meanwhile, Semantic Guidance and Correction addresses error detection and attribute alignment, Evaluation and Reward Modeling develops metrics to steer generation quality, Cross-Modal Information Transfer and Hierarchical Modeling handles integration across modalities, and Specialized Application Domains targets concrete use cases such as video or personalized content creation.

Representative works like Continuous Thought[6] and ImageGen CoT[5] illustrate how reasoning can be embedded at test time, while Latent Sketchpad[3] demonstrates intermediate visual planning. A particularly active line of work centers on unified multimodal latent reasoning, where models jointly process text and image signals within a shared latent space to enable flexible test-time adjustments. MILR[0] exemplifies this approach by performing reasoning directly in the latent manifold, closely aligning with neighbors such as Monet[13], which also emphasizes cross-modal latent integration, and Dynamic Multimodal Interleaving[18], which explores alternating modalities during inference. In contrast, Reasoning in Dark[19] investigates reasoning without explicit intermediate outputs, highlighting a trade-off between interpretability and efficiency.
Across these branches, open questions persist around scaling inference-time computation (as explored in Inference Time Scaling[8]), balancing semantic correctness with creative flexibility (addressed by Semantic Correction[15] and Customized Reward Models[14]), and generalizing latent reasoning to diverse application domains like video synthesis (Agentic VJ System[20]) or personalized generation (Similar Subject Generation[16]). MILR[0] sits within the core latent reasoning cluster, distinguished by its emphasis on test-time adaptability and multimodal unification, offering a middle ground between purely optimization-driven methods and explicit chain-of-thought pipelines.

Claimed Contributions

Contribution 1: MILR, a test-time latent reasoning method for multimodal image generation

The authors introduce MILR, a test-time optimization method that performs joint image-text reasoning by searching over continuous vector representations of discrete image and text tokens in a shared latent space, rather than reasoning over raw images and text explicitly.

(10 retrieved papers)

Contribution 2: Policy gradient optimization for unified latent-space reasoning

The authors implement latent reasoning using the REINFORCE policy gradient algorithm, where gradients are back-propagated only to intermediate model outputs (latent representations) without modifying model parameters, enabling test-time optimization guided by a reward model.

(10 retrieved papers)

Contribution 3: Instantiation within a unified multimodal understanding and generation framework

The authors build MILR on top of MUG models that support language reasoning before image generation, using the intermediate model outputs as the unified latent space to enable cross-modal reasoning entirely at test time.

(10 retrieved papers)
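The policy-gradient mechanism claimed in the second contribution can be illustrated with a toy sketch. This is not the authors' implementation: it shrinks the latent to a single categorical token and replaces the image-quality critic with a hypothetical scalar reward (`critic`, a stand-in name), but it shows the property the report describes, namely that REINFORCE updates flow only into the latent logits while all model parameters stay fixed.

```python
import numpy as np

# Toy sketch, NOT the authors' code: test-time REINFORCE that updates only
# a latent token distribution. A scalar reward stands in for the paper's
# image-quality critic; one categorical latent stands in for the image/text
# token latents. No model weights exist here, let alone get updated.

rng = np.random.default_rng(0)
VOCAB, STEPS, LR = 8, 300, 0.5
TARGET = 3                       # hypothetical token the stand-in critic prefers

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def critic(token):
    # Stand-in for the reward model guiding the search.
    return 1.0 if token == TARGET else 0.0

z = np.zeros(VOCAB)              # latent logits: the ONLY quantity optimized
baseline = 0.0                   # running reward baseline for variance reduction

for _ in range(STEPS):
    p = softmax(z)
    tok = rng.choice(VOCAB, p=p)           # sample a latent token
    r = critic(tok)                        # score it
    baseline = 0.9 * baseline + 0.1 * r
    grad_log_p = -p
    grad_log_p[tok] += 1.0                 # d/dz log p(tok) for a categorical
    z += LR * (r - baseline) * grad_log_p  # REINFORCE ascent on z only

final = softmax(z)                         # distribution concentrates on TARGET
```

In MILR's setting the same update would be applied to the full sequence of intermediate latent representations, with the critic scoring the decoded image rather than a single token.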

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MILR, a test-time latent reasoning method for multimodal image generation

The authors introduce MILR, a test-time optimization method that performs joint image-text reasoning by searching over continuous vector representations of discrete image and text tokens in a shared latent space, rather than reasoning over raw images and text explicitly.

Contribution 2: Policy gradient optimization for unified latent-space reasoning

The authors implement latent reasoning using the REINFORCE policy gradient algorithm, where gradients are back-propagated only to intermediate model outputs (latent representations) without modifying model parameters, enabling test-time optimization guided by a reward model.

Contribution 3: Instantiation within a unified multimodal understanding and generation framework

The authors build MILR on top of MUG models that support language reasoning before image generation, using the intermediate model outputs as the unified latent space to enable cross-modal reasoning entirely at test time.