ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: omni-modal large language models; training-free guidance decoding; language model reasoning
Abstract

Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). Enhancing the reasoning ability of OLLMs through additional training, however, presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational cost. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; and 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, reaching 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ThinkOmni proposes a training-free framework that enhances omni-modal reasoning by leveraging large reasoning models to guide decoding in omni-modal large language models. The paper resides in the 'Reward-Guided and Process-Supervised Decoding' leaf, which contains only three papers total, including ThinkOmni itself. This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting the specific combination of process supervision and omni-modal reasoning remains an emerging area rather than a saturated one.

The taxonomy tree reveals that ThinkOmni's leaf sits within 'Decoding-Time Guidance Mechanisms,' which encompasses four sibling categories: reward-guided methods, attention-guided approaches, cross-modal contrastive decoding, and uncertainty-guided techniques. Neighboring branches include 'Reasoning Process Enhancement' (focusing on chain-of-thought and stepwise reasoning) and 'Cross-Modal Alignment and Fusion' (addressing architectural integration of modalities). ThinkOmni bridges these directions by applying process-level guidance specifically to omni-modal scenarios, diverging from purely attention-based or contrastive methods that do not incorporate explicit reasoning model supervision.

Among twenty-five candidates examined across three contributions, the framework-level contribution shows two refutable candidates out of ten examined, while the LRM-as-a-Guide strategy found zero refutations among five candidates. The Stepwise Contrastive Scaling module identified two refutable candidates from ten examined. These statistics indicate that within the limited search scope, the LRM-as-a-Guide component appears more distinctive, whereas the overall framework and scaling mechanism encounter some overlapping prior work. The modest candidate pool suggests the analysis captures top semantic matches rather than exhaustive field coverage.

Based on the limited literature search of twenty-five candidates, ThinkOmni demonstrates partial novelty, particularly in its guidance strategy, while operating in a sparsely populated taxonomy leaf. The analysis does not cover the full breadth of multi-modal reasoning literature, focusing instead on top semantic matches and citation-expanded candidates. The framework's positioning at the intersection of process supervision and omni-modal reasoning appears less explored than adjacent attention-based or contrastive decoding methods.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 25
Refutable papers: 4

Research Landscape Overview

Core task: omni-modal reasoning enhancement via guidance decoding. The field centers on improving multi-modal reasoning by steering generation at decode time rather than relying solely on pre-trained representations. The taxonomy reveals several complementary directions: Decoding-Time Guidance Mechanisms explore how to inject external signals, such as reward models or process supervision, directly into token selection; Reasoning Process Enhancement focuses on structured inference steps and chain-of-thought strategies; Cross-Modal Alignment and Fusion addresses the integration of vision, language, and other modalities; Task-Specific Multi-Modal Applications adapt these techniques to domains like visual grounding or medical imaging; In-Context Learning and Demonstration-Based Reasoning leverages few-shot examples; Hallucination Mitigation and Error Correction targets factual consistency; Adversarial and Robustness Analysis examines model vulnerabilities; and Auxiliary Multi-Modal Tasks support reasoning through complementary objectives.

Representative works such as PRM-BAS[1] and Reward-Guided Decoding[4] illustrate how process-level rewards can refine generation, while methods like Cross-Modal Attention[10] and Image Token Attention[5] highlight alignment strategies. A particularly active line of work involves reward-guided and process-supervised decoding, where models use step-by-step feedback to navigate complex reasoning chains. ThinkOmni[0] sits squarely in this branch, emphasizing process supervision to guide omni-modal outputs, much like PRM-BAS[1] and Reward-Guided Decoding[4], which also leverage intermediate rewards to steer generation. In contrast, approaches such as Attention-Guided Ensemble[3] and Visual Contrastive Decoding[24] focus on attention mechanisms and contrastive signals to align modalities, trading explicit reward modeling for implicit cross-modal coherence.
Open questions persist around the scalability of process supervision, the generalization of guidance signals across diverse tasks, and the interplay between decoding-time interventions and pre-training objectives. ThinkOmni[0] contributes to this landscape by integrating process-level guidance with omni-modal inputs, positioning itself alongside recent reward-based methods while addressing the broader challenge of coherent multi-modal reasoning.

Claimed Contributions

ThinkOmni training-free framework for omni-modal reasoning

The authors introduce ThinkOmni, a framework that enables omni-modal reasoning without additional training by leveraging off-the-shelf Large Reasoning Models to guide Omni-modal Large Language Models during decoding, addressing limitations of data scarcity and high computational costs in existing approaches.

10 retrieved papers (status: Can Refute)
LRM-as-a-Guide strategy

The authors propose a strategy that uses Large Reasoning Models as guiding components during decoding to inject advanced reasoning capabilities into Omni-modal Large Language Models through logit-level contrastive mixing, enabling collaboration between perception and reasoning.

5 retrieved papers (status: no refutation found)
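The report characterizes LRM-as-a-Guide only at the level of "logit-level contrastive mixing." As a rough illustration of what such mixing could look like at a single decoding step, the sketch below linearly interpolates the OLLM's (perception) and LRM's (reasoning) next-token logits before picking a token. The function names, the interpolation form, and the weight `alpha` are all hypothetical assumptions, not the paper's actual formulation.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_mix(ollm_logits, lrm_logits, alpha=0.5):
    """One decoding step of hypothetical logit-level guidance:
    blend the OLLM's and LRM's next-token logits so the reasoning
    model can nudge the perception model's token choice."""
    return [(1.0 - alpha) * o + alpha * r
            for o, r in zip(ollm_logits, lrm_logits)]

# Toy 4-token vocabulary: the OLLM prefers token 0, the LRM token 1.
ollm = [2.0, 1.0, 0.5, 0.1]
lrm = [0.5, 2.5, 0.2, 0.0]
mixed = contrastive_mix(ollm, lrm, alpha=0.5)
probs = softmax(mixed)
next_token = probs.index(max(probs))  # greedy pick under mixed scores
```

With this toy blend the LRM's preference wins out (token 1), showing how guidance can override a purely perception-driven choice without retraining either model.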
Stepwise Contrastive Scaling module

The authors develop a module that dynamically adjusts guidance parameters at each decoding step by analyzing real-time model predictions using Jensen-Shannon divergence, automatically balancing perceptual and reasoning signals across different tasks without requiring manual tuning.

10 retrieved papers (status: Can Refute)
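The Stepwise Contrastive Scaling description above fixes only two ingredients: a Jensen-Shannon divergence over the two models' real-time predictions and an automatic adjustment of the guidance strength. One plausible realization, sketched below, damps the guidance weight as the two distributions diverge; the helper names, the damping direction, and `base_alpha` are assumptions for illustration, not the paper's actual rule.

```python
import math

def _normalize(dist, eps=1e-12):
    """Smooth and renormalize a distribution to avoid log(0)."""
    smoothed = [x + eps for x in dist]
    total = sum(smoothed)
    return [x / total for x in smoothed]

def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two next-token
    distributions (symmetric, bounded above by ln 2)."""
    p, q = _normalize(p), _normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def stepwise_scale(p_ollm, p_lrm, base_alpha=0.5):
    """Hypothetical per-step schedule: when the two models broadly
    agree (small JSD), apply guidance at full strength; when they
    disagree sharply, damp it so reasoning guidance cannot fully
    override the perception model."""
    jsd = js_divergence(p_ollm, p_lrm)
    return base_alpha * (1.0 - jsd / math.log(2))

# Identical predictions keep the base guidance weight,
agree = stepwise_scale([0.5, 0.5], [0.5, 0.5])      # -> 0.5
# while maximal disagreement drives it toward zero.
disagree = stepwise_scale([1.0, 0.0], [0.0, 1.0])   # -> ~0.0
```

Because JSD is bounded by ln 2, dividing by that constant yields a scale in [0, 1], which is what lets the weight adapt per step with no manual hyperparameter tuning.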

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions compared in this section:

Contribution 1: ThinkOmni training-free framework for omni-modal reasoning
Contribution 2: LRM-as-a-Guide strategy
Contribution 3: Stepwise Contrastive Scaling module

Full descriptions of each contribution appear under Claimed Contributions above.