ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: omni-modal large language models; training-free guidance decoding; language model reasoning
Abstract

Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLMs) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRMs). Enhancing the reasoning ability of OLLMs through additional training, however, presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational cost. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; and 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, reaching 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ThinkOmni proposes a training-free framework that enhances omni-modal reasoning by leveraging large reasoning models to guide decoding in omni-modal large language models. The paper resides in the 'Reward-Guided and Process-Supervised Decoding' leaf, which contains only three papers total, including ThinkOmni itself. This represents a relatively sparse research direction within the broader taxonomy of fifty papers across thirty-six topics, suggesting the specific combination of process supervision and omni-modal reasoning remains an emerging area rather than a saturated one.

The taxonomy tree reveals that ThinkOmni's leaf sits within 'Decoding-Time Guidance Mechanisms,' which encompasses four sibling categories: reward-guided methods, attention-guided approaches, cross-modal contrastive decoding, and uncertainty-guided techniques. Neighboring branches include 'Reasoning Process Enhancement' (focusing on chain-of-thought and stepwise reasoning) and 'Cross-Modal Alignment and Fusion' (addressing architectural integration of modalities). ThinkOmni bridges these directions by applying process-level guidance specifically to omni-modal scenarios, diverging from purely attention-based or contrastive methods that do not incorporate explicit reasoning model supervision.

Among twenty-five candidates examined across three contributions, the framework-level contribution shows two refutable candidates out of ten examined, while the LRM-as-a-Guide strategy found zero refutations among five candidates. The Stepwise Contrastive Scaling module identified two refutable candidates from ten examined. These statistics indicate that within the limited search scope, the LRM-as-a-Guide component appears more distinctive, whereas the overall framework and scaling mechanism encounter some overlapping prior work. The modest candidate pool suggests the analysis captures top semantic matches rather than exhaustive field coverage.

Based on the limited literature search of twenty-five candidates, ThinkOmni demonstrates partial novelty, particularly in its guidance strategy, while operating in a sparsely populated taxonomy leaf. The analysis does not cover the full breadth of multi-modal reasoning literature, focusing instead on top semantic matches and citation-expanded candidates. The framework's positioning at the intersection of process supervision and omni-modal reasoning appears less explored than adjacent attention-based or contrastive decoding methods.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 25
Refutable papers: 4

Research Landscape Overview

Core task: omni-modal reasoning enhancement via guidance decoding. The field centers on improving multi-modal reasoning by steering generation at decode time rather than relying solely on pre-trained representations. The taxonomy reveals several complementary directions: Decoding-Time Guidance Mechanisms explore how to inject external signals, such as reward models or process supervision, directly into token selection; Reasoning Process Enhancement focuses on structured inference steps and chain-of-thought strategies; Cross-Modal Alignment and Fusion addresses the integration of vision, language, and other modalities; Task-Specific Multi-Modal Applications adapt these techniques to domains like visual grounding or medical imaging; In-Context Learning and Demonstration-Based Reasoning leverages few-shot examples; Hallucination Mitigation and Error Correction targets factual consistency; Adversarial and Robustness Analysis examines model vulnerabilities; and Auxiliary Multi-Modal Tasks support reasoning through complementary objectives.

Representative works such as PRM-BAS[1] and Reward-Guided Decoding[4] illustrate how process-level rewards can refine generation, while methods like Cross-Modal Attention[10] and Image Token Attention[5] highlight alignment strategies. A particularly active line of work involves reward-guided and process-supervised decoding, where models use step-by-step feedback to navigate complex reasoning chains. ThinkOmni[0] sits squarely in this branch, emphasizing process supervision to guide omni-modal outputs, much like PRM-BAS[1] and Reward-Guided Decoding[4], which also leverage intermediate rewards to steer generation. In contrast, approaches such as Attention-Guided Ensemble[3] and Visual Contrastive Decoding[24] focus on attention mechanisms and contrastive signals to align modalities, trading explicit reward modeling for implicit cross-modal coherence.
Open questions persist around the scalability of process supervision, the generalization of guidance signals across diverse tasks, and the interplay between decoding-time interventions and pre-training objectives. ThinkOmni[0] contributes to this landscape by integrating process-level guidance with omni-modal inputs, positioning itself alongside recent reward-based methods while addressing the broader challenge of coherent multi-modal reasoning.

Claimed Contributions

ThinkOmni training-free framework for omni-modal reasoning

The authors introduce ThinkOmni, a framework that enables omni-modal reasoning without additional training by leveraging off-the-shelf Large Reasoning Models to guide Omni-modal Large Language Models during decoding, addressing limitations of data scarcity and high computational costs in existing approaches.

10 retrieved papers (status: Can Refute)
LRM-as-a-Guide strategy

The authors propose a strategy that uses Large Reasoning Models as guiding components during decoding to inject advanced reasoning capabilities into Omni-modal Large Language Models through logit-level contrastive mixing, enabling collaboration between perception and reasoning.

5 retrieved papers (status: no refutation found)
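The report characterizes LRM-as-a-Guide only at the level of "logit-level contrastive mixing." As a rough illustration of what such mixing could look like at a single decoding step, the sketch below linearly interpolates the OLLM's (perception) and LRM's (reasoning) next-token logits before picking a token. The function names, the interpolation form, and the weight `alpha` are all hypothetical assumptions, not the paper's actual formulation.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_mix(ollm_logits, lrm_logits, alpha=0.5):
    """One decoding step of hypothetical logit-level guidance:
    blend the OLLM's and LRM's next-token logits so the reasoning
    model can nudge the perception model's token choice."""
    return [(1.0 - alpha) * o + alpha * r
            for o, r in zip(ollm_logits, lrm_logits)]

# Toy 4-token vocabulary: the OLLM prefers token 0, the LRM token 1.
ollm = [2.0, 1.0, 0.5, 0.1]
lrm = [0.5, 2.5, 0.2, 0.0]
mixed = contrastive_mix(ollm, lrm, alpha=0.5)
probs = softmax(mixed)
next_token = probs.index(max(probs))  # greedy pick under mixed scores
```

With this toy blend the LRM's preference wins out (token 1), showing how guidance can override a purely perception-driven choice without retraining either model.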
Stepwise Contrastive Scaling module

The authors develop a module that dynamically adjusts guidance parameters at each decoding step by analyzing real-time model predictions using Jensen-Shannon divergence, automatically balancing perceptual and reasoning signals across different tasks without requiring manual tuning.

10 retrieved papers (status: Can Refute)
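The Stepwise Contrastive Scaling description above fixes only two ingredients: a Jensen-Shannon divergence over the two models' real-time predictions and an automatic adjustment of the guidance strength. One plausible realization, sketched below, damps the guidance weight as the two distributions diverge; the helper names, the damping direction, and `base_alpha` are assumptions for illustration, not the paper's actual rule.

```python
import math

def _normalize(dist, eps=1e-12):
    """Smooth and renormalize a distribution to avoid log(0)."""
    smoothed = [x + eps for x in dist]
    total = sum(smoothed)
    return [x / total for x in smoothed]

def _kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two next-token
    distributions (symmetric, bounded above by ln 2)."""
    p, q = _normalize(p), _normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def stepwise_scale(p_ollm, p_lrm, base_alpha=0.5):
    """Hypothetical per-step schedule: when the two models broadly
    agree (small JSD), apply guidance at full strength; when they
    disagree sharply, damp it so reasoning guidance cannot fully
    override the perception model."""
    jsd = js_divergence(p_ollm, p_lrm)
    return base_alpha * (1.0 - jsd / math.log(2))

# Identical predictions keep the base guidance weight,
agree = stepwise_scale([0.5, 0.5], [0.5, 0.5])      # -> 0.5
# while maximal disagreement drives it toward zero.
disagree = stepwise_scale([1.0, 0.0], [0.0, 1.0])   # -> ~0.0
```

Because JSD is bounded by ln 2, dividing by that constant yields a scale in [0, 1], which is what lets the weight adapt per step with no manual hyperparameter tuning.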

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions compared in this section:

Contribution 1: ThinkOmni training-free framework for omni-modal reasoning
Contribution 2: LRM-as-a-Guide strategy
Contribution 3: Stepwise Contrastive Scaling module

Full descriptions of each contribution appear under Claimed Contributions above.