SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Efficient Reasoning, Large Multimodal Models
Abstract:

Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead in the form of higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning in large multimodal models. Our method consists of three stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train a SketchJudge reward model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking reinforcement learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks shows that SketchThinker-R1 achieves over a 64% reduction in reasoning token cost without compromising final-answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
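The three-stage recipe summarized above can be illustrated with a toy sketch. Everything below (the function names, the `[key]` tagging convention, the scoring weights) is a hypothetical stand-in for intuition only, not the authors' implementation:

```python
# Toy illustration of the three-stage recipe: (1) compress verbose traces
# into sketches, (2) judge traces with a style-aware score, (3) keep the
# highest-rewarded candidate as a stand-in for the RL update.

def to_sketch(trace: list[str]) -> list[str]:
    """Stage 1 (cold start): compress a verbose trace, keeping only the
    steps marked as key cues (here, lines tagged '[key]')."""
    return [step for step in trace if step.startswith("[key]")]

def sketch_judge(trace: list[str], budget: int = 3) -> float:
    """Stage 2 (reward model): score a trace in [0, 1], favoring short,
    cue-focused reasoning over verbose digressions."""
    if not trace:
        return 0.0
    key_ratio = sum(s.startswith("[key]") for s in trace) / len(trace)
    brevity = min(1.0, budget / len(trace))
    return 0.5 * key_ratio + 0.5 * brevity

def rl_step(candidates: list[list[str]]) -> list[str]:
    """Stage 3 (RL): among sampled candidate traces, keep the one the
    judge rewards most (a crude stand-in for a policy-gradient update)."""
    return max(candidates, key=sketch_judge)

verbose = ["restate the question", "[key] read the chart axis",
           "digress about units", "[key] compare the two bars",
           "[key] answer: B"]
sketch = to_sketch(verbose)
print(sketch)                                         # only '[key]' steps survive
print(sketch_judge(sketch) > sketch_judge(verbose))   # True
print(rl_step([verbose, sketch]) == sketch)           # True
```

The point of the sketch is only the shape of the pipeline: compression supervision first, then a style-aware scalar reward, then reward-maximizing selection.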

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SketchThinker-R1, a framework that trains large multimodal models to perform concise, goal-directed reasoning inspired by human cognitive efficiency. It resides in the 'Efficient Reasoning Process Optimization' leaf of the taxonomy, which currently contains only this work. This positioning indicates a relatively sparse research direction explicitly focused on optimizing reasoning efficiency through sketch-style processes, distinguishing it from the more populated branches addressing visual reasoning mechanisms or sketch-based interaction systems.

The taxonomy reveals several neighboring directions. 'Visual Reasoning Enhancement via Sketch-Based Mechanisms' (four papers across three leaves) explores explicit sketch generation for reasoning, while 'Latent Visual Reasoning and Internal Representation Mechanisms' (two papers) examines internal feature-space operations. The paper's emphasis on efficiency bridges these areas: unlike Visual Sketchpad or Interwoven Thinking Drawing, which prioritize interpretability through visible sketches, SketchThinker-R1 aims to reduce token costs while retaining structured reasoning benefits. Its scope_note clarifies it targets 'concise, goal-directed cognitive processes,' excluding methods focused solely on visual mechanisms without efficiency optimization.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core SketchThinker-R1 framework (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct prior work on reinforcement learning for sketch-style reasoning efficiency. However, for the SketchJudge reward model (Contribution B), one of the ten examined candidates was judged refutable, and for the token-reduction claim (Contribution C), two of ten were, indicating that reward modeling for reasoning style and efficiency metrics have more substantial overlapping prior work within the limited search scope.

Based on the thirty candidates examined, the framework appears to occupy a genuinely sparse research direction, though the reward modeling and efficiency evaluation components show clearer precedents. The analysis covers top-K semantic matches and does not constitute an exhaustive literature review. The taxonomy structure confirms that explicit efficiency optimization in multimodal reasoning remains underexplored compared to visual reasoning mechanisms or sketch-based interaction paradigms, though the limited search scope prevents definitive claims about absolute novelty.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient sketch-style reasoning in large multimodal models. The field has evolved around several complementary directions that explore how sketch-like intermediate representations can enhance visual reasoning. Visual Reasoning Enhancement via Sketch-Based Mechanisms investigates explicit drawing or annotation steps to guide model inference, while Latent Visual Reasoning and Internal Representation Mechanisms examines how models can perform similar operations in hidden feature spaces without rendering visible sketches. Efficient Reasoning Process Optimization focuses on reducing computational overhead and streamlining inference pipelines, often by integrating sketch-based planning or iterative refinement. Meanwhile, Global Visual Reasoning and Connectivity Tasks addresses problems requiring holistic scene understanding, Sketch-Based Multimodal Interaction and Input Interfaces explores user-driven sketch inputs, and Sketch Understanding and Representation Learning develops encoders for sketch data. Sketch-Conditioned Generation and Synthesis targets creative applications, Multimodal Applications and Domain-Specific Systems applies these ideas to specialized domains, and General Visual Perception and Reasoning Foundations provides broader architectural and perceptual underpinnings.

Recent work has highlighted trade-offs between explicit intermediate visualizations and purely latent reasoning. Visual Sketchpad[1] and Interwoven Thinking Drawing[2] demonstrate that generating visible sketches can improve interpretability and multi-step planning, yet such approaches may incur rendering costs. In contrast, Latent Sketchpad[6] operates entirely in feature space to avoid these overheads. SketchThinker[0] sits within the Efficient Reasoning Process Optimization branch, emphasizing streamlined inference while retaining sketch-style intermediate steps.
Compared to Visual Planning[3], which focuses on action-oriented reasoning, SketchThinker[0] prioritizes computational efficiency without sacrificing the benefits of structured visual reasoning. Nearby works like ChartSketcher[5] and GeoSketch[7] apply similar principles to domain-specific tasks, illustrating how sketch-based reasoning can be tailored to specialized visual contexts while maintaining efficiency.

Claimed Contributions

SketchThinker-R1 reinforcement learning framework for sketch-style reasoning

The authors introduce a three-stage framework that trains large multimodal models to produce concise, sketch-style reasoning chains. The framework includes sketch-mode cold start, a SketchJudge reward model, and sketch-thinking reinforcement learning to reduce computational overhead while maintaining accuracy.

10 retrieved papers
SketchJudge reward model for evaluating reasoning style

The authors develop a specialized reward model that evaluates reasoning traces and favors concise sketch-style reasoning over verbose explanations. This model provides supervisory signals during reinforcement learning to guide the development of efficient reasoning patterns.

10 retrieved papers
Can Refute
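As a rough illustration of how such a style-aware reward could behave, the following toy function is entirely hypothetical (the paper's SketchJudge is a learned model, not a hand-written formula): it gates on answer correctness and adds a conciseness bonus that decays as the reasoning trace exceeds a token budget.

```python
# Hypothetical style-aware reward: wrong answers earn nothing, so brevity
# can never be gamed at the expense of correctness; correct answers earn
# a base reward plus a conciseness bonus.

def style_reward(answer: str, gold: str, n_reason_tokens: int,
                 target_tokens: int = 120) -> float:
    """Return 0 for a wrong answer; otherwise 1 plus a conciseness bonus
    in (0, 1] that shrinks as the trace exceeds the target budget."""
    if answer != gold:
        return 0.0
    conciseness = target_tokens / max(n_reason_tokens, target_tokens)
    return 1.0 + conciseness

print(style_reward("B", "B", 90))    # within budget: 2.0
print(style_reward("B", "B", 480))   # correct but verbose: 1.25
print(style_reward("A", "B", 90))    # wrong: 0.0
```

The design choice worth noting is the ordering of the two signals: correctness acts as a hard gate, while conciseness only modulates the reward among already-correct traces, which mirrors the report's claim that efficiency is gained "without accuracy loss."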
Over 64% reduction in reasoning token cost without accuracy loss

The authors demonstrate that their framework achieves substantial efficiency gains by reducing the number of tokens required for reasoning by more than 64% across four benchmarks while maintaining or improving answer accuracy compared to standard reasoning models.

10 retrieved papers
Can Refute
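The headline figure is a simple ratio over aggregate token counts. The snippet below shows how such a reduction percentage is computed; the per-benchmark counts are invented for illustration and are not the paper's numbers.

```python
# Illustrative (made-up) per-benchmark reasoning-token counts for a
# verbose baseline versus a sketch-style model on four benchmarks.
baseline = {"bench1": 812, "bench2": 1040, "bench3": 650, "bench4": 930}
sketch   = {"bench1": 260, "bench2": 340,  "bench3": 230, "bench4": 310}

total_base = sum(baseline.values())      # 3432
total_sketch = sum(sketch.values())      # 1140
reduction = 1 - total_sketch / total_base
print(f"{reduction:.1%}")                # prints 66.8%, clearing the 64% bar
```

A ">64% reduction" claim of this kind is therefore a statement about the ratio of reasoning tokens spent, not about wall-clock time, though the abstract notes response time improves as well.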

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
