SketchThinker-R1: Towards Efficient Sketch-Style Reasoning in Large Multimodal Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Efficient Reasoning, Large Multimodal Models
Abstract:

Despite the empirical success of extensive, step-by-step reasoning in large multimodal models, long reasoning processes inevitably incur substantial computational overhead in the form of higher token costs and increased response time, which undermines inference efficiency. In contrast, humans often employ sketch-style reasoning: a concise, goal-directed cognitive process that prioritizes salient information and enables efficient problem solving. Inspired by this cognitive efficiency, we propose SketchThinker-R1, which incentivizes sketch-style reasoning in large multimodal models. Our method consists of three stages. In the Sketch-Mode Cold Start stage, we convert standard long reasoning processes into sketch-style reasoning and fine-tune the base multimodal model, instilling an initial sketch-style reasoning capability. Next, we train a SketchJudge reward model, which explicitly evaluates the model's thinking process and assigns higher scores to sketch-style reasoning. Finally, we conduct Sketch-Thinking reinforcement learning under the supervision of SketchJudge to further generalize the sketch-style reasoning ability. Experimental evaluation on four benchmarks shows that SketchThinker-R1 achieves over a 64% reduction in reasoning token cost without compromising final-answer accuracy. Qualitative analysis further shows that sketch-style reasoning focuses more on key cues during problem solving.
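The three-stage recipe summarized above can be illustrated with a toy sketch. Everything below (the function names, the `[key]` tagging convention, the scoring weights) is a hypothetical stand-in for intuition only, not the authors' implementation:

```python
# Toy illustration of the three-stage recipe: (1) compress verbose traces
# into sketches, (2) judge traces with a style-aware score, (3) keep the
# highest-rewarded candidate as a stand-in for the RL update.

def to_sketch(trace: list[str]) -> list[str]:
    """Stage 1 (cold start): compress a verbose trace, keeping only the
    steps marked as key cues (here, lines tagged '[key]')."""
    return [step for step in trace if step.startswith("[key]")]

def sketch_judge(trace: list[str], budget: int = 3) -> float:
    """Stage 2 (reward model): score a trace in [0, 1], favoring short,
    cue-focused reasoning over verbose digressions."""
    if not trace:
        return 0.0
    key_ratio = sum(s.startswith("[key]") for s in trace) / len(trace)
    brevity = min(1.0, budget / len(trace))
    return 0.5 * key_ratio + 0.5 * brevity

def rl_step(candidates: list[list[str]]) -> list[str]:
    """Stage 3 (RL): among sampled candidate traces, keep the one the
    judge rewards most (a crude stand-in for a policy-gradient update)."""
    return max(candidates, key=sketch_judge)

verbose = ["restate the question", "[key] read the chart axis",
           "digress about units", "[key] compare the two bars",
           "[key] answer: B"]
sketch = to_sketch(verbose)
print(sketch)                                         # only '[key]' steps survive
print(sketch_judge(sketch) > sketch_judge(verbose))   # True
print(rl_step([verbose, sketch]) == sketch)           # True
```

The point of the sketch is only the shape of the pipeline: compression supervision first, then a style-aware scalar reward, then reward-maximizing selection.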

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SketchThinker-R1, a framework that trains large multimodal models to perform concise, goal-directed reasoning inspired by human cognitive efficiency. It resides in the 'Efficient Reasoning Process Optimization' leaf of the taxonomy, which currently contains only this work. This positioning indicates a relatively sparse research direction explicitly focused on optimizing reasoning efficiency through sketch-style processes, distinguishing it from the more populated branches addressing visual reasoning mechanisms or sketch-based interaction systems.

The taxonomy reveals several neighboring directions. 'Visual Reasoning Enhancement via Sketch-Based Mechanisms' (four papers across three leaves) explores explicit sketch generation for reasoning, while 'Latent Visual Reasoning and Internal Representation Mechanisms' (two papers) examines internal feature-space operations. The paper's emphasis on efficiency bridges these areas: unlike Visual Sketchpad or Interwoven Thinking Drawing, which prioritize interpretability through visible sketches, SketchThinker-R1 aims to reduce token costs while retaining structured reasoning benefits. Its scope_note clarifies it targets 'concise, goal-directed cognitive processes,' excluding methods focused solely on visual mechanisms without efficiency optimization.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core SketchThinker-R1 framework (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct prior work on reinforcement learning for sketch-style reasoning efficiency. However, for the SketchJudge reward model (Contribution B), one of the ten examined candidates was judged refutable, and for the token-reduction claim (Contribution C), two of ten were, indicating that reward modeling for reasoning style and efficiency metrics have more substantial overlapping prior work within the limited search scope.

Based on the thirty candidates examined, the framework appears to occupy a genuinely sparse research direction, though the reward modeling and efficiency evaluation components show clearer precedents. The analysis covers top-K semantic matches and does not constitute an exhaustive literature review. The taxonomy structure confirms that explicit efficiency optimization in multimodal reasoning remains underexplored compared to visual reasoning mechanisms or sketch-based interaction paradigms, though the limited search scope prevents definitive claims about absolute novelty.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: efficient sketch-style reasoning in large multimodal models. The field has evolved around several complementary directions that explore how sketch-like intermediate representations can enhance visual reasoning. Visual Reasoning Enhancement via Sketch-Based Mechanisms investigates explicit drawing or annotation steps to guide model inference, while Latent Visual Reasoning and Internal Representation Mechanisms examines how models can perform similar operations in hidden feature spaces without rendering visible sketches. Efficient Reasoning Process Optimization focuses on reducing computational overhead and streamlining inference pipelines, often by integrating sketch-based planning or iterative refinement. Meanwhile, Global Visual Reasoning and Connectivity Tasks addresses problems requiring holistic scene understanding, Sketch-Based Multimodal Interaction and Input Interfaces explores user-driven sketch inputs, and Sketch Understanding and Representation Learning develops encoders for sketch data. Sketch-Conditioned Generation and Synthesis targets creative applications, Multimodal Applications and Domain-Specific Systems applies these ideas to specialized domains, and General Visual Perception and Reasoning Foundations provides broader architectural and perceptual underpinnings.

Recent work has highlighted trade-offs between explicit intermediate visualizations and purely latent reasoning. Visual Sketchpad[1] and Interwoven Thinking Drawing[2] demonstrate that generating visible sketches can improve interpretability and multi-step planning, yet such approaches may incur rendering costs. In contrast, Latent Sketchpad[6] operates entirely in feature space to avoid these overheads. SketchThinker[0] sits within the Efficient Reasoning Process Optimization branch, emphasizing streamlined inference while retaining sketch-style intermediate steps.
Compared to Visual Planning[3], which focuses on action-oriented reasoning, SketchThinker[0] prioritizes computational efficiency without sacrificing the benefits of structured visual reasoning. Nearby works like ChartSketcher[5] and GeoSketch[7] apply similar principles to domain-specific tasks, illustrating how sketch-based reasoning can be tailored to specialized visual contexts while maintaining efficiency.

Claimed Contributions

SketchThinker-R1 reinforcement learning framework for sketch-style reasoning

The authors introduce a three-stage framework that trains large multimodal models to produce concise, sketch-style reasoning chains. The framework includes sketch-mode cold start, a SketchJudge reward model, and sketch-thinking reinforcement learning to reduce computational overhead while maintaining accuracy.

10 retrieved papers
SketchJudge reward model for evaluating reasoning style

The authors develop a specialized reward model that evaluates reasoning traces and favors concise sketch-style reasoning over verbose explanations. This model provides supervisory signals during reinforcement learning to guide the development of efficient reasoning patterns.

10 retrieved papers
Can Refute
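As a rough illustration of how such a style-aware reward could behave, the following toy function is entirely hypothetical (the paper's SketchJudge is a learned model, not a hand-written formula): it gates on answer correctness and adds a conciseness bonus that decays as the reasoning trace exceeds a token budget.

```python
# Hypothetical style-aware reward: wrong answers earn nothing, so brevity
# can never be gamed at the expense of correctness; correct answers earn
# a base reward plus a conciseness bonus.

def style_reward(answer: str, gold: str, n_reason_tokens: int,
                 target_tokens: int = 120) -> float:
    """Return 0 for a wrong answer; otherwise 1 plus a conciseness bonus
    in (0, 1] that shrinks as the trace exceeds the target budget."""
    if answer != gold:
        return 0.0
    conciseness = target_tokens / max(n_reason_tokens, target_tokens)
    return 1.0 + conciseness

print(style_reward("B", "B", 90))    # within budget: 2.0
print(style_reward("B", "B", 480))   # correct but verbose: 1.25
print(style_reward("A", "B", 90))    # wrong: 0.0
```

The design choice worth noting is the ordering of the two signals: correctness acts as a hard gate, while conciseness only modulates the reward among already-correct traces, which mirrors the report's claim that efficiency is gained "without accuracy loss."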
Over 64% reduction in reasoning token cost without accuracy loss

The authors demonstrate that their framework achieves substantial efficiency gains by reducing the number of tokens required for reasoning by more than 64% across four benchmarks while maintaining or improving answer accuracy compared to standard reasoning models.

10 retrieved papers
Can Refute
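The headline figure is a simple ratio over aggregate token counts. The snippet below shows how such a reduction percentage is computed; the per-benchmark counts are invented for illustration and are not the paper's numbers.

```python
# Illustrative (made-up) per-benchmark reasoning-token counts for a
# verbose baseline versus a sketch-style model on four benchmarks.
baseline = {"bench1": 812, "bench2": 1040, "bench3": 650, "bench4": 930}
sketch   = {"bench1": 260, "bench2": 340,  "bench3": 230, "bench4": 310}

total_base = sum(baseline.values())      # 3432
total_sketch = sum(sketch.values())      # 1140
reduction = 1 - total_sketch / total_base
print(f"{reduction:.1%}")                # prints 66.8%, clearing the 64% bar
```

A ">64% reduction" claim of this kind is therefore a statement about the ratio of reasoning tokens spent, not about wall-clock time, though the abstract notes response time improves as well.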

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
