Thyme: Think Beyond Images

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: MLLM, Agentic, Think with images, Coding
Abstract:

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored drawing on visual information during the reasoning process to improve model performance on perception and reasoning tasks. To the best of our knowledge, however, no open-source work yet offers a feature set as rich as that of proprietary models such as OpenAI's o3, which can perform diverse image manipulations while simultaneously strengthening logical reasoning through code.

In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables multimodal large language models to move beyond existing "think with images" approaches by autonomously generating and executing diverse image-processing and computational operations via executable code (Figure 2). This approach not only supports a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows mathematical computation, all while the model retains high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: initial supervised fine-tuning (SFT) on a curated dataset of 500K samples to teach code generation, followed by a reinforcement learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to raise the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code-execution precision. We conduct extensive experimental analysis and ablation studies. As shown in Figure 1, comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly on challenging high-resolution perception and complex reasoning tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Thyme, a paradigm enabling multimodal large language models to autonomously generate and execute code for diverse image manipulations and mathematical computations during reasoning. Within the taxonomy, it resides in the 'Autonomous Image Manipulation via Code Generation' leaf under 'Code-Driven Visual Reasoning and Manipulation'. This leaf contains only two papers total, including one sibling work, indicating a relatively sparse and emerging research direction. The taxonomy shows eleven papers across the entire field, with this particular branch representing a focused subset exploring code-driven visual reasoning.

The taxonomy reveals three main branches: Code-Driven Visual Reasoning, Visual Information Grounding, and Synthetic Data Generation. Thyme's leaf sits within the first branch, which also includes a sibling category on Embodied Agent Code Synthesis for robotic control. Neighboring branches address visual-to-textual conversion, multi-image grounding, and agentic tool integration, all excluding code generation approaches. The scope notes clarify that Thyme's focus on autonomous image manipulation via code distinguishes it from static visual grounding methods and from embodied agent frameworks that target physical control rather than image processing.

Among thirty candidates examined, the core Thyme paradigm (Contribution A) shows no clear refutation across ten candidates reviewed, suggesting relative novelty in its specific formulation of autonomous code-based manipulation. However, the two-stage training strategy (Contribution B) encountered four refutable candidates among ten examined, and the GRPO-ATS algorithm (Contribution C) found three refutable candidates among ten. These statistics indicate that while the overall paradigm appears less explored, the training methodology and reinforcement learning components draw on more established techniques within the limited search scope.

Based on the top-thirty semantic matches examined, Thyme appears to occupy a sparsely populated research direction with limited direct prior work in its specific paradigm. The analysis covers a focused subset of the literature rather than an exhaustive survey, and the taxonomy structure suggests this is an emerging area with room for further exploration. The training and algorithmic contributions show more overlap with existing methods than the core autonomous manipulation framework.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 7

Research Landscape Overview

Core task: multimodal reasoning with autonomous code generation for image manipulation. The field centers on enabling models to understand visual content and produce executable code that performs targeted image transformations. The taxonomy reveals three main branches: Code-Driven Visual Reasoning and Manipulation focuses on systems that generate and execute code to modify images based on high-level instructions; Visual Information Grounding and Interpretation emphasizes extracting structured knowledge from images to inform reasoning; and Synthetic Data Generation for Multimodal Training addresses the creation of large-scale datasets to train these multimodal systems.

Works such as Hycodepolicy[1] and Thinking with Images[2] illustrate how code generation can bridge perception and action, while methods like Multimodal Self-Instruct[5] and InstructFlow[6] demonstrate scalable data-synthesis strategies that support model training across diverse manipulation scenarios.

Within the Code-Driven Visual Reasoning and Manipulation branch, a particularly active line of work explores autonomous image manipulation via code generation, where models must reason about visual content and produce executable scripts for editing. Thyme[0] sits squarely in this cluster, emphasizing end-to-end code synthesis for image transformations. It shares conceptual ground with Migician[3], which also targets code-based editing, and Seeing is Fixing[4], which integrates visual feedback loops into the generation process. Compared to these neighbors, Thyme[0] appears to prioritize fully autonomous workflows that minimize human intervention, whereas Seeing is Fixing[4] leans more heavily on iterative refinement guided by intermediate visual outputs.

A central open question across these efforts is how to balance the expressiveness of generated code with the reliability of execution, especially when handling complex or ambiguous user instructions that require nuanced visual understanding.

Claimed Contributions

Thyme paradigm for autonomous code-based image manipulation and computation

The authors introduce Thyme, a paradigm enabling multimodal large language models to autonomously generate and execute diverse image processing operations and mathematical computations via code. This approach supports operations like cropping, rotation, and contrast enhancement while maintaining high autonomy in deciding when and how to apply these operations.

10 retrieved papers
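The report contains no code listing for this contribution, but a minimal Pillow sketch conveys the kind of snippet such a model might emit and execute on the fly; the image size, crop box, and enhancement factor below are hypothetical, not taken from the paper:

```python
from PIL import Image, ImageEnhance

# Build a small synthetic image so the sketch is self-contained;
# a Thyme-style model would instead operate on the actual input image.
img = Image.new("RGB", (800, 600), color=(120, 120, 120))

# Hypothetical manipulation sequence: zoom into a region, reorient it,
# and boost contrast so fine detail becomes legible before answering.
region = img.crop((100, 200, 600, 500))              # (left, upper, right, lower)
region = region.rotate(90, expand=True)              # 90-degree counter-clockwise turn
region = ImageEnhance.Contrast(region).enhance(1.8)  # 1.8x contrast boost

print(region.size)
```

After the crop the region is 500x300 pixels, and the expanded 90-degree rotation swaps the axes, so the script prints `(300, 500)`.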
Two-stage training strategy with SFT and RL phases

The authors develop a two-stage training approach: supervised fine-tuning on 500K curated samples teaches code generation for image operations and computations, followed by reinforcement learning to refine the model's decision-making. The SFT stage requires only 200 GPU hours to activate these fundamental abilities.

10 retrieved papers
Can Refute
GRPO-ATS algorithm with adaptive temperature sampling

The authors propose GRPO-ATS, a reinforcement learning algorithm that uses adaptive temperature sampling: temperature 0 for code generation to ensure determinism and execution validity, and temperature 1 for natural language reasoning to encourage diverse exploration. This balances creative reasoning with accurate, executable code.

10 retrieved papers
Can Refute
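As a sketch only (the sampler and function names below are illustrative assumptions, not the paper's implementation), the dual-temperature idea amounts to switching the decoding temperature based on whether the decoder is currently emitting code:

```python
import math
import random

def sample_token(logits, temperature):
    """Sample one token index from raw logits; temperature 0 means greedy argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # deterministic
    scaled = [l / temperature for l in logits]    # temperature-scaled logits
    m = max(scaled)                               # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]   # unnormalized softmax weights
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

def pick_temperature(in_code_block):
    """GRPO-ATS-style schedule: deterministic code, exploratory reasoning text."""
    return 0.0 if in_code_block else 1.0
```

In a real rollout loop the `in_code_block` flag would be toggled by the model's own code-fence delimiters, so sampled trajectories stay diverse in their natural-language reasoning while every emitted program remains deterministic and executable.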

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Thyme paradigm for autonomous code-based image manipulation and computation

Contribution

Two-stage training strategy with SFT and RL phases

Contribution

GRPO-ATS algorithm with adaptive temperature sampling
