Thyme: Think Beyond Images

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: MLLM, Agentic, Think with images, Coding
Abstract:

Following OpenAI's introduction of the "thinking with images" concept, recent efforts have explored drawing on visual information during the reasoning process to improve model performance on perception and reasoning tasks. To the best of our knowledge, however, no open-source work yet offers a feature set as rich as that of proprietary models such as OpenAI's o3, which can perform diverse image manipulations while simultaneously strengthening logical reasoning through code.

In this paper, we make a preliminary attempt in this direction by introducing Thyme (Think Beyond Images), a novel paradigm that enables multimodal large language models to move beyond existing "think with images" approaches by autonomously generating and executing diverse image-processing and computational operations via executable code (Figure 2). This approach not only supports a rich, on-the-fly set of image manipulations (e.g., cropping, rotation, contrast enhancement) but also allows mathematical computation, all while the model retains high autonomy in deciding when and how to apply these operations. We activate this capability through a two-stage training strategy: initial supervised fine-tuning (SFT) on a curated dataset of 500K samples to teach code generation, followed by a reinforcement learning (RL) phase to refine decision-making. For the RL stage, we manually collect and design high-resolution question-answer pairs to raise the learning difficulty, and we propose GRPO-ATS (Group Relative Policy Optimization with Adaptive Temperature Sampling), an algorithm that applies distinct temperatures to text and code generation to balance reasoning exploration with code-execution precision. We conduct extensive experimental analysis and ablation studies. As shown in Figure 1, comprehensive evaluations on nearly 20 benchmarks show that Thyme yields significant and consistent performance gains, particularly on challenging high-resolution perception and complex reasoning tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Thyme, a paradigm enabling multimodal large language models to autonomously generate and execute code for diverse image manipulations and mathematical computations during reasoning. Within the taxonomy, it resides in the 'Autonomous Image Manipulation via Code Generation' leaf under 'Code-Driven Visual Reasoning and Manipulation'. This leaf contains only two papers total, including one sibling work, indicating a relatively sparse and emerging research direction. The taxonomy shows eleven papers across the entire field, with this particular branch representing a focused subset exploring code-driven visual reasoning.

The taxonomy reveals three main branches: Code-Driven Visual Reasoning, Visual Information Grounding, and Synthetic Data Generation. Thyme's leaf sits within the first branch, which also includes a sibling category on Embodied Agent Code Synthesis for robotic control. Neighboring branches address visual-to-textual conversion, multi-image grounding, and agentic tool integration, all excluding code generation approaches. The scope notes clarify that Thyme's focus on autonomous image manipulation via code distinguishes it from static visual grounding methods and from embodied agent frameworks that target physical control rather than image processing.

Among thirty candidates examined, the core Thyme paradigm (Contribution A) shows no clear refutation across ten candidates reviewed, suggesting relative novelty in its specific formulation of autonomous code-based manipulation. However, the two-stage training strategy (Contribution B) encountered four refutable candidates among ten examined, and the GRPO-ATS algorithm (Contribution C) found three refutable candidates among ten. These statistics indicate that while the overall paradigm appears less explored, the training methodology and reinforcement learning components draw on more established techniques within the limited search scope.

Based on the top-thirty semantic matches examined, Thyme appears to occupy a sparsely populated research direction with limited direct prior work in its specific paradigm. The analysis covers a focused subset of the literature rather than an exhaustive survey, and the taxonomy structure suggests this is an emerging area with room for further exploration. The training and algorithmic contributions show more overlap with existing methods than the core autonomous manipulation framework.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 7

Research Landscape Overview

Core task: multimodal reasoning with autonomous code generation for image manipulation. The field centers on enabling models to understand visual content and produce executable code that performs targeted image transformations. The taxonomy reveals three main branches: Code-Driven Visual Reasoning and Manipulation focuses on systems that generate and execute code to modify images based on high-level instructions; Visual Information Grounding and Interpretation emphasizes extracting structured knowledge from images to inform reasoning; and Synthetic Data Generation for Multimodal Training addresses the creation of large-scale datasets to train these multimodal systems.

Works such as Hycodepolicy[1] and Thinking with Images[2] illustrate how code generation can bridge perception and action, while methods like Multimodal Self-Instruct[5] and InstructFlow[6] demonstrate scalable data-synthesis strategies that support model training across diverse manipulation scenarios.

Within the Code-Driven Visual Reasoning and Manipulation branch, a particularly active line of work explores autonomous image manipulation via code generation, where models must reason about visual content and produce executable scripts for editing. Thyme[0] sits squarely in this cluster, emphasizing end-to-end code synthesis for image transformations. It shares conceptual ground with Migician[3], which also targets code-based editing, and Seeing is Fixing[4], which integrates visual feedback loops into the generation process. Compared to these neighbors, Thyme[0] appears to prioritize fully autonomous workflows that minimize human intervention, whereas Seeing is Fixing[4] leans more heavily on iterative refinement guided by intermediate visual outputs.

A central open question across these efforts is how to balance the expressiveness of generated code with the reliability of execution, especially when handling complex or ambiguous user instructions that require nuanced visual understanding.

Claimed Contributions

Thyme paradigm for autonomous code-based image manipulation and computation

The authors introduce Thyme, a paradigm enabling multimodal large language models to autonomously generate and execute diverse image processing operations and mathematical computations via code. This approach supports operations like cropping, rotation, and contrast enhancement while maintaining high autonomy in deciding when and how to apply these operations.

10 retrieved papers
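The report contains no code listing for this contribution, but a minimal Pillow sketch conveys the kind of snippet such a model might emit and execute on the fly; the image size, crop box, and enhancement factor below are hypothetical, not taken from the paper:

```python
from PIL import Image, ImageEnhance

# Build a small synthetic image so the sketch is self-contained;
# a Thyme-style model would instead operate on the actual input image.
img = Image.new("RGB", (800, 600), color=(120, 120, 120))

# Hypothetical manipulation sequence: zoom into a region, reorient it,
# and boost contrast so fine detail becomes legible before answering.
region = img.crop((100, 200, 600, 500))              # (left, upper, right, lower)
region = region.rotate(90, expand=True)              # 90-degree counter-clockwise turn
region = ImageEnhance.Contrast(region).enhance(1.8)  # 1.8x contrast boost

print(region.size)
```

After the crop the region is 500x300 pixels, and the expanded 90-degree rotation swaps the axes, so the script prints `(300, 500)`.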
Two-stage training strategy with SFT and RL phases

The authors develop a two-stage training approach: supervised fine-tuning on 500K curated samples teaches code generation for image operations and computations, followed by reinforcement learning to refine the model's decision-making. The SFT stage requires only 200 GPU hours to activate these fundamental abilities.

10 retrieved papers
Can Refute
GRPO-ATS algorithm with adaptive temperature sampling

The authors propose GRPO-ATS, a reinforcement learning algorithm that uses adaptive temperature sampling: temperature 0 for code generation to ensure determinism and execution validity, and temperature 1 for natural language reasoning to encourage diverse exploration. This balances creative reasoning with accurate, executable code.

10 retrieved papers
Can Refute
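As a sketch only (the sampler and function names below are illustrative assumptions, not the paper's implementation), the dual-temperature idea amounts to switching the decoding temperature based on whether the decoder is currently emitting code:

```python
import math
import random

def sample_token(logits, temperature):
    """Sample one token index from raw logits; temperature 0 means greedy argmax."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])  # deterministic
    scaled = [l / temperature for l in logits]    # temperature-scaled logits
    m = max(scaled)                               # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]   # unnormalized softmax weights
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

def pick_temperature(in_code_block):
    """GRPO-ATS-style schedule: deterministic code, exploratory reasoning text."""
    return 0.0 if in_code_block else 1.0
```

In a real rollout loop the `in_code_block` flag would be toggled by the model's own code-fence delimiters, so sampled trajectories stay diverse in their natural-language reasoning while every emitted program remains deterministic and executable.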

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Thyme paradigm for autonomous code-based image manipulation and computation

Contribution

Two-stage training strategy with SFT and RL phases

Contribution

GRPO-ATS algorithm with adaptive temperature sampling
