ProxyThinker: Test-Time Guidance through Small Visual Reasoners

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: decoding-time algorithms, visual reasoning
Abstract:

Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities of large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities of small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as evidenced by emergent behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks spanning spatial, mathematical, and multidisciplinary reasoning, enabling untuned base models to compete with their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves faster inference than previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://anonymous.4open.science/r/ProxyThinker-FAAF.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ProxyThinker, an inference-time technique that transfers visual reasoning capabilities from small reinforcement-fine-tuned models to large base models by manipulating output distributions during decoding. It resides in the Distribution-Based Inference-Time Guidance leaf, which currently contains only this paper. This positioning suggests the work occupies a relatively sparse research direction within the broader taxonomy of test-time transfer methods, distinguishing it from the more heavily populated knowledge distillation and fine-tuning branches that dominate the field.

The taxonomy reveals neighboring approaches in Test-Time Inference Guidance, though no other papers share the exact distribution-based mechanism. Adjacent branches include Knowledge Distillation methods that require training phases and Reasoning Transfer techniques that generate synthetic data for fine-tuning. ProxyThinker diverges from these by avoiding model retraining entirely, instead leveraging runtime distribution subtraction. The scope boundaries clarify that training-based transfer methods belong elsewhere, positioning this work as a pure inference-time intervention that contrasts with the distillation-heavy approaches found in sibling branches like Cross-Modal Knowledge Distillation and Chain-of-Thought Reasoning Transfer.

Among 29 candidates examined across three contributions, the logit-delta steering mechanism shows overlap with one prior work, while the core ProxyThinker technique and parallelism implementation appear more novel within the limited search scope. Specifically, the logit-delta contribution examined 9 candidates with 1 refutable match, suggesting some precedent for distribution manipulation strategies. The other two contributions each examined 10 candidates with no clear refutations, indicating less direct prior work among the top semantic matches. These statistics reflect a focused literature search rather than exhaustive coverage, leaving open the possibility of additional related work beyond the top-K results.

Based on the limited search scope of 29 candidates, the work appears to introduce a relatively fresh approach to test-time reasoning transfer, particularly in its application to vision-language models. The sparse taxonomy leaf and low refutation rate suggest novelty, though the single overlap on logit-delta steering indicates the underlying distribution manipulation concept has precedent. The analysis covers top semantic matches and does not claim exhaustive field coverage.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: Test-time transfer of visual reasoning capabilities from small to large models. The field addresses how to leverage compact models to enhance the reasoning performance of larger vision-language systems without retraining.

The taxonomy reveals four main branches. Test-Time Inference Guidance and Decoding Methods focus on runtime mechanisms that steer large model outputs using signals from smaller models, often through distribution alignment or process-level guidance (e.g., Virgo[3], Vision Language Process Rewards[4]). Knowledge Distillation for Vision-Language Transfer encompasses techniques that compress reasoning patterns from teachers to students, including compositional and modality-balanced approaches (e.g., CompoDistill[9], Mitigate Modality Imbalance[7]). Reasoning Transfer via Fine-Tuning and Data Generation explores how synthetic data or intermediate reasoning steps can be generated by small models to improve larger ones (e.g., Step Back Visual Reasoning[10], Symbolic to Object Embeddings[11]). Generalization and Robustness in Visual Reasoning examines how transferred capabilities hold up across diverse tasks and domains, addressing questions of transferability and compositional understanding.

A central tension across these branches is whether to intervene at inference time or at training time, and whether to rely on explicit symbolic reasoning or implicit distributional guidance. ProxyThinker[0] sits within the Distribution-Based Inference-Time Guidance cluster, emphasizing runtime transfer without model updates. This contrasts with distillation-focused works like Reasoning Teachers[5] or Pretraining Knowledge Distillation[8], which bake reasoning patterns into model weights during training.

Compared to process-reward methods such as Vision Language Process Rewards[4], ProxyThinker[0] appears to leverage distributional alignment from a smaller proxy model to guide the larger model's decoding, offering a lightweight alternative that avoids the overhead of reward modeling or fine-tuning. The broader challenge remains how to balance the efficiency gains of small-model guidance with the need for robust generalization across varied visual reasoning tasks.

Claimed Contributions

ProxyThinker inference-time technique for visual reasoning transfer

The authors introduce ProxyThinker, a training-free decoding method that transfers visual reasoning abilities from small RFT-trained models to large base models by subtracting amateur model logits from expert model logits during inference, enabling slow-thinking reasoning behaviors without parameter updates.
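The logit arithmetic described above can be sketched in a few lines. This is a minimal illustration of the claimed decoding rule, not the authors' implementation: the function name `proxythinker_logits` and the scaling coefficient `alpha` are our own notation, and the toy four-token vocabulary is fabricated for demonstration.

```python
import numpy as np

def proxythinker_logits(base, expert, amateur, alpha=1.0):
    """Combine per-step logits as base + alpha * (expert - amateur).

    `alpha` is a hypothetical guidance-strength hyperparameter; the
    report does not specify the value used by the authors.
    """
    return np.asarray(base) + alpha * (np.asarray(expert) - np.asarray(amateur))

def greedy_token(logits):
    """Pick the highest-scoring token id (greedy decoding)."""
    return int(np.argmax(logits))

# Toy vocabulary of 4 tokens.
base    = [2.0, 1.0, 0.5, 0.0]  # large base model logits
expert  = [0.0, 3.0, 0.0, 0.0]  # small RFT reasoner strongly prefers token 1
amateur = [0.0, 0.0, 0.0, 0.0]  # small pre-RFT (amateur) model

token = greedy_token(proxythinker_logits(base, expert, amateur, alpha=1.0))
```

With these toy values, the expert's preference shifts the combined distribution so that token 1 wins greedy decoding even though the base model alone preferred token 0.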

10 retrieved papers
Efficient multi-model implementation with parallelism techniques

The authors develop an optimized implementation on vLLM that leverages tensor parallelism and asynchronous execution across multiple models, achieving 38× speedup over previous decoding-time steering methods while minimizing GPU idle time.
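The asynchronous-execution idea can be illustrated with a small sketch: the three per-step forward passes are launched concurrently so that no model blocks the others. The `forward` coroutine, model names, and delays below are all stand-ins for the real vLLM calls, which this report does not detail.

```python
import asyncio

async def forward(model_name, latency_s):
    # Stand-in for a per-model forward pass producing next-token logits.
    # In the real system this would be a vLLM engine call; here we just
    # simulate latency and return a tagged placeholder.
    await asyncio.sleep(latency_s)
    return f"{model_name}:logits"

async def decode_step():
    # Run all three forward passes concurrently rather than sequentially,
    # mirroring the asynchronous execution described above. The wall-clock
    # cost of the step is then roughly max(latencies), not their sum.
    return await asyncio.gather(
        forward("base-72B", 0.03),
        forward("expert-7B", 0.01),
        forward("amateur-7B", 0.01),
    )

results = asyncio.run(decode_step())
```

The combined logits for one decoding step would then be computed from the three returned tensors, and the chosen token broadcast back to every model's running context.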

10 retrieved papers
Logit-delta steering mechanism for reasoning behavior transfer

The method modifies next-token prediction by adding a scaled difference between expert and amateur model logits to the base model logits, successfully transferring sophisticated reasoning behaviors such as self-verification and self-correction from small to large models.
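The steering rule described above can be written compactly as follows. The symbols are ours, not the paper's: \(z_t\) denotes the next-token logit vector at step \(t\) for each model, and \(\alpha\) is a guidance-strength coefficient.

```latex
% Logit-delta steering (notation assumed for illustration):
\tilde{z}_t = z_t^{\text{base}} + \alpha\,\bigl(z_t^{\text{expert}} - z_t^{\text{amateur}}\bigr),
\qquad
p(x_t \mid x_{<t}) = \operatorname{softmax}(\tilde{z}_t)
```

Since the expert and amateur share the small model's architecture and differ only by RFT, their logit difference isolates the behavioral shift induced by reinforcement fine-tuning, which is then applied on top of the large base model's distribution.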

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ProxyThinker inference-time technique for visual reasoning transfer

The authors introduce ProxyThinker, a training-free decoding method that transfers visual reasoning abilities from small RFT-trained models to large base models by subtracting amateur model logits from expert model logits during inference, enabling slow-thinking reasoning behaviors without parameter updates.

Contribution

Efficient multi-model implementation with parallelism techniques

The authors develop an optimized implementation on vLLM that leverages tensor parallelism and asynchronous execution across multiple models, achieving 38× speedup over previous decoding-time steering methods while minimizing GPU idle time.

Contribution

Logit-delta steering mechanism for reasoning behavior transfer

The method modifies next-token prediction by adding a scaled difference between expert and amateur model logits to the base model logits, successfully transferring sophisticated reasoning behaviors such as self-verification and self-correction from small to large models.