TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Process Reward Model, Tabular Reasoning, Tool Integration, Test-time Scaling
Abstract:

Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TaTToo, a process reward model framework specifically designed for tabular reasoning with test-time scaling. It resides in the 'Tool-Augmented PRMs for Tabular Reasoning' leaf, which contains only two papers total (including this work). This represents a sparse, emerging research direction within the broader taxonomy of nine papers across process reward modeling. The work addresses a recognized gap: existing PRMs struggle with table-specific operations like sub-table retrieval and schema interaction, motivating a domain-specialized approach.

The taxonomy reveals that TaTToo sits at the intersection of three broader research threads: domain-specific process reward modeling, test-time scaling strategies, and PRM training paradigms. Neighboring leaves include 'Generative and Reasoning-Driven PRMs' (three papers) and 'Inference-Time Scaling for Tabular Reasoning Tasks' (two papers). The framework bridges these areas by combining tool-based verification (domain-specific) with reinforcement learning for test-time search (inference-time scaling). This positioning suggests the work synthesizes ideas from multiple established directions rather than pioneering an entirely new branch.

Among the three contributions analyzed, the literature search examined twenty-eight candidates total. The 'Tool-Grounded Thinking PRM Framework' examined ten candidates with zero refutable matches; the 'Scalable Data Curation Pipeline' also examined ten with zero refutations; the 'Dual-Stage Training Paradigm' examined eight with zero refutations. These statistics indicate that within the limited search scope, no prior work was identified that directly overlaps with the specific combination of tool-grounded process rewards and dual-stage training for tabular reasoning. However, the small candidate pool and sparse taxonomy leaf suggest this assessment reflects limited coverage rather than exhaustive validation.

Given the sparse taxonomy leaf and limited search scope, the work appears to occupy a relatively unexplored niche combining process reward modeling with tabular tool use. The absence of refutable candidates across all contributions may reflect genuine novelty in this specific integration, or may indicate that the semantic search did not surface closely related work outside the top-thirty matches. The analysis covers domain-specific PRM design but does not exhaustively address broader tabular reasoning or general test-time scaling literature.

Taxonomy

Core-task Taxonomy Papers: 9
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: process reward modeling for tabular reasoning with test-time scaling. The field structure reflects a convergence of process-level supervision, domain-specific reasoning challenges, and inference-time computation strategies. The taxonomy organizes work into four main branches: architectures and training paradigms for process reward models (PRMs) that provide step-by-step feedback; domain-specific applications where PRMs are tailored to particular reasoning contexts such as mathematical problem-solving or structured data interpretation; test-time scaling strategies that allocate additional computation during inference to improve solution quality; and comprehensive surveys synthesizing these developments. Representative works like Process Reward Thinking[1] and Rewarding Progress[3] illustrate foundational PRM training methods, while efforts such as GenPRM[5] explore generative formulations that expand the scope of process supervision beyond traditional classification-based reward assignment.

A closely related line of work focuses on tool-augmented PRMs for tabular reasoning, where models must interact with structured data through executable operations. This setting introduces unique challenges in credit assignment and verification, as intermediate steps involve both symbolic manipulation and semantic understanding of table contents. TaTToo[0] situates itself within this specialized branch, emphasizing the integration of process rewards with test-time search over tool-assisted reasoning traces. Compared to Table-r1[4] and Table-R1[6], which also target tabular domains, TaTToo[0] places stronger emphasis on the interplay between step-level reward signals and adaptive inference-time computation budgets. Meanwhile, approaches like Adaptive Test-Time[8] explore dynamic allocation strategies across domains, and R-PRM[7] investigates reward model robustness.

The central tension across these works involves balancing the granularity of process supervision, the computational overhead of test-time scaling, and the reliability of learned reward signals in guiding multi-step reasoning over structured data.

Claimed Contributions

TaTToo: Tool-Grounded Thinking PRM Framework

The authors introduce TaTToo, a process reward model specifically designed for tabular reasoning that provides step-level supervision by explicitly reasoning over table operations and incorporating external tools for verification. This framework addresses limitations of existing PRMs in supervising table retrieval and schema interaction steps.
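
To make the idea of step-level supervision concrete, the following is a minimal sketch of how per-step PRM scores might drive Best-of-N selection, one common test-time scaling strategy. It is not the authors' implementation: `Candidate`, `best_of_n`, the min-aggregation rule, and the keyword-based `toy_score_step` are all hypothetical stand-ins. A real PRM would replace the toy scorer with a model call, and in a tool-grounded setting that call could also consult tool execution results.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    """One sampled reasoning trace from the policy model (hypothetical type)."""
    steps: List[str]
    answer: str


# A step scorer sees the preceding steps plus the current step and returns a score in [0, 1].
StepScorer = Callable[[Sequence[str], str], float]


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a trace by its weakest step (one assumed aggregation choice)."""
    return min(step_scores) if step_scores else 0.0


def best_of_n(
    candidates: Sequence[Candidate],
    score_step: StepScorer,
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> Candidate:
    """Return the candidate whose aggregated step-level score is highest."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def trace_score(cand: Candidate) -> float:
        scores = [score_step(cand.steps[:i], step) for i, step in enumerate(cand.steps)]
        return aggregate(scores)

    return max(candidates, key=trace_score)


def toy_score_step(previous_steps: Sequence[str], step: str) -> float:
    """Toy stand-in for a PRM: distrust steps that admit to guessing."""
    return 0.2 if "guess" in step.lower() else 0.9


candidates = [
    Candidate(steps=["select rows where year == 2020", "guess the total"], answer="41"),
    Candidate(steps=["select rows where year == 2020", "sum the revenue column"], answer="42"),
]
best = best_of_n(candidates, toy_score_step)
print(best.answer)
```

Min-aggregation is used here only because a single faulty table operation usually invalidates the whole trace; product or last-step aggregation would slot into the same `aggregate` parameter.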

10 retrieved papers

Scalable Data Curation Pipeline with Tool-Augmented Annotations

The authors develop a three-stage data curation pipeline that synthesizes over 60,000 high-quality training instances by collecting expert verification rationales, assigning table-aware rewards, and augmenting them with tool invocations and execution results for training the PRM.
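
As an illustration only, the three stages described above (verification rationale, table-aware reward, tool invocation with execution result) could map onto a single step-level record like the one sketched below. All names here (`StepAnnotation`, `TOOLS`, `column_sum`, `attach_tool_result`) and the 0/1 reward encoding are assumptions made for this sketch; the report does not specify the actual schema or tool set.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Dict, List, Optional

Table = List[Dict[str, Any]]  # a table as a list of row dictionaries


@dataclass(frozen=True)
class StepAnnotation:
    """Hypothetical step-level training record."""
    step_text: str                      # the policy model's reasoning step
    rationale: str                      # stage 1: verification rationale
    reward: int                         # stage 2: 1 = step judged correct, 0 = incorrect (assumed encoding)
    tool_call: Optional[str] = None     # stage 3: textual form of the tool invocation
    tool_result: Optional[Any] = None   # stage 3: what executing the tool returned


def column_sum(table: Table, column: str) -> float:
    """Example tool: sum one numeric column."""
    return sum(row[column] for row in table)


# Registry of available tools (only one toy tool in this sketch).
TOOLS: Dict[str, Callable[..., Any]] = {"column_sum": column_sum}


def attach_tool_result(ann: StepAnnotation, table: Table, tool_name: str, **kwargs: Any) -> StepAnnotation:
    """Run a registered tool on the table and return a copy of the annotation with the result attached."""
    result = TOOLS[tool_name](table, **kwargs)
    return replace(ann, tool_call=f"{tool_name}({kwargs})", tool_result=result)


table = [{"year": 2020, "revenue": 10}, {"year": 2020, "revenue": 32}]
ann = StepAnnotation(
    step_text="The total 2020 revenue is 42.",
    rationale="Summing the revenue column over the retrieved rows should give the stated total.",
    reward=1,
)
ann = attach_tool_result(ann, table, "column_sum", column="revenue")
print(ann.tool_result)
```

Keeping the executed result alongside the rationale lets a later consistency check compare what the step claims against what the tool actually returned.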

10 retrieved papers

Dual-Stage Training Paradigm with Tool-Grounded Reward Shaping

The authors propose a two-stage training approach that first uses supervised fine-tuning to learn tool-integrated verification patterns, then applies reinforcement learning with a novel reward shaping scheme that includes label-matching, confidence calibration, and tool-grounding components to optimize the PRM for accurate table verification.
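
The report names three reward components (label-matching, confidence calibration, tool-grounding) but does not give their functional form. The sketch below shows one plausible way such terms could be combined into a scalar reward. The Brier-style calibration term, the binary tool-grounding flag, and the weights `w_label`, `w_calib`, and `w_tool` are assumptions chosen for illustration, not the paper's formulation.

```python
def shaped_reward(
    pred_label: bool,
    gold_label: bool,
    confidence: float,
    grounded_in_tool_output: bool,
    w_label: float = 1.0,   # hypothetical weight
    w_calib: float = 0.5,   # hypothetical weight
    w_tool: float = 0.5,    # hypothetical weight
) -> float:
    """Combine three assumed reward terms into one scalar.

    pred_label / gold_label: the PRM's verdict on a step and the annotated verdict.
    confidence: the PRM's stated probability that its own verdict is right, in [0, 1].
    grounded_in_tool_output: whether the verdict cites an executed tool result.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")

    # Label-matching: 1 when the verdict agrees with the annotation, else 0.
    label_term = 1.0 if pred_label == gold_label else 0.0

    # Calibration (Brier-style, assumed): high when confidence tracks correctness.
    calib_term = 1.0 - (confidence - label_term) ** 2

    # Tool-grounding (assumed binary): reward verdicts backed by tool execution.
    tool_term = 1.0 if grounded_in_tool_output else 0.0

    return w_label * label_term + w_calib * calib_term + w_tool * tool_term


# A confident, correct, tool-backed verdict versus a confident, wrong, ungrounded one.
print(shaped_reward(True, True, 1.0, True))
print(shaped_reward(True, False, 1.0, False))
```

Under these assumed terms, an overconfident wrong verdict loses both the label and calibration rewards, which is the kind of behavior a calibration component would be expected to discourage.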

8 retrieved papers
