Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: reasoning, test-time, instance-level, policy gradient, latent space, latent reasoning
Abstract:

Large Language Models (LLMs) typically reason through explicit, step-by-step natural-language traces. Humans, however, also rely on non-linguistic, unconscious processes, such as the insights that emerge during an incubation period. In this work, we introduce LatentSeek, a novel framework designed to enhance the reasoning capabilities of LLMs through Test-Time Instance-level Policy Gradient within the model's latent space, thus complementing explicit natural-language steps. LatentSeek employs policy gradient optimization to iteratively refine latent representations, guided solely by a self-generated reward signal, which allows the model to adapt its reasoning trajectory dynamically on a per-instance basis. Empirical evaluations across diverse benchmarks (GSM8K, MATH-500, and AIME 2024) and multiple LLM families (e.g., LLaMA, Qwen) demonstrate that LatentSeek outperforms established baselines, including Chain-of-Thought (CoT) prompting, Best-of-N (BoN) sampling, and training-based methods. Further analysis indicates that LatentSeek is computationally efficient, typically converging within a few optimization iterations for average-difficulty problems. Moreover, performance improves as the number of latent update iterations increases, highlighting the benefits of exploring within the latent space. These findings establish LatentSeek as a lightweight and effective paradigm for improving the reasoning capabilities of LLMs without changing their parameters.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LatentSeek, a framework that applies policy gradient optimization to refine latent representations at test time for improved reasoning. It resides in the 'Policy Gradient-Based Latent Optimization' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of test-time latent reasoning. This leaf sits under 'Test-Time Latent Reasoning Optimization,' a branch that contrasts with training-based methods and multimodal approaches, suggesting the work occupies a focused niche exploring gradient-driven test-time adaptation rather than search-based or training-heavy alternatives.

The taxonomy reveals neighboring leaves such as 'Adaptive Compute Allocation in Latent Space' and 'Multimodal Latent Reasoning at Test Time,' which explore dynamic resource allocation and cross-modal reasoning respectively. LatentSeek diverges from these by concentrating on policy gradient updates within a single modality's latent space, rather than multimodal fusion or adaptive compute budgets. The broader 'Training-Based Latent Reasoning Frameworks' branch contains methods like reinforcement learning for latent reasoning and latent state transition modeling, which differ fundamentally by requiring parameter updates during training rather than test-time optimization alone.

Among the three contributions analyzed, the first two—LatentSeek framework and policy gradient optimization method—appear relatively novel within the limited search scope of 29 candidates, with zero refutable candidates found across 19 examined papers. The third contribution, test-time scaling analysis, encountered one refutable candidate among 10 examined, suggesting some prior work exists on analyzing computational scaling in latent reasoning. The statistics indicate that while the core framework and optimization approach show limited overlap with the examined literature, the scaling analysis component has more substantial prior coverage, though the search scope remains modest.

Based on the limited top-K semantic search and citation expansion covering 29 candidates, the work appears to occupy a sparsely populated research direction with minimal direct overlap in its core contributions. However, the analysis does not cover exhaustive literature review, and the single refutable pair for the scaling contribution suggests adjacent work exists. The taxonomy structure confirms this is an emerging area with few sibling papers, though definitive novelty claims would require broader literature coverage beyond the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 20
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: test-time reasoning enhancement through latent-space policy gradient optimization. The field centers on improving model reasoning capabilities by optimizing latent representations at inference time, rather than relying solely on pre-trained weights.

The taxonomy reveals several major branches. Test-Time Latent Reasoning Optimization focuses on methods that adapt or search within latent spaces during inference, often using policy gradients or search strategies to refine intermediate reasoning steps. Training-Based Latent Reasoning Frameworks emphasize learning structured latent representations during the training phase that facilitate downstream reasoning. Multimodal Latent Visual Reasoning extends these ideas to vision-language settings, where latent codes must bridge modalities. Domain-Specific Applications tailor latent reasoning to particular tasks such as robotics or autonomous driving, while Latent Skill and State Representation Learning explores how to discover and leverage abstract state or skill embeddings. Representative works like Seek in the Dark[1] and LARES[2] illustrate test-time optimization approaches, whereas Monet[4] and Latent Visual Reasoning[5] highlight multimodal extensions.

Within the test-time optimization branch, a particularly active line of work explores policy gradient-based methods for refining latent reasoning trajectories on the fly. Test-Time Policy Gradient[0] sits squarely in this cluster, emphasizing direct policy gradient updates in latent space to enhance reasoning quality without additional training. This contrasts with nearby approaches: Seek in the Dark[1] employs search-based strategies to navigate latent spaces, while Thinking on the Fly[11] and Think Silently Fast[12] investigate adaptive computation budgets and silent reasoning tokens. The central trade-off across these methods is balancing computational overhead at test time against gains in reasoning accuracy or robustness. Test-Time Policy Gradient[0] distinguishes itself by leveraging policy gradients to iteratively improve latent representations, positioning it as a gradient-driven alternative to search-heavy or token-based reasoning augmentation strategies.

Claimed Contributions

LATENTSEEK framework for test-time instance-level policy gradient in latent space

The authors propose LATENTSEEK, a framework that enhances LLM reasoning by performing test-time optimization of latent representations using policy gradient methods. Unlike training-based approaches, it operates on frozen models and dynamically refines reasoning trajectories for each problem instance without parameter updates.

10 retrieved papers
Policy gradient optimization method for latent representations

The authors develop a policy gradient-based optimization procedure that iteratively updates token-wise latent representations guided by self-generated reward signals. This method treats latent representations as independent variables and uses REINFORCE to perform gradient ascent in the latent space.

9 retrieved papers
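As a rough illustration of this claimed procedure (not the authors' implementation), the toy sketch below treats a single latent vector as the free variable and applies a REINFORCE-style score-function update against a stand-in reward. Everything here is invented for demonstration: the quadratic `self_reward`, the hidden `target`, the Gaussian perturbation policy, and all hyperparameters. The paper instead uses a frozen LLM's token-wise latent representations and a self-generated reward signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for the self-generated reward: higher when the latent z
# is closer to a hidden "good reasoning" direction (purely illustrative).
target = rng.normal(size=8)

def self_reward(z):
    return -float(np.sum((z - target) ** 2))

def reinforce_step(z, lr=0.1, sigma=0.1, n_samples=32):
    """One REINFORCE update, treating z as the mean of a Gaussian policy N(z, sigma^2 I)."""
    eps = rng.normal(size=(n_samples, z.size))
    rewards = np.array([self_reward(z + sigma * e) for e in eps])
    baseline = rewards.mean()  # variance-reducing baseline
    # grad_z log N(z + sigma*eps; z, sigma^2 I) = eps / sigma
    grad = ((rewards - baseline)[:, None] * eps).mean(axis=0) / sigma
    return z + lr * grad       # gradient ascent on expected reward

z = rng.normal(size=8)         # initial latent (produced by a frozen model in the paper)
before = self_reward(z)
for _ in range(50):            # iterative test-time refinement of the latent
    z = reinforce_step(z)
after = self_reward(z)
print(f"reward before: {before:.2f}  after: {after:.2f}")
```

In a real setup one would keep the model weights frozen and differentiate only through the policy's log-probabilities with respect to the latent states; the numpy version above isolates the score-function estimator itself.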
Test-time scaling analysis in latent space

The authors demonstrate that reasoning performance improves as the number of latent-space optimization iterations increases, establishing a complementary scaling dimension beyond token generation. This reveals that exploration within the latent space offers a promising direction for test-time scaling.

10 retrieved papers (1 candidate can refute)
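The claimed scaling trend can be mimicked in a toy setting: sweeping the optimization budget of a REINFORCE-style latent update on an invented quadratic reward (a sketch under stand-in assumptions; `self_reward`, `target`, and all hyperparameters are hypothetical, not the paper's LLM experiments) shows the final reward improving as the number of latent update iterations grows.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=8)    # hidden optimum of the invented stand-in reward

def self_reward(z):
    return -float(np.sum((z - target) ** 2))

def optimize(z0, n_iters, lr=0.1, sigma=0.1, n_samples=32):
    """Run n_iters REINFORCE-style latent updates; return the final reward."""
    z = z0.copy()
    for _ in range(n_iters):
        eps = rng.normal(size=(n_samples, z.size))
        rewards = np.array([self_reward(z + sigma * e) for e in eps])
        grad = ((rewards - rewards.mean())[:, None] * eps).mean(axis=0) / sigma
        z = z + lr * grad
    return self_reward(z)

z0 = rng.normal(size=8)        # same starting latent for every budget
budgets = [0, 5, 20, 80]
scores = [optimize(z0, b) for b in budgets]
for b, s in zip(budgets, scores):
    print(f"{b:3d} iterations -> reward {s:8.2f}")
```

The point of the sweep is the trend, not the numbers: larger iteration budgets buy better rewards until the stochastic-gradient noise floor is reached, mirroring the scaling dimension the contribution describes.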

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
