Test-Time Alignment for Large Language Models via Textual Model Predictive Control

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Test-time preference alignment, Large Language Models, Machine translation
Abstract:

Aligning Large Language Models (LLMs) with human preferences through fine-tuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are defined at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory and propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, in which TMPC retrospectively analyzes generated rollouts to identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, in which the identified subgoals guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC builds upon previously validated successes and improves stably. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.
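The loop the abstract describes — roll out candidates, retrospectively score intermediate prefixes, then re-generate conditioned on the best one — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `generate`, `reward`, and `segment` are hypothetical placeholders for the task-specific LLM sampler, the preference reward model, and the segmenter (e.g., sentence splitting for translation).

```python
# Hypothetical sketch of the TMPC loop. `generate`, `reward`, and
# `segment` are placeholders, not the paper's actual components.

def tmpc_align(prompt, generate, reward, segment, iterations=3, rollouts=4):
    """Test-time alignment via textual MPC (illustrative sketch).

    generate(prompt, prefix) -> str : LLM rollout continuing a given prefix
    reward(prompt, text) -> float   : scores a (possibly partial) output
    segment(text) -> list[str]      : splits a rollout into candidate segments
    """
    subgoal = ""  # validated high-reward prefix carried across iterations
    best_text, best_score = None, float("-inf")

    for _ in range(iterations):
        # Subgoal-conditioned re-generation: roll out several candidate
        # completions that continue from the current subgoal.
        candidates = [subgoal + generate(prompt, subgoal) for _ in range(rollouts)]

        # Hindsight subgoal identification: among all intermediate prefixes
        # of all rollouts, keep the highest-reward one as the next subgoal.
        prefix_scores = []
        for text in candidates:
            pieces = segment(text)
            for k in range(1, len(pieces) + 1):
                prefix = "".join(pieces[:k])
                prefix_scores.append((reward(prompt, prefix), prefix))
        if prefix_scores:
            _, subgoal = max(prefix_scores)

        # Track the best complete response seen so far.
        for text in candidates:
            score = reward(prompt, text)
            if score > best_score:
                best_text, best_score = text, score

    return best_text
```

With a toy sampler that always emits one more segment and a length-based reward, each iteration extends the validated prefix by one segment, which mirrors the "build upon previously validated successes" behavior described above.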

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Textual Model Predictive Control (TMPC), a planning-based framework for test-time alignment that draws on Model Predictive Control from control theory. It resides in the 'Tree Search and Planning Algorithms' leaf under 'Inference-Time Alignment via Response-Level Optimization', alongside two sibling papers (Reward Guided Tree and Tree Search Alignment). This leaf represents a focused but active research direction within the broader taxonomy of 50 papers across approximately 36 topics, indicating moderate crowding in the planning-based alignment space.

The taxonomy reveals that TMPC's leaf sits within a larger response-level optimization branch that includes Best-of-N sampling, iterative refinement, and continuous latent space methods. Neighboring branches address token-level decoding guidance and personalized multi-objective alignment. The scope note for TMPC's leaf explicitly includes 'reward-guided tree search or predictive planning' while excluding 'simple reranking and iterative textual refinement', positioning the work at the intersection of structured search and sequential decision-making. This placement suggests the paper engages with a well-defined but not oversaturated research direction.

Among 23 candidates examined across three contributions, no clearly refutable prior work was identified. The TMPC framework itself was assessed against 10 candidates with no refutations found; Hindsight Subgoal Identification examined 3 candidates with no overlaps; and Subgoal-Conditioned Re-Generation reviewed 10 candidates, also without refutation. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The absence of refutable pairs among this candidate set suggests that the specific combination of MPC-inspired planning with hindsight subgoal discovery may represent a relatively unexplored angle within the planning-based alignment space.

Based on the limited search of 23 candidates, the work appears to occupy a distinct position within its taxonomy leaf, though the small candidate pool and focused sibling set (only two other papers) constrain definitive novelty claims. The analysis captures top-K semantic matches and does not guarantee comprehensive coverage of all relevant planning or hierarchical reinforcement learning methods that might inform test-time alignment. The contribution-level statistics suggest novelty in the specific technical approach, but broader field-wide uniqueness remains uncertain given the search limitations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: test-time alignment of large language models with human preferences. The field has organized itself around several complementary strategies for steering model behavior at inference time without retraining. Inference-Time Alignment via Decoding Guidance focuses on token-level interventions during generation, while Response-Level Optimization methods such as tree search and planning algorithms explore multiple candidate outputs and select or refine them using reward signals. Personalized and Multi-Objective branches address the challenge of aligning to diverse or conflicting user preferences, as seen in works like Multidomain Preference Spectrum[3] and Personaagent[2]. Efficient methods leverage sentence-level or diffusion-style techniques to reduce computational overhead, and specialized branches tackle domain-specific alignment (e.g., multimodal settings surveyed in Multimodal LLM Survey[4]) or provide theoretical foundations and evaluation benchmarks. Training-free approaches and reward modeling studies further enrich the landscape by exploring how to guide models without extensive fine-tuning or by improving the quality of preference signals themselves.

Within the response-level optimization branch, tree search and planning algorithms represent a particularly active line of work that balances exploration and exploitation to find high-reward outputs. Textual Model Predictive Control[0] sits squarely in this cluster, employing a planning-based framework to iteratively refine generation trajectories. It shares conceptual ground with Reward Guided Tree[13] and Tree Search Alignment[37], which similarly use structured search to navigate the space of possible responses. Compared to methods like Inferaligner[5] that optimize at the response level through different mechanisms, or Integrated Value Guidance[6] that blends value functions into decoding, the planning-based approaches emphasize lookahead and sequential decision-making. This positioning highlights an ongoing tension in the field: whether to guide generation through local token-level adjustments, global response reranking, or intermediate planning strategies that explicitly model future consequences.

Claimed Contributions

Textual Model Predictive Control (TMPC) framework

The authors propose TMPC, a novel predictive planning framework adapted from Model Predictive Control in control theory for aligning LLMs at inference time without parameter updates. TMPC addresses the curse of horizon in guided decoding and the curse of dimensionality in iterative refinement by operating at an intermediate subgoal level.

10 retrieved papers
Hindsight Subgoal Identification principle

This principle enables TMPC to discover meaningful planning steps by retrospectively analyzing generated rollouts and identifying high-quality intermediate points as subgoals. This addresses the problem of lacking natural boundaries in text generation by dynamically discovering task-specific planning units.

3 retrieved papers
Subgoal-Conditioned Re-Generation principle

This principle ensures stable, cumulative progress by storing identified subgoals in a buffer and using them to condition subsequent generation iterations. By building upon previously validated high-quality subgoals, TMPC ensures that each iteration improves upon proven successes rather than exploring randomly.

10 retrieved papers
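The buffer-and-condition mechanism described in this contribution can be sketched as a small prompt-construction helper. The prompt format here is purely hypothetical (the report does not specify one); it only illustrates how subgoals stored in a buffer could condition the next generation round so that rollouts continue from validated output rather than starting fresh.

```python
# Illustrative sketch of subgoal-conditioned re-generation.
# The prompt wording and buffer format are hypothetical.

def conditioned_prompt(task_prompt, subgoal_buffer):
    """Build a re-generation prompt that conditions on buffered subgoals.

    task_prompt: the original task instruction.
    subgoal_buffer: list of previously validated high-quality segments.
    """
    if not subgoal_buffer:
        # First iteration: no validated subgoals yet, generate from scratch.
        return task_prompt
    validated = "\n".join(subgoal_buffer)
    return (
        f"{task_prompt}\n\n"
        f"Partial output validated so far (keep it and continue from it):\n"
        f"{validated}\n"
    )
```

Each iteration would append its best hindsight-identified segment to `subgoal_buffer`, so later rollouts explore only the remainder of the response space instead of regenerating everything at random.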

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Textual Model Predictive Control (TMPC) framework
Contribution: Hindsight Subgoal Identification principle
Contribution: Subgoal-Conditioned Re-Generation principle