Test-Time Alignment for Large Language Models via Textual Model Predictive Control

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Test-time preference alignment, Large Language Models, Machine translation
Abstract:

Aligning Large Language Models (LLMs) with human preferences through fine-tuning is resource-intensive, motivating lightweight alternatives at test time. We address test-time alignment through the lens of sequential decision making, a perspective that reveals two fundamental challenges. When actions are defined at the token level, as in guided decoding, alignment suffers from the curse of horizon. Conversely, when actions are defined at the response level, as in traditional iterative refinement, the curse of dimensionality emerges. To resolve this trade-off, we draw inspiration from Model Predictive Control (MPC) in control theory and propose Textual Model Predictive Control (TMPC), a novel predictive planning framework adapted for aligning LLMs at inference time. A key limitation of standard MPC is its reliance on predefined, hard segment boundaries, which are often absent in text generation. TMPC overcomes this by introducing two principles inspired by hierarchical reinforcement learning: (1) Hindsight Subgoal Identification, in which TMPC retrospectively analyzes generated rollouts to identify high-reward intermediate outputs as subgoals. This allows the framework to discover meaningful, task-specific planning steps (e.g., a sentence in machine translation or a bug fix in code generation). (2) Subgoal-Conditioned Re-Generation, in which the identified subgoals guide subsequent planning iterations. By conditioning on these proven, high-quality subgoals, TMPC builds upon previously validated successes and improves stably. TMPC is evaluated on three tasks with distinct segmentation properties: discourse-level translation, long-form response generation, and program synthesis. The results demonstrate that TMPC consistently improves performance, highlighting its generality.
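The loop the abstract describes — roll out candidates, retrospectively score intermediate prefixes, then re-generate conditioned on the best one — can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `generate`, `reward`, and `segment` are hypothetical placeholders for the task-specific LLM sampler, the preference reward model, and the segmenter (e.g., sentence splitting for translation).

```python
# Hypothetical sketch of the TMPC loop. `generate`, `reward`, and
# `segment` are placeholders, not the paper's actual components.

def tmpc_align(prompt, generate, reward, segment, iterations=3, rollouts=4):
    """Test-time alignment via textual MPC (illustrative sketch).

    generate(prompt, prefix) -> str : LLM rollout continuing a given prefix
    reward(prompt, text) -> float   : scores a (possibly partial) output
    segment(text) -> list[str]      : splits a rollout into candidate segments
    """
    subgoal = ""  # validated high-reward prefix carried across iterations
    best_text, best_score = None, float("-inf")

    for _ in range(iterations):
        # Subgoal-conditioned re-generation: roll out several candidate
        # completions that continue from the current subgoal.
        candidates = [subgoal + generate(prompt, subgoal) for _ in range(rollouts)]

        # Hindsight subgoal identification: among all intermediate prefixes
        # of all rollouts, keep the highest-reward one as the next subgoal.
        prefix_scores = []
        for text in candidates:
            pieces = segment(text)
            for k in range(1, len(pieces) + 1):
                prefix = "".join(pieces[:k])
                prefix_scores.append((reward(prompt, prefix), prefix))
        if prefix_scores:
            _, subgoal = max(prefix_scores)

        # Track the best complete response seen so far.
        for text in candidates:
            score = reward(prompt, text)
            if score > best_score:
                best_text, best_score = text, score

    return best_text
```

With a toy sampler that always emits one more segment and a length-based reward, each iteration extends the validated prefix by one segment, which mirrors the "build upon previously validated successes" behavior described above.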

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Textual Model Predictive Control (TMPC), a planning-based framework for test-time alignment that draws on Model Predictive Control from control theory. It resides in the 'Tree Search and Planning Algorithms' leaf under 'Inference-Time Alignment via Response-Level Optimization', alongside two sibling papers (Reward Guided Tree and Tree Search Alignment). This leaf represents a focused but active research direction within the broader taxonomy of 50 papers across approximately 36 topics, indicating moderate crowding in the planning-based alignment space.

The taxonomy reveals that TMPC's leaf sits within a larger response-level optimization branch that includes Best-of-N sampling, iterative refinement, and continuous latent space methods. Neighboring branches address token-level decoding guidance and personalized multi-objective alignment. The scope note for TMPC's leaf explicitly includes 'reward-guided tree search or predictive planning' while excluding 'simple reranking and iterative textual refinement', positioning the work at the intersection of structured search and sequential decision-making. This placement suggests the paper engages with a well-defined but not oversaturated research direction.

Among 23 candidates examined across three contributions, no clearly refutable prior work was identified. The TMPC framework itself was assessed against 10 candidates with no refutations found; Hindsight Subgoal Identification examined 3 candidates with no overlaps; and Subgoal-Conditioned Re-Generation reviewed 10 candidates, also without refutation. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The absence of refutable pairs among this candidate set suggests that the specific combination of MPC-inspired planning with hindsight subgoal discovery may represent a relatively unexplored angle within the planning-based alignment space.

Based on the limited search of 23 candidates, the work appears to occupy a distinct position within its taxonomy leaf, though the small candidate pool and focused sibling set (only two other papers) constrain definitive novelty claims. The analysis captures top-K semantic matches and does not guarantee comprehensive coverage of all relevant planning or hierarchical reinforcement learning methods that might inform test-time alignment. The contribution-level statistics suggest novelty in the specific technical approach, but broader field-wide uniqueness remains uncertain given the search limitations.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: test-time alignment of large language models with human preferences. The field has organized itself around several complementary strategies for steering model behavior at inference time without retraining. Inference-Time Alignment via Decoding Guidance focuses on token-level interventions during generation, while Response-Level Optimization methods such as tree search and planning algorithms explore multiple candidate outputs and select or refine them using reward signals. Personalized and Multi-Objective branches address the challenge of aligning to diverse or conflicting user preferences, as seen in works like Multidomain Preference Spectrum[3] and Personaagent[2]. Efficient methods leverage sentence-level or diffusion-style techniques to reduce computational overhead, and specialized branches tackle domain-specific alignment (e.g., multimodal settings surveyed in Multimodal LLM Survey[4]) or provide theoretical foundations and evaluation benchmarks. Training-free approaches and reward modeling studies further enrich the landscape by exploring how to guide models without extensive fine-tuning or by improving the quality of preference signals themselves.

Within the response-level optimization branch, tree search and planning algorithms represent a particularly active line of work that balances exploration and exploitation to find high-reward outputs. Textual Model Predictive Control[0] sits squarely in this cluster, employing a planning-based framework to iteratively refine generation trajectories. It shares conceptual ground with Reward Guided Tree[13] and Tree Search Alignment[37], which similarly use structured search to navigate the space of possible responses. Compared to methods like Inferaligner[5] that optimize at the response level through different mechanisms, or Integrated Value Guidance[6] that blends value functions into decoding, the planning-based approaches emphasize lookahead and sequential decision-making. This positioning highlights an ongoing tension in the field: whether to guide generation through local token-level adjustments, global response reranking, or intermediate planning strategies that explicitly model future consequences.

Claimed Contributions

Textual Model Predictive Control (TMPC) framework

The authors propose TMPC, a novel predictive planning framework adapted from Model Predictive Control in control theory for aligning LLMs at inference time without parameter updates. TMPC addresses the curse of horizon in guided decoding and the curse of dimensionality in iterative refinement by operating at an intermediate subgoal level.

10 retrieved papers
Hindsight Subgoal Identification principle

This principle enables TMPC to discover meaningful planning steps by retrospectively analyzing generated rollouts and identifying high-quality intermediate points as subgoals. This addresses the problem of lacking natural boundaries in text generation by dynamically discovering task-specific planning units.

3 retrieved papers
Subgoal-Conditioned Re-Generation principle

This principle ensures stable, cumulative progress by storing identified subgoals in a buffer and using them to condition subsequent generation iterations. By building upon previously validated high-quality subgoals, TMPC ensures that each iteration improves upon proven successes rather than exploring randomly.

10 retrieved papers
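The buffer-and-condition mechanism described in this contribution can be sketched as a small prompt-construction helper. The prompt format here is purely hypothetical (the report does not specify one); it only illustrates how subgoals stored in a buffer could condition the next generation round so that rollouts continue from validated output rather than starting fresh.

```python
# Illustrative sketch of subgoal-conditioned re-generation.
# The prompt wording and buffer format are hypothetical.

def conditioned_prompt(task_prompt, subgoal_buffer):
    """Build a re-generation prompt that conditions on buffered subgoals.

    task_prompt: the original task instruction.
    subgoal_buffer: list of previously validated high-quality segments.
    """
    if not subgoal_buffer:
        # First iteration: no validated subgoals yet, generate from scratch.
        return task_prompt
    validated = "\n".join(subgoal_buffer)
    return (
        f"{task_prompt}\n\n"
        f"Partial output validated so far (keep it and continue from it):\n"
        f"{validated}\n"
    )
```

Each iteration would append its best hindsight-identified segment to `subgoal_buffer`, so later rollouts explore only the remainder of the response space instead of regenerating everything at random.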

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Textual Model Predictive Control (TMPC) framework
Contribution: Hindsight Subgoal Identification principle
Contribution: Subgoal-Conditioned Re-Generation principle