PALC: Preference Alignment via Logit Calibration

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: AI alignment, representation editing
Abstract:

Aligning Large Language Models with human preferences typically requires computationally intensive training or complex reward architectures. We introduce PALC (Preference Alignment via Logit Calibration), a parameter-efficient framework that achieves test-time alignment through a novel intervention strategy: direct calibration in vocabulary space. Unlike existing methods that manipulate entangled hidden representations or rely on external reward models, PALC operates at the logit layer where each dimension corresponds to a distinct token, providing interpretable and efficient control. Our approach employs a bottleneck architecture that learns to compress the base model's hidden states and generate position-dependent calibration vectors, requiring only a fraction of the base model's parameters. Through this design, PALC sidesteps the superposition problem inherent in representation engineering while eliminating the computational overhead of guided decoding methods. A single scaling factor enables runtime adjustment of alignment strength without retraining, allowing practitioners to balance between preserving model capabilities and enforcing preferences. Experiments demonstrate that PALC outperforms most test-time alignment methods while maintaining near-baseline inference speed. Our ablations reveal that human preferences concentrate on surprisingly low-dimensional manifolds, validating our architectural choices. By establishing vocabulary-space intervention as an effective alignment paradigm, PALC makes preference alignment accessible for resource-constrained deployments where traditional methods are infeasible, opening new avenues for scalable and adaptive AI alignment.
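The abstract's core mechanism, adding a scaled calibration vector to the base model's logits at decode time, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the `delta` values, and the function names are invented for exposition.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def calibrate_logits(logits, delta, alpha=1.0):
    """Apply a vocabulary-space calibration vector to base-model logits.

    `delta` holds one adjustment per vocabulary token; `alpha` is the
    single scaling factor that tunes alignment strength at runtime.
    """
    return logits + alpha * delta

# Toy vocabulary of 5 tokens; this calibration boosts token 2 and suppresses token 4.
base_logits = np.array([2.0, 1.0, 0.5, 0.0, 1.5])
delta = np.array([0.0, 0.0, 2.0, 0.0, -2.0])

p_base = softmax(base_logits)
p_aligned = softmax(calibrate_logits(base_logits, delta, alpha=1.0))
p_off = softmax(calibrate_logits(base_logits, delta, alpha=0.0))  # alpha=0 recovers the base model
```

Because each logit dimension corresponds to one token, the effect of `delta` is directly interpretable: setting `alpha=0` disables the intervention entirely, with no retraining.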

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PALC, a framework for test-time preference alignment through direct logit-space calibration. It resides in the 'Vocabulary-Space and Logit-Level Interventions' leaf, which contains only two papers including PALC itself. This leaf sits within the broader 'Direct Inference-Time Alignment Methods' branch, indicating a relatively sparse research direction compared to more crowded areas like reward-guided generation or training-time optimization. The taxonomy reveals that logit-level interventions represent an emerging approach rather than a saturated subfield, with most test-time alignment work concentrated in representation-space methods or reward-guided search.

The taxonomy structure shows PALC's leaf neighbors include representation-space interventions and reward-guided generation, both containing multiple papers. Representation-space methods modify hidden activations rather than logits, while reward-guided approaches use external models to steer generation. PALC's positioning suggests it bridges these directions by operating at the vocabulary layer where token probabilities are formed, avoiding the entanglement issues of hidden-state manipulation while maintaining direct control over outputs. The broader 'Direct Inference-Time Alignment Methods' branch encompasses four distinct leaves, indicating multiple parallel approaches to inference-time steering with varying levels of maturity.

Among 28 candidates examined across three contributions, only one refutable pair emerged. The 'vocabulary-space intervention paradigm' contribution examined 10 candidates with zero refutations, suggesting novelty in the core approach. The 'PALC framework with learned calibration vectors' contribution examined 8 candidates, also without refutation. However, 'parameter-efficient test-time alignment with runtime flexibility' examined 10 candidates and found 1 refutable case, indicating some overlap with prior work on efficient inference-time methods. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, but the low refutation rate across most contributions suggests meaningful differentiation from examined prior work.

Based on the limited 28-candidate search, PALC appears to occupy a relatively novel position within test-time alignment research. The sparse population of its taxonomy leaf and low refutation rates suggest the logit-space calibration approach represents a distinct direction. However, the analysis cannot rule out relevant work outside the semantic search scope, particularly in adjacent areas like controllable generation or prompt-based steering that may not have surfaced in this preference-alignment-focused search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: test-time preference alignment of large language models. The field addresses how to steer model outputs toward desired preferences without retraining, and organizes into several major branches. Direct Inference-Time Alignment Methods focus on immediate interventions during generation, including vocabulary-space manipulations and logit-level adjustments that reshape token distributions on the fly. Iterative Refinement and Feedback-Based Alignment encompasses approaches that generate multiple candidates and select or refine them using reward signals or verifiers. Training-Time and Hybrid Alignment Approaches blend offline learning with inference-time adaptation, while Evaluation, Reward Modeling, and Meta-Learning provides the scoring mechanisms and higher-level strategies that guide alignment. Inference Optimization and Computational Efficiency tackles the practical costs of test-time methods, and Specialized Applications extends these techniques to domain-specific settings. Surveys, Frameworks, and Theoretical Foundations offer broader perspectives on personalization and pluralistic preferences.

Within Direct Inference-Time Alignment Methods, vocabulary-space and logit-level interventions represent a particularly active line of work, exploring how to modify token probabilities or hidden representations without iterative search. PALC[0] exemplifies this approach by operating directly on logits to enforce preference constraints at inference time, contrasting with methods like Direct Preference Heads[50], which learn specialized output layers for preference steering. These techniques trade simplicity and low computational overhead against the richer feedback loops of the iterative refinement branches, where works such as DeAL[3] and DiffPO[5] leverage multiple generation rounds and reward models. The central tension across these branches is balancing alignment quality against inference cost, and deciding whether to intervene early in the generation process or rely on post-hoc selection and reranking.

Claimed Contributions

Vocabulary-space intervention as a novel alignment paradigm

The authors introduce vocabulary space (logit space) as a new intervention point for preference alignment, where calibrations are applied to the naturally disentangled logit layer rather than entangled hidden representations. This approach avoids the superposition problem inherent in hidden-state manipulation while maintaining interpretability.

Retrieved candidate papers: 10

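The interpretability claim behind vocabulary-space intervention is that each logit dimension names exactly one token, so any calibration can be read off directly. A small hedged sketch (toy vocabulary and values, invented for illustration) shows how a calibration vector can be inspected token by token, which has no analogue for entangled hidden-state directions:

```python
import numpy as np

# Toy vocabulary and a hypothetical learned calibration vector over it.
vocab = ["the", "kill", "help", "hate", "assist"]
delta = np.array([0.0, -1.8, 1.2, -1.5, 0.9])

# Because dimension i of the logits corresponds to vocab[i], the intervention
# is directly readable: which tokens are promoted, which are suppressed.
promoted = [tok for tok, d in zip(vocab, delta) if d > 0]
suppressed = [tok for tok, d in zip(vocab, delta) if d < 0]

print("promoted:", promoted)      # prints promoted: ['help', 'assist']
print("suppressed:", suppressed)  # prints suppressed: ['kill', 'hate']
```

In hidden space, by contrast, a single direction typically mixes many features (the superposition problem the paper cites), so no such per-dimension reading exists.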
PALC framework with learned logit-space calibration vectors

The authors propose PALC (Preference Alignment via Logit Calibration), a framework that uses a lightweight bottleneck architecture to generate position-specific calibration vectors in vocabulary space. The method processes hidden states as read-only context to produce calibrations without modifying internal representations.

Retrieved candidate papers: 8

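The bottleneck described above, which reads hidden states and emits position-specific vocabulary-space calibrations, can be sketched as follows. The dimensions, nonlinearity, and weight initialization are invented toy choices; the paper's actual layer sizes are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, R, VOCAB = 64, 8, 100  # toy sizes: hidden width, bottleneck rank, vocabulary size

# Hypothetical bottleneck weights: down-project the hidden state, up-project to the vocabulary.
W_down = rng.normal(scale=0.1, size=(D_MODEL, R))
W_up = rng.normal(scale=0.1, size=(R, VOCAB))

def calibration_vector(hidden_state):
    """Map a (read-only) hidden state at one position to a vocabulary-space calibration."""
    z = np.tanh(hidden_state @ W_down)  # compress into the low-dimensional bottleneck
    return z @ W_up                     # expand to one logit adjustment per vocabulary token

h = rng.normal(size=D_MODEL)   # hidden state for the current generation position
delta = calibration_vector(h)  # shape (VOCAB,): added to the base model's logits

# Parameter count: D_MODEL*R + R*VOCAB for the bottleneck, versus D_MODEL*VOCAB
# for a dense map, which is the source of the parameter efficiency.
```

Note that `h` is only read, never written back, matching the paper's claim that internal representations are left unmodified; the low rank `R` also reflects the ablation finding that preferences concentrate on low-dimensional manifolds.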
Parameter-efficient test-time alignment with runtime flexibility

The authors show that PALC achieves effective preference alignment using only 0.13% additional parameters (9.2M for a 7B model) with minimal inference overhead (8% latency increase). A single scaling factor enables runtime adjustment of alignment strength without retraining, balancing capability preservation and preference enforcement.

Retrieved candidate papers: 10 (1 can refute)
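The quoted overhead figure follows directly from the numbers above and can be checked with simple arithmetic:

```python
# Sanity-check the quoted overhead: 9.2M extra parameters on a 7B-parameter base model.
extra_params = 9.2e6
base_params = 7e9
print(f"{extra_params / base_params:.2%}")  # prints 0.13%, matching the paper's figure
```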

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
