PALC: Preference Alignment via Logit Calibration

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: AI alignment, representation editing
Abstract:

Aligning Large Language Models with human preferences typically requires computationally intensive training or complex reward architectures. We introduce PALC (Preference Alignment via Logit Calibration), a parameter-efficient framework that achieves test-time alignment through a novel intervention strategy: direct calibration in vocabulary space. Unlike existing methods that manipulate entangled hidden representations or rely on external reward models, PALC operates at the logit layer where each dimension corresponds to a distinct token, providing interpretable and efficient control. Our approach employs a bottleneck architecture that learns to compress the base model's hidden states and generate position-dependent calibration vectors, requiring only a fraction of the base model's parameters. Through this design, PALC sidesteps the superposition problem inherent in representation engineering while eliminating the computational overhead of guided decoding methods. A single scaling factor enables runtime adjustment of alignment strength without retraining, allowing practitioners to balance between preserving model capabilities and enforcing preferences. Experiments demonstrate that PALC outperforms most test-time alignment methods while maintaining near-baseline inference speed. Our ablations reveal that human preferences concentrate on surprisingly low-dimensional manifolds, validating our architectural choices. By establishing vocabulary-space intervention as an effective alignment paradigm, PALC makes preference alignment accessible for resource-constrained deployments where traditional methods are infeasible, opening new avenues for scalable and adaptive AI alignment.
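The abstract's core mechanism, adding a scaled calibration vector to the base model's logits at decode time, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vocabulary, the `delta` values, and the function names are invented for exposition.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the vocabulary axis.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def calibrate_logits(logits, delta, alpha=1.0):
    """Apply a vocabulary-space calibration vector to base-model logits.

    `delta` holds one adjustment per vocabulary token; `alpha` is the
    single scaling factor that tunes alignment strength at runtime.
    """
    return logits + alpha * delta

# Toy vocabulary of 5 tokens; this calibration boosts token 2 and suppresses token 4.
base_logits = np.array([2.0, 1.0, 0.5, 0.0, 1.5])
delta = np.array([0.0, 0.0, 2.0, 0.0, -2.0])

p_base = softmax(base_logits)
p_aligned = softmax(calibrate_logits(base_logits, delta, alpha=1.0))
p_off = softmax(calibrate_logits(base_logits, delta, alpha=0.0))  # alpha=0 recovers the base model
```

Because each logit dimension corresponds to one token, the effect of `delta` is directly interpretable: setting `alpha=0` disables the intervention entirely, with no retraining.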

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PALC, a framework for test-time preference alignment through direct logit-space calibration. It resides in the 'Vocabulary-Space and Logit-Level Interventions' leaf, which contains only two papers including PALC itself. This leaf sits within the broader 'Direct Inference-Time Alignment Methods' branch, indicating a relatively sparse research direction compared to more crowded areas like reward-guided generation or training-time optimization. The taxonomy reveals that logit-level interventions represent an emerging approach rather than a saturated subfield, with most test-time alignment work concentrated in representation-space methods or reward-guided search.

The taxonomy structure shows PALC's leaf neighbors include representation-space interventions and reward-guided generation, both containing multiple papers. Representation-space methods modify hidden activations rather than logits, while reward-guided approaches use external models to steer generation. PALC's positioning suggests it bridges these directions by operating at the vocabulary layer where token probabilities are formed, avoiding the entanglement issues of hidden-state manipulation while maintaining direct control over outputs. The broader 'Direct Inference-Time Alignment Methods' branch encompasses four distinct leaves, indicating multiple parallel approaches to inference-time steering with varying levels of maturity.

Among 28 candidates examined across three contributions, only one refutable pair emerged. The 'vocabulary-space intervention paradigm' contribution examined 10 candidates with zero refutations, suggesting novelty in the core approach. The 'PALC framework with learned calibration vectors' contribution examined 8 candidates, also without refutation. However, 'parameter-efficient test-time alignment with runtime flexibility' examined 10 candidates and found 1 refutable case, indicating some overlap with prior work on efficient inference-time methods. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage, but the low refutation rate across most contributions suggests meaningful differentiation from examined prior work.

Based on the limited 28-candidate search, PALC appears to occupy a relatively novel position within test-time alignment research. The sparse population of its taxonomy leaf and low refutation rates suggest the logit-space calibration approach represents a distinct direction. However, the analysis cannot rule out relevant work outside the semantic search scope, particularly in adjacent areas like controllable generation or prompt-based steering that may not have surfaced in this preference-alignment-focused search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Paper: 1

Research Landscape Overview

Core task: test-time preference alignment of large language models. The field addresses how to steer model outputs toward desired preferences without retraining, and organizes into several major branches. Direct Inference-Time Alignment Methods focus on immediate interventions during generation, including vocabulary-space manipulations and logit-level adjustments that reshape token distributions on the fly. Iterative Refinement and Feedback-Based Alignment encompasses approaches that generate multiple candidates and select or refine them using reward signals or verifiers. Training-Time and Hybrid Alignment Approaches blend offline learning with inference-time adaptation, while Evaluation, Reward Modeling, and Meta-Learning provides the scoring mechanisms and higher-level strategies that guide alignment. Inference Optimization and Computational Efficiency tackles the practical costs of test-time methods, and Specialized Applications extends these techniques to domain-specific settings. Surveys, Frameworks, and Theoretical Foundations offer broader perspectives on personalization and pluralistic preferences.

Within Direct Inference-Time Alignment Methods, vocabulary-space and logit-level interventions represent a particularly active line of work, exploring how to modify token probabilities or hidden representations without iterative search. PALC[0] exemplifies this approach by operating directly on logits to enforce preference constraints at inference time, contrasting with methods like Direct Preference Heads[50], which learn specialized output layers for preference steering. These techniques trade simplicity and low computational overhead against the richer feedback loops of the iterative refinement branches, where works such as DeAL[3] and DiffPO[5] leverage multiple generation rounds and reward models. The central tension across these branches is balancing alignment quality against inference cost, and deciding whether to intervene early in the generation process or rely on post-hoc selection and reranking.

Claimed Contributions

Vocabulary-space intervention as a novel alignment paradigm

The authors introduce vocabulary space (logit space) as a new intervention point for preference alignment, where calibrations are applied to the naturally disentangled logit layer rather than entangled hidden representations. This approach avoids the superposition problem inherent in hidden-state manipulation while maintaining interpretability.

Retrieved candidate papers: 10

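The interpretability claim behind vocabulary-space intervention is that each logit dimension names exactly one token, so any calibration can be read off directly. A small hedged sketch (toy vocabulary and values, invented for illustration) shows how a calibration vector can be inspected token by token, which has no analogue for entangled hidden-state directions:

```python
import numpy as np

# Toy vocabulary and a hypothetical learned calibration vector over it.
vocab = ["the", "kill", "help", "hate", "assist"]
delta = np.array([0.0, -1.8, 1.2, -1.5, 0.9])

# Because dimension i of the logits corresponds to vocab[i], the intervention
# is directly readable: which tokens are promoted, which are suppressed.
promoted = [tok for tok, d in zip(vocab, delta) if d > 0]
suppressed = [tok for tok, d in zip(vocab, delta) if d < 0]

print("promoted:", promoted)      # prints promoted: ['help', 'assist']
print("suppressed:", suppressed)  # prints suppressed: ['kill', 'hate']
```

In hidden space, by contrast, a single direction typically mixes many features (the superposition problem the paper cites), so no such per-dimension reading exists.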
PALC framework with learned logit-space calibration vectors

The authors propose PALC (Preference Alignment via Logit Calibration), a framework that uses a lightweight bottleneck architecture to generate position-specific calibration vectors in vocabulary space. The method processes hidden states as read-only context to produce calibrations without modifying internal representations.

Retrieved candidate papers: 8

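The bottleneck described above, which reads hidden states and emits position-specific vocabulary-space calibrations, can be sketched as follows. The dimensions, nonlinearity, and weight initialization are invented toy choices; the paper's actual layer sizes are not given here.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, R, VOCAB = 64, 8, 100  # toy sizes: hidden width, bottleneck rank, vocabulary size

# Hypothetical bottleneck weights: down-project the hidden state, up-project to the vocabulary.
W_down = rng.normal(scale=0.1, size=(D_MODEL, R))
W_up = rng.normal(scale=0.1, size=(R, VOCAB))

def calibration_vector(hidden_state):
    """Map a (read-only) hidden state at one position to a vocabulary-space calibration."""
    z = np.tanh(hidden_state @ W_down)  # compress into the low-dimensional bottleneck
    return z @ W_up                     # expand to one logit adjustment per vocabulary token

h = rng.normal(size=D_MODEL)   # hidden state for the current generation position
delta = calibration_vector(h)  # shape (VOCAB,): added to the base model's logits

# Parameter count: D_MODEL*R + R*VOCAB for the bottleneck, versus D_MODEL*VOCAB
# for a dense map, which is the source of the parameter efficiency.
```

Note that `h` is only read, never written back, matching the paper's claim that internal representations are left unmodified; the low rank `R` also reflects the ablation finding that preferences concentrate on low-dimensional manifolds.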
Parameter-efficient test-time alignment with runtime flexibility

The authors show that PALC achieves effective preference alignment using only 0.13% additional parameters (9.2M for a 7B model) with minimal inference overhead (8% latency increase). A single scaling factor enables runtime adjustment of alignment strength without retraining, balancing capability preservation and preference enforcement.

Retrieved candidate papers: 10 (1 can refute)
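The quoted overhead figure follows directly from the numbers above and can be checked with simple arithmetic:

```python
# Sanity-check the quoted overhead: 9.2M extra parameters on a 7B-parameter base model.
extra_params = 9.2e6
base_params = 7e9
print(f"{extra_params / base_params:.2%}")  # prints 0.13%, matching the paper's figure
```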

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
