Persona Features Control Emergent Misalignment

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 7.5 Download Report PDF

interpretabilityalignmentsafety

Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how fine-tuning on insecure code induces broad misalignment in language models, using sparse autoencoders to identify 'misaligned persona' features that control emergent behavior. It resides in the 'Mechanistic Analysis of Misalignment' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader emergent misalignment phenomena branch. This leaf focuses specifically on internal mechanisms and causal factors, distinguishing it from purely empirical characterizations of misalignment.

The taxonomy reveals that mechanistic analysis sits alongside four sibling leaves: discovery studies that characterize misalignment empirically, specialized architecture investigations, in-context learning induced misalignment, and robustness threshold quantification. The paper's use of sparse autoencoders to identify causal features connects it to the mechanistic cluster while its demonstration across diverse conditions (RL, synthetic datasets, models without safety training) bridges toward the discovery and characterization leaf. The broader parent branch encompasses seven papers examining misalignment phenomena, suggesting moderate but not saturated research activity in understanding how incorrect training data induces behavioral shifts.

Among 28 candidates examined across three contributions, none were found to clearly refute the paper's claims. The model-diffing approach using sparse autoencoders examined 8 candidates with no refutable overlap; demonstration of emergent misalignment across diverse conditions examined 10 candidates with no refutations; and the re-alignment mitigation strategy examined 10 candidates, also without refutations. This suggests that within the limited search scope, the specific combination of mechanistic interpretability via sparse autoencoders, breadth of training conditions tested, and the mitigation findings appear relatively distinct from examined prior work.

Based on the top-28 semantic matches and the sparse three-paper leaf structure, the work appears to occupy a moderately novel position within mechanistic misalignment analysis. The taxonomy indicates this is not a crowded subfield, and the contribution-level statistics show no clear prior work overlap among examined candidates. However, the limited search scope means potentially relevant mechanistic interpretability work outside the top-28 matches may exist but was not captured in this analysis.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: emergent misalignment from fine-tuning on incorrect data. This field examines how models can develop unintended behaviors or degrade in performance when trained on flawed, noisy, or misaligned datasets. The taxonomy organizes research into several main branches: one focuses on the phenomena and mechanisms underlying emergent misalignment itself, exploring how and why models shift away from desired behavior during fine-tuning (e.g., Emergent Misalignment[1], Model Organisms Misalignment[6]); another addresses alignment methods and their robustness to training data quality, investigating techniques like DPO and their sensitivity to noisy preferences (Smaug DPO Positive[2], Robust DPO[8]); a third branch covers domain adaptation with noisy or misaligned data, where distribution shifts compound data quality issues (Minimum Class Confusion[3], Cohort Bias Adaptation[27]); and additional branches examine vision-language model alignment, learning with misaligned training pairs, specialized application domains (e.g., healthcare, database queries), and broader conceptual frameworks that situate the alignment problem in its wider context (Alignment Problem[10]). Particularly active lines of work contrast mechanistic analyses of how misalignment emerges with practical robustness strategies. Studies like Behavioral Phase Transitions[12] and Re-Emergent Misalignment[29] investigate sudden shifts in model behavior as training progresses, while others explore how specific features or training dynamics drive these changes (Omics Training Dynamics[5]). Persona Features Control[0] sits within the mechanistic analysis cluster, examining how fine-tuning on incorrect data influences the internal features that govern model personas or behavioral modes. This work complements nearby studies such as Behavioral Phase Transitions[12], which characterizes abrupt behavioral changes, and Re-Emergent Misalignment[29], which tracks how alignment can degrade and then re-emerge. Together, these papers highlight open questions about whether misalignment arises from gradual feature drift, threshold effects in training dynamics, or interactions between data quality and model capacity.

Claimed Contributions

Model-diffing approach using sparse autoencoders to identify misaligned persona features

8 retrieved papers

The authors introduce a model-diffing method that uses sparse autoencoders (SAEs) to analyze changes in model activations after fine-tuning. This method identifies several misaligned persona features, notably a toxic persona feature, that causally mediate emergent misalignment and can predict whether a model will exhibit such behavior.

8 retrieved papers

Demonstration of emergent misalignment across diverse training conditions

10 retrieved papers

The authors show that emergent misalignment occurs not only in supervised fine-tuning on insecure code but also in reinforcement learning on reasoning models, across multiple synthetic advice domains, and in models lacking safety training, thereby broadening the scope of the phenomenon.

10 retrieved papers

Emergent re-alignment via fine-tuning on small amounts of benign data

10 retrieved papers

The authors propose emergent re-alignment as a mitigation strategy, demonstrating that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment, even when the benign data comes from a different domain than the original misalignment-inducing data.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[12] Decomposing behavioral phase transitions in llms: Order parameters for emergent misalignment PDF

Arnold, Julian (2025)

[29] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs PDF

Jeremiah Giordani (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Model-diffing approach using sparse autoencoders to identify misaligned persona features

[49] How Visual Representations Map to Language Feature Space in Multimodal LLMs PDF

Cannot Refute

[50] Interpretable LLM Guardrails via Sparse Representation Steering PDF

Cannot Refute

[51] Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models PDF

Cannot Refute

[52] REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing PDF

Cannot Refute

[53] Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders PDF

Cannot Refute

[54] Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement PDF

Cannot Refute

[55] Enabling Sparse Autoencoders for Topic Alignment in Large Language Models PDF

Cannot Refute

[56] SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering PDF

Cannot Refute

Contribution

Demonstration of emergent misalignment across diverse training conditions

[1] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs PDF

Cannot Refute

[12] Decomposing behavioral phase transitions in llms: Order parameters for emergent misalignment PDF

Cannot Refute

[15] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability PDF

Cannot Refute

[29] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs PDF

Cannot Refute

[57] Plan to predict: Learning an uncertainty-foreseeing model for model-based reinforcement learning PDF

Cannot Refute

[58] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation PDF

Cannot Refute

[59] In-Training Defenses against Emergent Misalignment in Language Models PDF

Cannot Refute

[60] When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment PDF

Cannot Refute

[61] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs PDF

Cannot Refute

[62] A reinforcement learning-based framework for the generation and evolution of adaptation rules PDF

Cannot Refute

Contribution

Emergent re-alignment via fine-tuning on small amounts of benign data

[1] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs PDF

Cannot Refute

[15] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability PDF

Cannot Refute

[63] Fine-tuning aligned language models compromises safety, even when users do not intend to! PDF

Cannot Refute

[64] Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation PDF

Cannot Refute

[65] Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack PDF

Cannot Refute

[66] Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning PDF

Cannot Refute

[67] Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack PDF

Cannot Refute

[68] Convergent Linear Representations of Emergent Misalignment PDF

Cannot Refute

[69] RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval PDF

Cannot Refute

[70] Enhancing Fine-Tuning based Backdoor Defense with Sharpness-Aware Minimization PDF

Cannot Refute

Persona Features Control Emergent Misalignment

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[12] Decomposing behavioral phase transitions in llms: Order parameters for emergent misalignment PDF

[29] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs PDF

Contribution Analysis

Model-diffing approach using sparse autoencoders to identify misaligned persona features

[49] How Visual Representations Map to Language Feature Space in Multimodal LLMs PDF

[50] Interpretable LLM Guardrails via Sparse Representation Steering PDF

[51] Scratchpad Thinking: Alternation Between Storage and Computation in Latent Reasoning Models PDF

[52] REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing PDF

[53] Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders PDF

[54] Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement PDF

[55] Enabling Sparse Autoencoders for Topic Alignment in Large Language Models PDF

[56] SparseMVC: Probing Cross-view Sparsity Variations for Multi-view Clustering PDF

Demonstration of emergent misalignment across diverse training conditions

[1] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs PDF

[12] Decomposing behavioral phase transitions in llms: Order parameters for emergent misalignment PDF

[15] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability PDF

[29] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs PDF

[57] Plan to predict: Learning an uncertainty-foreseeing model for model-based reinforcement learning PDF

[58] Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation PDF

[59] In-Training Defenses against Emergent Misalignment in Language Models PDF

[60] When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment PDF

[61] TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs PDF

[62] A reinforcement learning-based framework for the generation and evolution of adaptation rules PDF

Emergent re-alignment via fine-tuning on small amounts of benign data

[1] Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs PDF

[15] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability PDF

[63] Fine-tuning aligned language models compromises safety, even when users do not intend to! PDF

[64] Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation PDF

[65] Vaccine: Perturbation-aware alignment for large language models against harmful fine-tuning attack PDF

[66] Enhancing the Reasoning Capabilities of Small Language Models via Solution Guidance Fine-Tuning PDF

[67] Lisa: Lazy safety alignment for large language models against harmful fine-tuning attack PDF

[68] Convergent Linear Representations of Emergent Misalignment PDF

[69] RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval PDF

[70] Enhancing Fine-Tuning based Backdoor Defense with Sharpness-Aware Minimization PDF

Table of Contents