Obfuscated Activations Bypass LLM Latent-Space Defenses

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Interpretability, Adversarial Attack, Jailbreaking, Safety
Abstract:

Latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners to detect harmful activations before they lead to undesirable actions. This prompts the question: can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. Our results are nuanced. We show that state-of-the-art latent-space defenses---such as activation probes and latent OOD detection---are vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our obfuscation attacks can reduce monitor recall from 100% down to 0% while still achieving a 90% jailbreaking success rate. However, we also find that certain probe architectures are more robust than others, and we discover the existence of an obfuscation tax: on a complex task (writing SQL code), evading monitors reduces model performance. Together, our results demonstrate white-box monitors are not robust to adversarial attack, while also providing concrete suggestions to alleviate, but not completely fix, this weakness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates adversarial attacks that generate obfuscated activations to evade latent-space monitoring defenses in LLMs. Within the taxonomy, it occupies the 'Obfuscated Activation Attacks' leaf under 'Adversarial Attacks on Latent-Space Defenses', where it is currently the sole paper. This positioning reflects a relatively sparse research direction focused specifically on activation obfuscation, contrasting with the more populated defense mechanisms branch (which contains multiple leaves spanning detection, steering, and adversarial training approaches). The work directly challenges the assumption that harmful behaviors leave detectable traces in hidden states.

The taxonomy reveals substantial activity in neighboring defense mechanisms. The 'Latent-Space Defense Mechanisms' branch contains three major categories: activation-based detection (including hidden state filtering and vision-language jailbreak detection), steering interventions (concept-based safety and controllable frameworks), and adversarial training methods (with four distinct leaves covering refusal robustness, calibration, feature-specific training, and bi-level optimization). The paper's attack methodology directly targets the first category, while the 'Theoretical Foundations' branch provides complementary analysis of safety-relevant subspaces and robustness measurement frameworks. Only one sibling leaf exists in the attacks branch: reasoning-style poisoning, which operates through document retrieval rather than activation manipulation.

Among thirty candidates examined, none clearly refute the three core contributions. The obfuscation attack methodology (ten candidates examined, zero refutable) appears novel in its systematic demonstration that state-of-the-art probes and OOD detectors can be evaded while maintaining high jailbreak success rates. The practical guidance for monitor deployment (ten candidates, zero refutable) and the obfuscation tax discovery (ten candidates, zero refutable) similarly show no substantial prior overlap within the limited search scope. The statistics suggest these contributions occupy relatively unexplored territory, though the modest candidate pool means the search captured top semantic matches rather than exhaustive coverage of all latent-space defense literature.

The analysis reflects a focused but limited literature search rather than comprehensive field coverage. The taxonomy structure indicates this work addresses a recognized gap—attacks specifically targeting latent monitors—in a field where defensive mechanisms have received more attention. The absence of refutable candidates across all contributions, combined with the sparse attack branch, suggests the work explores relatively fresh ground within the examined scope, though broader searches might reveal additional relevant prior work in adjacent areas like adversarial training or general jailbreak techniques.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: adversarial robustness of latent-space monitoring defenses in language models. The field structure reflects a natural division between defensive mechanisms that monitor or manipulate internal representations and adversarial strategies designed to circumvent them. The taxonomy organizes work into latent-space defense mechanisms (including probes, filters, and steering methods such as Refusal Feature Training[3] and Harmfulness Refusal Encoding[7]), adversarial attacks on these defenses (exploring obfuscation and evasion techniques like those in Obfuscated Activations Bypass[0]), theoretical foundations that characterize robustness properties (e.g., Robustness Detectability Privacy[14]), related latent-space applications (such as unlearning and concept editing via Steering Latent Unlearning[2] and Concept Enhancement Engineering[4]), and broader adversarial robustness context (connecting to general attack surfaces like Reasoning Style Poisoning[5]). This structure highlights how defenses that operate on hidden states must contend with attackers who can craft inputs to manipulate those same representations.

A particularly active line of work examines whether latent-space monitors can reliably detect harmful content when adversaries deliberately obfuscate activations. Methods like HiddenDetect[8] and RepreGuard[12] aim to build robust classifiers over internal states, while Latent Adversarial Training[1] and Latent Adversarial Refusal[17] explore hardening these defenses through adversarial learning.

In contrast, Obfuscated Activations Bypass[0] sits squarely within the attack branch, investigating how adversaries can evade monitoring by producing benign-looking latent representations that still yield harmful outputs. This work contrasts with defenses like Monitoring Latent States[11] and LatentGuard[20], which assume that harmful intent leaves detectable traces in hidden layers.
The central tension is whether latent-space defenses can remain effective when attackers optimize directly against the monitoring mechanism, a question that also connects to theoretical characterizations of invariance and robustness such as Invariant Latent Space[15].

Claimed Contributions

Obfuscation attacks that bypass latent-space monitors

The authors develop obfuscation attack methods that jointly optimize for desired model behavior (such as jailbreaking or SQL generation) while fooling latent-space monitoring techniques. These attacks successfully reduce monitor recall from 100% to 0% while maintaining high attack success rates across various defense mechanisms.

10 retrieved papers
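The joint objective behind such attacks can be sketched in toy form: optimize a latent state to satisfy a task objective while driving a harmfulness probe's score toward zero. The sketch below is a minimal numpy illustration under stated assumptions, not the paper's actual method; real attacks optimize prompts or embeddings through a full model, whereas here a latent vector `h` is optimized directly against hypothetical linear stand-ins (`w_probe`, `w_task`).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # toy activation dimension

# Hypothetical stand-ins: a linear harmfulness-probe direction, and a
# direction whose alignment proxies "the model still performs the task".
w_probe = rng.normal(size=d); w_probe /= np.linalg.norm(w_probe)
w_task = rng.normal(size=d);  w_task /= np.linalg.norm(w_task)

def probe_score(h):
    """Probe's P(harmful | h): sigmoid of a linear readout."""
    return 1.0 / (1.0 + np.exp(-h @ w_probe))

def grad(h, lam=5.0):
    """Gradient of the joint loss (1 - h.w_task)^2 + lam * sigmoid(h.w_probe)."""
    p = probe_score(h)
    g_task = -2.0 * (1.0 - h @ w_task) * w_task
    g_evade = p * (1.0 - p) * w_probe   # d/dh sigmoid(h.w) = p(1 - p) w
    return g_task + lam * g_evade

h = rng.normal(size=d)                  # the "obfuscated" latent state
for _ in range(500):
    h -= 0.1 * grad(h)

print(f"probe score: {probe_score(h):.3f}  task alignment: {h @ w_task:.3f}")
```

Because both terms sit in one loss, the optimizer settles on states the probe scores as benign while the task signal stays intact, the toy analogue of monitor recall collapsing while jailbreak success stays high.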
Practical guidance for deploying white-box monitors

Through systematic evaluation of different monitor architectures, the authors identify that certain designs (such as MLP-based and generation-based probes) are more robust than others (such as linear input-based probes), providing actionable recommendations for practitioners implementing latent-space defenses.

10 retrieved papers
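The architectural distinction can be made concrete with a small sketch: a linear probe reads a single activation vector (input-based), while an MLP probe with a ReLU hidden layer is applied to each generated-token activation and aggregated (generation-based). All dimensions, weights, and the mean-pooling choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, hidden, T = 64, 16, 12   # activation dim, MLP width, generated tokens (toy sizes)

def linear_probe(h, w, b=0.0):
    """Linear probe: one logistic unit on a single activation vector."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def mlp_probe(h, W1, b1, w2, b2=0.0):
    """MLP probe: a ReLU hidden layer before the logistic unit."""
    z = np.maximum(h @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))

w = rng.normal(size=d)
W1, b1, w2 = rng.normal(size=(d, hidden)), np.zeros(hidden), rng.normal(size=hidden)

h_prompt = rng.normal(size=d)     # activation at the final prompt token ("input-based")
H_gen = rng.normal(size=(T, d))   # activations across generated tokens

p_input = linear_probe(h_prompt, w)
# "Generation-based": score every generated token's activation, then aggregate.
p_gen = np.mean([mlp_probe(h, W1, b1, w2) for h in H_gen])
print(f"input-based linear: {p_input:.2f}, generation-based MLP: {p_gen:.2f}")
```

The intuition behind the reported robustness gap: a linear probe defines a single hyperplane, so one direction in activation space suffices to cross it, whereas the MLP's nonlinearity and aggregation over many generated tokens give the attacker many more constraints to satisfy at once.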
Discovery of the obfuscation tax phenomenon

The authors demonstrate that while attackers can evade monitors on simple tasks, there is a tradeoff between obfuscation and model performance on complex tasks such as SQL code generation. This finding suggests a fundamental tension between executing sophisticated capabilities and maintaining obfuscated activations.

10 retrieved papers
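The obfuscation tax can be illustrated with a toy model: if task-relevant features overlap with what the probe reads, and the latent state has a bounded budget, then evading the probe necessarily eats into task signal. The numpy sketch below is hypothetical; the norm ball `R` and the correlated directions are illustrative assumptions standing in for the paper's actual SQL-generation setting.

```python
import numpy as np

rng = np.random.default_rng(2)
d, R = 32, 2.0   # toy dimension; R bounds the latent state's "budget"

# Correlated directions: on complex tasks, task-relevant features overlap
# with what the probe reads, so suppressing one suppresses the other.
w_probe = rng.normal(size=d); w_probe /= np.linalg.norm(w_probe)
v = rng.normal(size=d); v -= (v @ w_probe) * w_probe; v /= np.linalg.norm(v)
w_task = 0.6 * w_probe + 0.8 * v   # unit vector, 0.6-correlated with the probe

def optimize(lam, steps=800, lr=0.05):
    """Minimize (1 - h.w_task)^2 + lam * sigmoid(h.w_probe) subject to |h| <= R."""
    h = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-h @ w_probe))
        g = -2.0 * (1.0 - h @ w_task) * w_task + lam * p * (1.0 - p) * w_probe
        h -= lr * g
        n = np.linalg.norm(h)
        if n > R:                      # project back onto the norm ball
            h *= R / n
    return h @ w_task, 1.0 / (1.0 + np.exp(-h @ w_probe))

for lam in (0.0, 2.0, 10.0):
    task, probe = optimize(lam)
    print(f"lambda={lam:>4}: task score {task:.2f}, probe score {probe:.2f}")
```

As the evasion weight `lam` grows, the probe score falls but the task score falls with it: within a fixed budget, the component spent pushing against the probe direction is no longer available for the task, which is the toy analogue of the reported performance drop on SQL generation.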

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Obfuscation attacks that bypass latent-space monitors

The authors develop obfuscation attack methods that jointly optimize for desired model behavior (such as jailbreaking or SQL generation) while fooling latent-space monitoring techniques. These attacks successfully reduce monitor recall from 100% to 0% while maintaining high attack success rates across various defense mechanisms.

Contribution

Practical guidance for deploying white-box monitors

Through systematic evaluation of different monitor architectures, the authors identify that certain designs (such as MLP-based and generation-based probes) are more robust than others (such as linear input-based probes), providing actionable recommendations for practitioners implementing latent-space defenses.

Contribution

Discovery of the obfuscation tax phenomenon

The authors demonstrate that while attackers can evade monitors on simple tasks, there is a tradeoff between obfuscation and model performance on complex tasks such as SQL code generation. This finding suggests a fundamental tension between executing sophisticated capabilities and maintaining obfuscated activations.