Obfuscated Activations Bypass LLM Latent-Space Defenses

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Interpretability, Adversarial Attack, Jailbreaking, Safety
Abstract:

Latent-space monitoring techniques have shown promise as defenses against LLM attacks. These defenses act as scanners to detect harmful activations before they lead to undesirable actions. This prompts the question: can models execute harmful behavior via inconspicuous latent states? Here, we study such obfuscated activations. Our results are nuanced. We show that state-of-the-art latent-space defenses---such as activation probes and latent OOD detection---are vulnerable to obfuscated activations. For example, against probes trained to classify harmfulness, our obfuscation attacks can reduce monitor recall from 100% down to 0% while still achieving a 90% jailbreaking success rate. However, we also find that certain probe architectures are more robust than others, and we discover the existence of an obfuscation tax: on a complex task (writing SQL code), evading monitors reduces model performance. Together, our results demonstrate white-box monitors are not robust to adversarial attack, while also providing concrete suggestions to alleviate, but not completely fix, this weakness.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper investigates adversarial attacks that generate obfuscated activations to evade latent-space monitoring defenses in LLMs. Within the taxonomy, it occupies the 'Obfuscated Activation Attacks' leaf under 'Adversarial Attacks on Latent-Space Defenses', where it is currently the sole paper. This positioning reflects a relatively sparse research direction focused specifically on activation obfuscation, contrasting with the more populated defense mechanisms branch (which contains multiple leaves spanning detection, steering, and adversarial training approaches). The work directly challenges the assumption that harmful behaviors leave detectable traces in hidden states.

The taxonomy reveals substantial activity in neighboring defense mechanisms. The 'Latent-Space Defense Mechanisms' branch contains three major categories: activation-based detection (including hidden state filtering and vision-language jailbreak detection), steering interventions (concept-based safety and controllable frameworks), and adversarial training methods (with four distinct leaves covering refusal robustness, calibration, feature-specific training, and bi-level optimization). The paper's attack methodology directly targets the first category, while the 'Theoretical Foundations' branch provides complementary analysis of safety-relevant subspaces and robustness measurement frameworks. Only one sibling leaf exists in the attacks branch: reasoning-style poisoning, which operates through document retrieval rather than activation manipulation.

Among thirty candidates examined, none clearly refute the three core contributions. The obfuscation attack methodology (ten candidates examined, zero refutable) appears novel in its systematic demonstration that state-of-the-art probes and OOD detectors can be evaded while maintaining high jailbreak success rates. The practical guidance for monitor deployment (ten candidates, zero refutable) and the obfuscation tax discovery (ten candidates, zero refutable) similarly show no substantial prior overlap within the limited search scope. The statistics suggest these contributions occupy relatively unexplored territory, though the modest candidate pool means the search captured top semantic matches rather than exhaustive coverage of all latent-space defense literature.

The analysis reflects a focused but limited literature search rather than comprehensive field coverage. The taxonomy structure indicates this work addresses a recognized gap—attacks specifically targeting latent monitors—in a field where defensive mechanisms have received more attention. The absence of refutable candidates across all contributions, combined with the sparse attack branch, suggests the work explores relatively fresh ground within the examined scope, though broader searches might reveal additional relevant prior work in adjacent areas like adversarial training or general jailbreak techniques.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: adversarial robustness of latent-space monitoring defenses in language models. The field structure reflects a natural division between defensive mechanisms that monitor or manipulate internal representations and adversarial strategies designed to circumvent them. The taxonomy organizes work into latent-space defense mechanisms (including probes, filters, and steering methods such as Refusal Feature Training[3] and Harmfulness Refusal Encoding[7]), adversarial attacks on these defenses (exploring obfuscation and evasion techniques like those in Obfuscated Activations Bypass[0]), theoretical foundations that characterize robustness properties (e.g., Robustness Detectability Privacy[14]), related latent-space applications (such as unlearning and concept editing via Steering Latent Unlearning[2] and Concept Enhancement Engineering[4]), and broader adversarial robustness context (connecting to general attack surfaces like Reasoning Style Poisoning[5]). This structure highlights how defenses that operate on hidden states must contend with attackers who can craft inputs to manipulate those same representations.

A particularly active line of work examines whether latent-space monitors can reliably detect harmful content when adversaries deliberately obfuscate activations. Methods like HiddenDetect[8] and RepreGuard[12] aim to build robust classifiers over internal states, while Latent Adversarial Training[1] and Latent Adversarial Refusal[17] explore hardening these defenses through adversarial learning.

In contrast, Obfuscated Activations Bypass[0] sits squarely within the attack branch, investigating how adversaries can evade monitoring by producing benign-looking latent representations that still yield harmful outputs. This work contrasts with defenses like Monitoring Latent States[11] and LatentGuard[20], which assume that harmful intent leaves detectable traces in hidden layers.
The central tension is whether latent-space defenses can remain effective when attackers optimize directly against the monitoring mechanism, a question that also connects to theoretical characterizations of invariance and robustness such as Invariant Latent Space[15].

Claimed Contributions

Obfuscation attacks that bypass latent-space monitors

The authors develop obfuscation attack methods that jointly optimize for desired model behavior (such as jailbreaking or SQL generation) while fooling latent-space monitoring techniques. These attacks successfully reduce monitor recall from 100% to 0% while maintaining high attack success rates across various defense mechanisms.

10 retrieved papers
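The joint objective behind such attacks can be sketched in toy form: optimize a latent state to satisfy a task objective while driving a harmfulness probe's score toward zero. The sketch below is a minimal numpy illustration under stated assumptions, not the paper's actual method; real attacks optimize prompts or embeddings through a full model, whereas here a latent vector `h` is optimized directly against hypothetical linear stand-ins (`w_probe`, `w_task`).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # toy activation dimension

# Hypothetical stand-ins: a linear harmfulness-probe direction, and a
# direction whose alignment proxies "the model still performs the task".
w_probe = rng.normal(size=d); w_probe /= np.linalg.norm(w_probe)
w_task = rng.normal(size=d);  w_task /= np.linalg.norm(w_task)

def probe_score(h):
    """Probe's P(harmful | h): sigmoid of a linear readout."""
    return 1.0 / (1.0 + np.exp(-h @ w_probe))

def grad(h, lam=5.0):
    """Gradient of the joint loss (1 - h.w_task)^2 + lam * sigmoid(h.w_probe)."""
    p = probe_score(h)
    g_task = -2.0 * (1.0 - h @ w_task) * w_task
    g_evade = p * (1.0 - p) * w_probe   # d/dh sigmoid(h.w) = p(1 - p) w
    return g_task + lam * g_evade

h = rng.normal(size=d)                  # the "obfuscated" latent state
for _ in range(500):
    h -= 0.1 * grad(h)

print(f"probe score: {probe_score(h):.3f}  task alignment: {h @ w_task:.3f}")
```

Because both terms sit in one loss, the optimizer settles on states the probe scores as benign while the task signal stays intact, the toy analogue of monitor recall collapsing while jailbreak success stays high.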
Practical guidance for deploying white-box monitors

Through systematic evaluation of different monitor architectures, the authors identify that certain designs (such as MLP-based and generation-based probes) are more robust than others (such as linear input-based probes), providing actionable recommendations for practitioners implementing latent-space defenses.

10 retrieved papers
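The architectural distinction can be made concrete with a small sketch: a linear probe reads a single activation vector (input-based), while an MLP probe with a ReLU hidden layer is applied to each generated-token activation and aggregated (generation-based). All dimensions, weights, and the mean-pooling choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, hidden, T = 64, 16, 12   # activation dim, MLP width, generated tokens (toy sizes)

def linear_probe(h, w, b=0.0):
    """Linear probe: one logistic unit on a single activation vector."""
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

def mlp_probe(h, W1, b1, w2, b2=0.0):
    """MLP probe: a ReLU hidden layer before the logistic unit."""
    z = np.maximum(h @ W1 + b1, 0.0)
    return 1.0 / (1.0 + np.exp(-(z @ w2 + b2)))

w = rng.normal(size=d)
W1, b1, w2 = rng.normal(size=(d, hidden)), np.zeros(hidden), rng.normal(size=hidden)

h_prompt = rng.normal(size=d)     # activation at the final prompt token ("input-based")
H_gen = rng.normal(size=(T, d))   # activations across generated tokens

p_input = linear_probe(h_prompt, w)
# "Generation-based": score every generated token's activation, then aggregate.
p_gen = np.mean([mlp_probe(h, W1, b1, w2) for h in H_gen])
print(f"input-based linear: {p_input:.2f}, generation-based MLP: {p_gen:.2f}")
```

The intuition behind the reported robustness gap: a linear probe defines a single hyperplane, so one direction in activation space suffices to cross it, whereas the MLP's nonlinearity and aggregation over many generated tokens give the attacker many more constraints to satisfy at once.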
Discovery of the obfuscation tax phenomenon

The authors demonstrate that while attackers can evade monitors on simple tasks, there is a tradeoff between obfuscation and model performance on complex tasks such as SQL code generation. This finding suggests a fundamental tension between executing sophisticated capabilities and maintaining obfuscated activations.

10 retrieved papers
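The obfuscation tax can be illustrated with a toy model: if task-relevant features overlap with what the probe reads, and the latent state has a bounded budget, then evading the probe necessarily eats into task signal. The numpy sketch below is hypothetical; the norm ball `R` and the correlated directions are illustrative assumptions standing in for the paper's actual SQL-generation setting.

```python
import numpy as np

rng = np.random.default_rng(2)
d, R = 32, 2.0   # toy dimension; R bounds the latent state's "budget"

# Correlated directions: on complex tasks, task-relevant features overlap
# with what the probe reads, so suppressing one suppresses the other.
w_probe = rng.normal(size=d); w_probe /= np.linalg.norm(w_probe)
v = rng.normal(size=d); v -= (v @ w_probe) * w_probe; v /= np.linalg.norm(v)
w_task = 0.6 * w_probe + 0.8 * v   # unit vector, 0.6-correlated with the probe

def optimize(lam, steps=800, lr=0.05):
    """Minimize (1 - h.w_task)^2 + lam * sigmoid(h.w_probe) subject to |h| <= R."""
    h = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-h @ w_probe))
        g = -2.0 * (1.0 - h @ w_task) * w_task + lam * p * (1.0 - p) * w_probe
        h -= lr * g
        n = np.linalg.norm(h)
        if n > R:                      # project back onto the norm ball
            h *= R / n
    return h @ w_task, 1.0 / (1.0 + np.exp(-h @ w_probe))

for lam in (0.0, 2.0, 10.0):
    task, probe = optimize(lam)
    print(f"lambda={lam:>4}: task score {task:.2f}, probe score {probe:.2f}")
```

As the evasion weight `lam` grows, the probe score falls but the task score falls with it: within a fixed budget, the component spent pushing against the probe direction is no longer available for the task, which is the toy analogue of the reported performance drop on SQL generation.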

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Obfuscation attacks that bypass latent-space monitors

The authors develop obfuscation attack methods that jointly optimize for desired model behavior (such as jailbreaking or SQL generation) while fooling latent-space monitoring techniques. These attacks successfully reduce monitor recall from 100% to 0% while maintaining high attack success rates across various defense mechanisms.

Contribution

Practical guidance for deploying white-box monitors

Through systematic evaluation of different monitor architectures, the authors identify that certain designs (such as MLP-based and generation-based probes) are more robust than others (such as linear input-based probes), providing actionable recommendations for practitioners implementing latent-space defenses.

Contribution

Discovery of the obfuscation tax phenomenon

The authors demonstrate that while attackers can evade monitors on simple tasks, there is a tradeoff between obfuscation and model performance on complex tasks such as SQL code generation. This finding suggests a fundamental tension between executing sophisticated capabilities and maintaining obfuscated activations.