Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Safety monitoring · Polynomial classifiers · Interpretability
Abstract:

Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can stop early for lightweight monitoring, or evaluate more terms for stronger guardrails when needed. TPCs support two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring cost. On two large-scale safety datasets (WildGuardMix and BeaverTails), across four models with up to 30B parameters, we show that TPCs match or outperform MLP-based probe baselines of the same size, while being more interpretable than their black-box counterparts. Our anonymous code is available at https://anonymous.4open.science/r/tpc-anon-0708.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Truncated Polynomial Classifiers (TPCs) for dynamic safety monitoring of language model activations, enabling cost-adaptive guardrails that scale with input difficulty. Within the taxonomy, it resides in the 'Dynamic and Adaptive Monitoring Frameworks' leaf under 'Activation Monitoring and Classification'. This leaf contains only two papers total (including the original), indicating a relatively sparse research direction. The sibling work focuses on semantic drift detection, whereas this paper emphasizes progressive polynomial evaluation for flexible compute allocation.

The taxonomy reveals that most activation-based safety work clusters in adjacent leaves: 'Safety-Focused Activation Intervention' contains five papers addressing runtime steering and jailbreak mitigation, while 'Safety Neuron Identification' explores causal unit discovery. The original paper diverges from these by avoiding direct activation modification and neuron-level analysis. Instead, it bridges monitoring (classification without intervention) and adaptive frameworks (cost-sensitive detection). Neighboring branches like 'Training-Time Safety Restoration' and 'Input-Level Safeguarding' address orthogonal stages of the safety pipeline, reinforcing that activation monitoring during inference remains a distinct, less-crowded research area.

Among the 26 candidates examined, the contribution-level analysis shows varied novelty signals. The core TPC mechanism (Contribution A: 9 candidates, 0 refutable) and the progressive training scheme (Contribution B: 7 candidates, 0 refutable) appear relatively novel within the limited search scope. However, the dual evaluation modes (safety dial and adaptive cascade; Contribution C: 10 candidates, 1 refutable) encounter at least one overlapping prior work among the examined papers. This suggests that while the polynomial formulation itself is distinctive, the concept of adjustable monitoring intensity has precedent among the top-30 semantic matches.

Given the sparse taxonomy leaf and limited search scope (26 candidates from semantic retrieval), the work appears to occupy a relatively underexplored niche within activation-based safety monitoring. The polynomial progression mechanism offers a novel angle compared to fixed-cost linear probes, though the broader idea of adaptive compute allocation has some prior exploration. A more exhaustive literature search beyond top-K semantic matches would be needed to definitively assess novelty across the entire field of dynamic safety monitoring.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: Dynamic safety monitoring for language model activations. The field addresses how to detect and mitigate unsafe behaviors in large language models by examining their internal representations during inference. The taxonomy reveals five main branches: Activation-Based Safety Detection and Intervention focuses on real-time monitoring and steering of hidden states to identify or correct harmful outputs; Training-Time Safety Restoration and Alignment explores methods that adjust model weights or fine-tune safety mechanisms before deployment; Input-Level and Prompt-Based Safeguarding examines preprocessing and prompt engineering to prevent unsafe queries from reaching the model; Agent Testing and Real-World Deployment Safety considers evaluation frameworks and robustness checks for deployed systems; and Adaptive Contextualization and Memory Mechanisms investigates how models can dynamically adjust safety thresholds based on conversational context.

Works such as Safety Neurons[1] and Neuron Safety Realignment[2] illustrate how researchers pinpoint specific activation patterns tied to safety, while Prompt Safeguarding[4] and Jailbreak Antidote[12] represent input-side defenses. A particularly active line of inquiry centers on whether to intervene at the activation level or during training. Activation-based approaches like Safety Conscious Steering[5] and SafeSwitch[7] offer the advantage of runtime adaptability, enabling models to respond to novel threats without retraining, yet they must balance detection accuracy with computational overhead. In contrast, training-time methods such as Synthetic Gradient Reservoirs[6] and Shape it Up[10] embed safety constraints more deeply but may struggle with emergent jailbreak strategies.

Dynamic Safety Monitoring[0] sits squarely within the activation-monitoring cluster, emphasizing real-time classification and adaptive frameworks that adjust to semantic drift. Compared to static approaches like Safety Neurons[1], which identify fixed neuron sets, Dynamic Safety Monitoring[0] and Dynamic Semantic Drift[3] prioritize continuous recalibration, reflecting the evolving nature of adversarial inputs and the need for context-sensitive intervention.

Claimed Contributions

Truncated Polynomial Classifiers for dynamic safety monitoring

The authors propose TPCs as a method that extends linear probes by modeling higher-order interactions between LLM neurons. TPCs can be trained once and evaluated progressively at test-time by computing only a subset of polynomial terms, enabling flexible safety monitoring that scales with available compute.

9 retrieved papers
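To make the claimed mechanism concrete: a truncated polynomial score is a sum of per-degree terms that can be cut off at any depth, so the same trained classifier yields a family of cheaper truncations. The sketch below is a simplified illustration using diagonal (element-wise) interactions only; the class name, parameterization, and random initialization are assumptions, not the authors' implementation.

```python
import numpy as np

class TruncatedPolyClassifier:
    """Conceptual sketch of a truncated polynomial classifier (TPC).

    Each degree d contributes a term w_d @ (x ** d). Evaluation can stop
    after any degree, yielding a cheaper lower-order classifier.
    (Diagonal interactions only; a simplification for illustration.)
    """

    def __init__(self, dim, max_degree=3, seed=0):
        rng = np.random.default_rng(seed)
        self.bias = 0.0
        # one weight vector per polynomial degree
        self.weights = [rng.normal(scale=0.01, size=dim)
                        for _ in range(max_degree)]

    def score(self, x, up_to_degree=None):
        """Evaluate the polynomial using only the first `up_to_degree` terms."""
        if up_to_degree is None:
            up_to_degree = len(self.weights)
        s = self.bias
        for d, w in enumerate(self.weights[:up_to_degree], start=1):
            s += w @ (x ** d)  # degree-d term; skippable at test time
        return s

clf = TruncatedPolyClassifier(dim=8, max_degree=3)
x = np.ones(8)
cheap = clf.score(x, up_to_degree=1)  # linear-probe cost
full = clf.score(x)                   # full polynomial
```

The point of the design is that `cheap` reuses the same weights as `full`; nothing is retrained to obtain the lightweight monitor.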
Progressive training scheme for nested sub-classifiers

The authors develop a progressive training procedure that optimizes polynomial terms degree-by-degree rather than jointly. This ensures that truncated evaluations at lower degrees remain effective classifiers, enabling dynamic evaluation modes without sacrificing performance at partial depths.

7 retrieved papers
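The degree-by-degree idea described above can be sketched as a residual-style fitting loop: train the degree-1 weights as an ordinary linear probe, freeze their logits, then fit each higher degree on top. This is a hedged reconstruction of the described scheme; the feature map, optimizer, and hyperparameters are illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_progressive(X, y, max_degree=3, lr=0.1, steps=300):
    """Fit polynomial terms one degree at a time: lower-degree weights are
    frozen before the next degree is trained, so every truncation of the
    final classifier is itself a trained classifier."""
    n, dim = X.shape
    weights, frozen_logits = [], np.zeros(n)
    for deg in range(1, max_degree + 1):
        feats = X ** deg            # diagonal degree-`deg` features (simplification)
        w = np.zeros(dim)
        for _ in range(steps):      # plain gradient descent on the logistic loss
            p = sigmoid(frozen_logits + feats @ w)
            w -= lr * (feats.T @ (p - y)) / n
        weights.append(w)
        frozen_logits += feats @ w  # freeze this degree's contribution
    return weights

# toy "activations" whose label depends on the first coordinate
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] > 0).astype(float)
ws = train_progressive(X, y)
```

Because each degree is fit against the frozen logits of the previous ones, stopping after degree k at test time recovers exactly the classifier that was trained at depth k.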
Two complementary evaluation modes for TPCs

The authors introduce two ways to use TPCs: a safety dial mode where developers choose how many terms to evaluate based on desired guardrail strength, and an adaptive cascade mode where inputs exit early after low-order checks if confident, reserving higher-order terms only for ambiguous cases.

10 retrieved papers
Can Refute
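The two modes differ only in how the evaluation depth is chosen: the safety dial fixes the number of terms in advance, while the cascade picks the depth per input. Below is a minimal, self-contained sketch of the cascade's early-exit rule; the margin threshold and the cumulative-logit interface are assumptions for illustration.

```python
def cascade_decision(partial_logits, margin=2.0):
    """Adaptive-cascade sketch: walk the cumulative scores degree by degree
    and exit as soon as the running logit is confidently far from the
    decision boundary. Returns (is_unsafe, degrees_evaluated).

    A fixed-depth "safety dial" at depth k is simply partial_logits[k - 1] > 0.
    """
    for degree, logit in enumerate(partial_logits, start=1):
        if abs(logit) >= margin or degree == len(partial_logits):
            return logit > 0, degree

# clear case: the linear term alone is decisive, so it exits at degree 1
print(cascade_decision([3.1, 3.4, 3.5]))    # (True, 1)
# ambiguous case: low-order scores stay near zero, so all 3 degrees run
print(cascade_decision([0.4, -0.9, -2.2]))  # (False, 3)
```

Under this rule, monitoring cost concentrates on the ambiguous inputs, which is the cost-saving behavior the contribution claims.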

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
