Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Safety monitoring · Polynomial classifiers · Interpretability
Abstract:

Monitoring the activations of large language models (LLMs) is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible: costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term by term. At test time, one can stop early for lightweight monitoring, or evaluate more terms for stronger guardrails when needed. TPCs support two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring cost. On two large-scale safety datasets (WildGuardMix and BeaverTails), across four models with up to 30B parameters, we show that TPCs match or outperform MLP-based probe baselines of the same size, while being more interpretable than their black-box counterparts. Our anonymous code is available at https://anonymous.4open.science/r/tpc-anon-0708.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Truncated Polynomial Classifiers (TPCs) for dynamic safety monitoring of language model activations, enabling cost-adaptive guardrails that scale with input difficulty. Within the taxonomy, it resides in the 'Dynamic and Adaptive Monitoring Frameworks' leaf under 'Activation Monitoring and Classification'. This leaf contains only two papers total (including the original), indicating a relatively sparse research direction. The sibling work focuses on semantic drift detection, whereas this paper emphasizes progressive polynomial evaluation for flexible compute allocation.

The taxonomy reveals that most activation-based safety work clusters in adjacent leaves: 'Safety-Focused Activation Intervention' contains five papers addressing runtime steering and jailbreak mitigation, while 'Safety Neuron Identification' explores causal unit discovery. The original paper diverges from these by avoiding direct activation modification and neuron-level analysis. Instead, it bridges monitoring (classification without intervention) and adaptive frameworks (cost-sensitive detection). Neighboring branches like 'Training-Time Safety Restoration' and 'Input-Level Safeguarding' address orthogonal stages of the safety pipeline, reinforcing that activation monitoring during inference remains a distinct, less-crowded research area.

Among the 26 candidates examined, the contribution-level analysis shows varied novelty signals. The core TPC mechanism (Contribution A: 9 candidates, 0 refutable) and the progressive training scheme (Contribution B: 7 candidates, 0 refutable) appear relatively novel within the limited search scope. However, the dual evaluation modes (safety dial and adaptive cascade; Contribution C: 10 candidates, 1 refutable) encounter at least one overlapping prior work among the examined papers. This suggests that while the polynomial formulation itself is distinctive, the concept of adjustable monitoring intensity has precedent among the top-30 semantic matches.

Given the sparse taxonomy leaf and limited search scope (26 candidates from semantic retrieval), the work appears to occupy a relatively underexplored niche within activation-based safety monitoring. The polynomial progression mechanism offers a novel angle compared to fixed-cost linear probes, though the broader idea of adaptive compute allocation has some prior exploration. A more exhaustive literature search beyond top-K semantic matches would be needed to definitively assess novelty across the entire field of dynamic safety monitoring.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: Dynamic safety monitoring for language model activations. The field addresses how to detect and mitigate unsafe behaviors in large language models by examining their internal representations during inference. The taxonomy reveals five main branches: Activation-Based Safety Detection and Intervention focuses on real-time monitoring and steering of hidden states to identify or correct harmful outputs; Training-Time Safety Restoration and Alignment explores methods that adjust model weights or fine-tune safety mechanisms before deployment; Input-Level and Prompt-Based Safeguarding examines preprocessing and prompt engineering to prevent unsafe queries from reaching the model; Agent Testing and Real-World Deployment Safety considers evaluation frameworks and robustness checks for deployed systems; and Adaptive Contextualization and Memory Mechanisms investigates how models can dynamically adjust safety thresholds based on conversational context.

Works such as Safety Neurons[1] and Neuron Safety Realignment[2] illustrate how researchers pinpoint specific activation patterns tied to safety, while Prompt Safeguarding[4] and Jailbreak Antidote[12] represent input-side defenses. A particularly active line of inquiry centers on whether to intervene at the activation level or during training. Activation-based approaches like Safety Conscious Steering[5] and SafeSwitch[7] offer the advantage of runtime adaptability, enabling models to respond to novel threats without retraining, yet they must balance detection accuracy with computational overhead. In contrast, training-time methods such as Synthetic Gradient Reservoirs[6] and Shape it Up[10] embed safety constraints more deeply but may struggle with emergent jailbreak strategies.

Dynamic Safety Monitoring[0] sits squarely within the activation-monitoring cluster, emphasizing real-time classification and adaptive frameworks that adjust to semantic drift. Compared to static approaches like Safety Neurons[1], which identify fixed neuron sets, Dynamic Safety Monitoring[0] and Dynamic Semantic Drift[3] prioritize continuous recalibration, reflecting the evolving nature of adversarial inputs and the need for context-sensitive intervention.

Claimed Contributions

Truncated Polynomial Classifiers for dynamic safety monitoring

The authors propose TPCs as a method that extends linear probes by modeling higher-order interactions between LLM neurons. TPCs can be trained once and evaluated progressively at test-time by computing only a subset of polynomial terms, enabling flexible safety monitoring that scales with available compute.

9 retrieved papers
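To make the claimed mechanism concrete: a truncated polynomial score is a sum of per-degree terms that can be cut off at any depth, so the same trained classifier yields a family of cheaper truncations. The sketch below is a simplified illustration using diagonal (element-wise) interactions only; the class name, parameterization, and random initialization are assumptions, not the authors' implementation.

```python
import numpy as np

class TruncatedPolyClassifier:
    """Conceptual sketch of a truncated polynomial classifier (TPC).

    Each degree d contributes a term w_d @ (x ** d). Evaluation can stop
    after any degree, yielding a cheaper lower-order classifier.
    (Diagonal interactions only; a simplification for illustration.)
    """

    def __init__(self, dim, max_degree=3, seed=0):
        rng = np.random.default_rng(seed)
        self.bias = 0.0
        # one weight vector per polynomial degree
        self.weights = [rng.normal(scale=0.01, size=dim)
                        for _ in range(max_degree)]

    def score(self, x, up_to_degree=None):
        """Evaluate the polynomial using only the first `up_to_degree` terms."""
        if up_to_degree is None:
            up_to_degree = len(self.weights)
        s = self.bias
        for d, w in enumerate(self.weights[:up_to_degree], start=1):
            s += w @ (x ** d)  # degree-d term; skippable at test time
        return s

clf = TruncatedPolyClassifier(dim=8, max_degree=3)
x = np.ones(8)
cheap = clf.score(x, up_to_degree=1)  # linear-probe cost
full = clf.score(x)                   # full polynomial
```

The point of the design is that `cheap` reuses the same weights as `full`; nothing is retrained to obtain the lightweight monitor.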
Progressive training scheme for nested sub-classifiers

The authors develop a progressive training procedure that optimizes polynomial terms degree-by-degree rather than jointly. This ensures that truncated evaluations at lower degrees remain effective classifiers, enabling dynamic evaluation modes without sacrificing performance at partial depths.

7 retrieved papers
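The degree-by-degree idea described above can be sketched as a residual-style fitting loop: train the degree-1 weights as an ordinary linear probe, freeze their logits, then fit each higher degree on top. This is a hedged reconstruction of the described scheme; the feature map, optimizer, and hyperparameters are illustrative assumptions rather than the paper's actual procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_progressive(X, y, max_degree=3, lr=0.1, steps=300):
    """Fit polynomial terms one degree at a time: lower-degree weights are
    frozen before the next degree is trained, so every truncation of the
    final classifier is itself a trained classifier."""
    n, dim = X.shape
    weights, frozen_logits = [], np.zeros(n)
    for deg in range(1, max_degree + 1):
        feats = X ** deg            # diagonal degree-`deg` features (simplification)
        w = np.zeros(dim)
        for _ in range(steps):      # plain gradient descent on the logistic loss
            p = sigmoid(frozen_logits + feats @ w)
            w -= lr * (feats.T @ (p - y)) / n
        weights.append(w)
        frozen_logits += feats @ w  # freeze this degree's contribution
    return weights

# toy "activations" whose label depends on the first coordinate
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] > 0).astype(float)
ws = train_progressive(X, y)
```

Because each degree is fit against the frozen logits of the previous ones, stopping after degree k at test time recovers exactly the classifier that was trained at depth k.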
Two complementary evaluation modes for TPCs

The authors introduce two ways to use TPCs: a safety dial mode where developers choose how many terms to evaluate based on desired guardrail strength, and an adaptive cascade mode where inputs exit early after low-order checks if confident, reserving higher-order terms only for ambiguous cases.

10 retrieved papers
Can Refute
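The two modes differ only in how the evaluation depth is chosen: the safety dial fixes the number of terms in advance, while the cascade picks the depth per input. Below is a minimal, self-contained sketch of the cascade's early-exit rule; the margin threshold and the cumulative-logit interface are assumptions for illustration.

```python
def cascade_decision(partial_logits, margin=2.0):
    """Adaptive-cascade sketch: walk the cumulative scores degree by degree
    and exit as soon as the running logit is confidently far from the
    decision boundary. Returns (is_unsafe, degrees_evaluated).

    A fixed-depth "safety dial" at depth k is simply partial_logits[k - 1] > 0.
    """
    for degree, logit in enumerate(partial_logits, start=1):
        if abs(logit) >= margin or degree == len(partial_logits):
            return logit > 0, degree

# clear case: the linear term alone is decisive, so it exits at degree 1
print(cascade_decision([3.1, 3.4, 3.5]))    # (True, 1)
# ambiguous case: low-order scores stay near zero, so all 3 degrees run
print(cascade_decision([0.4, -0.9, -2.2]))  # (False, 3)
```

Under this rule, monitoring cost concentrates on the ambiguous inputs, which is the cost-saving behavior the contribution claims.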

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
