The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Persistent Homology, Interpretability, Topological Data Analysis, Representation Geometry, Large Language Models, AI Security, Adversarial Attacks, Sparse Autoencoders
Abstract:

Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry of model representations. This study examines how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic that remains poorly understood. We propose persistent homology (PH) as a tool to measure and interpret the geometry and topology of the representation space when the model is under external adversarial influence. Specifically, we use PH to systematically analyze six state-of-the-art models under two distinct adversarial conditions (indirect prompt injection and backdoor fine-tuning) and uncover a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce "topological compression": the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insight into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuron-level information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a perspective complementary to existing interpretability methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper applies persistent homology to characterize how adversarial inputs alter the topological structure of LLM latent spaces, discovering a 'topological compression' signature across six models under two attack types. It resides in the 'Persistent Homology for Adversarial Signature Detection' leaf, which contains only two papers total (including this one). This represents a sparse research direction within the broader taxonomy of 24 papers across roughly 11 leaf nodes, suggesting the specific intersection of persistent homology, adversarial detection, and language models remains relatively unexplored despite growing interest in topological analysis of neural representations.

The taxonomy reveals neighboring work in related but distinct directions. Sibling leaves within 'Adversarial Attack Detection' include multimodal alignment disruption, geometric explanations of universal attacks, high-dimensional manifold analysis, and graph-LLM vulnerability studies—all addressing adversarial phenomena but through different analytical lenses. A parallel branch, 'Topological Characterization of Language Model Representations,' contains five leaves examining BERT representations, bias detection, embedding manifolds, layer evolution, and structural perturbation sensitivity. The scope notes clarify that the original paper's focus on adversarial-induced topological signatures distinguishes it from general representation analysis (which excludes adversarial contexts) and from defense mechanisms (which emphasize robustness improvements rather than signature detection).

Among 27 candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution (persistent homology for adversarial LLM analysis) examined 7 candidates with 0 refutable; the second (topological compression signature) examined 10 with 0 refutable; the third (neuron-level phase transitions) examined 10 with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work appears to provide substantial overlap with the specific combination of persistent homology, adversarial influence, and topological compression in LLM latent spaces. The statistics indicate a focused literature search rather than exhaustive coverage, appropriate for assessing immediate novelty within examined candidates.

Based on the limited search of 27 candidates and the sparse taxonomy leaf (2 papers), the work appears to occupy a relatively novel position at the intersection of topological data analysis and adversarial LLM interpretability. The absence of refutable candidates across all three contributions suggests the specific methodological approach and empirical findings have not been directly anticipated in the examined literature, though the search scope does not rule out relevant work beyond top-K semantic matches or in adjacent research communities not captured by the taxonomy.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: topological analysis of adversarial influence on language model representations. This emerging field applies tools from topological data analysis, particularly persistent homology, to understand how adversarial perturbations alter the geometric and topological structure of neural representations in language models.

The taxonomy reveals several main branches. One focuses on adversarial attack detection and characterization via topological methods, examining how adversarial examples leave distinctive topological signatures that can be detected through persistent homology or manifold analysis (e.g., Universal Attacks Geometry[2], Adversarial Training Topology[13]). Another branch characterizes the intrinsic topological properties of language model representations themselves, exploring how semantic structure manifests geometrically in embedding spaces (Topological BERT[4], BERTops[12]). Additional branches address topological methods for detecting LLM-generated content (LLM Detection Survey[5]), adversarial defense strategies that correct manifold distortions (Textual Manifold Defense[10], Manifold Purification[20]), foundational surveys of topological data analysis for neural networks (TDA Neural Survey[14], Computational Topology Neural[23]), multimodal and graph-based extensions (Topological Multimodal Adversaries[1], Graph-LLM Robustness[17]), and broader systematic reviews of language model capabilities.

A particularly active line of work centers on using persistent homology to detect adversarial signatures in latent representations, where topological features such as holes or connected components reveal structural anomalies introduced by attacks. Adversarial Persistent Homology[0] sits squarely within this branch, closely aligned with Holes in Latent[16], which similarly examines topological voids in representation spaces as indicators of adversarial influence.
Compared to broader characterization efforts like Topological BERT[4] or defense-oriented approaches such as Textual Manifold Defense[10], the original work emphasizes detection and signature extraction rather than general representation analysis or manifold correction. Key open questions across these branches include how topological invariants scale to high-dimensional language embeddings, whether topological signatures generalize across different attack types, and how to integrate topological insights into practical defense mechanisms.

Claimed Contributions

Application of persistent homology to characterize adversarial influence in LLM latent spaces

The authors introduce persistent homology as a method to analyze how adversarial inputs affect the internal representation spaces of large language models. This topological approach provides a coordinate-free, multi-scale characterization of latent space geometry that is robust to noise and captures nonlinear relational structures.

7 retrieved papers
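To illustrate the kind of computation this contribution relies on, the H0 (connected-component) part of a Vietoris-Rips persistence analysis can be written in a few lines: for H0, every feature is born at scale 0, and the finite death times are exactly the edge weights of a minimum spanning tree over the pairwise distances. This is a minimal stand-in sketch, not the authors' implementation (which would presumably use a full PH library and higher homology dimensions); the toy point cloud below is invented for illustration.

```python
# Minimal sketch (not the authors' code): H0 persistent homology of an
# activation point cloud under the Vietoris-Rips filtration. The finite
# H0 death times equal the minimum-spanning-tree edge weights, computed
# here with Kruskal's algorithm over all pairwise Euclidean distances.
import math
from itertools import combinations

def h0_persistence(points):
    """Return the finite H0 death times in increasing order (births are all 0)."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # merging two components kills one H0 class
            parent[ri] = rj
            deaths.append(d)
    return deaths               # n - 1 finite bars; one class lives forever

# Two tight clusters: three short within-cluster bars, one long bridging bar.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(h0_persistence(cloud))
```

On this toy cloud the barcode contains three short bars (within-cluster merges near 0.1) and one long bar (the inter-cluster gap near 7.0), which is the kind of multi-scale, coordinate-free summary the contribution describes.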
Discovery of topological compression as a consistent signature of adversarial influence

The authors demonstrate that adversarial inputs consistently cause a specific geometric transformation across different models and attack types: the latent space shifts from diverse, compact structures to fewer, more dispersed topological features. This signature holds across architectures ranging from 7B to 70B parameters.

10 retrieved papers
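The "fewer, more dispersed features" claim can be made concrete with simple persistence-diagram statistics. The sketch below uses invented bar coordinates (not the paper's data) and two hypothetical summary statistics, feature count and mean persistence, to show the direction of the reported shift under topological compression.

```python
# Minimal sketch (hypothetical numbers, not the paper's measurements):
# summarizing a persistence diagram by feature count and mean bar length.
def diagram_stats(bars):
    """bars: list of (birth, death) pairs with finite death."""
    lengths = [death - birth for birth, death in bars]
    return {"n_features": len(bars),
            "mean_persistence": sum(lengths) / len(lengths)}

# Clean inputs: many short-lived, small-scale features.
clean = [(0.0, 0.2), (0.0, 0.3), (0.1, 0.3), (0.0, 0.25), (0.1, 0.4)]
# Adversarial inputs: fewer but more dominant, large-scale features.
adversarial = [(0.0, 1.8), (0.1, 2.5)]

s_clean, s_adv = diagram_stats(clean), diagram_stats(adversarial)
# Compression signature: fewer features, each more persistent.
assert s_adv["n_features"] < s_clean["n_features"]
assert s_adv["mean_persistence"] > s_clean["mean_persistence"]
```

The assertions encode the qualitative signature; the actual paper would compare such statistics across layers and models, not on two hand-picked diagrams.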
Novel neuron-level persistent homology analysis revealing phase transitions in information flow

The authors develop a local analysis method that tracks neuron-level information flow between layers using 2D embeddings of activation patterns. This approach reveals how topological complexity evolves differently for clean versus adversarial inputs, showing a phase transition in deeper layers.

10 retrieved papers
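The neuron-level pipeline (embed each neuron's activation pattern into 2D, then track topological complexity layer by layer) can be sketched as follows. The random Gaussian projection and the fixed-scale component count are stand-ins for the paper's embedding and persistence steps, which are not specified here, and all activations are toy data.

```python
# Minimal sketch of a neuron-level pipeline: project per-neuron activation
# vectors to 2D, then count connected components at a fixed distance scale
# as a crude proxy for H0 complexity. Stand-in for the paper's method.
import math
import random

def embed_neurons_2d(activations, seed=0):
    """activations: list of per-neuron activation vectors -> list of 2D points."""
    rng = random.Random(seed)
    dim = len(activations[0])
    proj = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    return [(sum(a * w for a, w in zip(vec, proj[0])),
             sum(a * w for a, w in zip(vec, proj[1]))) for vec in activations]

def n_components(points, scale):
    """Connected components of the graph linking points closer than `scale`."""
    n = len(points)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Track complexity layer by layer (toy activations, for illustration only).
layers = [[[random.Random(layer * 97 + k).gauss(0, 1) for _ in range(8)]
           for k in range(20)] for layer in range(4)]
profile = [n_components(embed_neurons_2d(layer), scale=1.0) for layer in layers]
print(profile)  # one complexity value per layer
```

In the paper's setting, a layerwise profile like this is computed separately for clean and adversarial inputs; the claimed phase transition would appear as the two profiles diverging in deeper layers.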
