The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Persistent Homology, Interpretability, Topological Data Analysis, Representation Geometry, Large Language Models, AI Security, Adversarial Attacks, Sparse Autoencoders
Abstract:

Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry of model representations. This study examines how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic that remains poorly understood. We propose persistent homology (PH) as a tool to measure and interpret the geometry and topology of the representation space when the model is under external adversarial influence. Specifically, we use PH to systematically analyze six state-of-the-art models under two distinct adversarial conditions (indirect prompt injection and backdoor fine-tuning) and uncover a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce "topological compression": the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insight into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuron-level information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a perspective complementary to existing interpretability methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper applies persistent homology to characterize how adversarial inputs alter the topological structure of LLM latent spaces, discovering a 'topological compression' signature across six models under two attack types. It resides in the 'Persistent Homology for Adversarial Signature Detection' leaf, which contains only two papers total (including this one). This represents a sparse research direction within the broader taxonomy of 24 papers across roughly 11 leaf nodes, suggesting the specific intersection of persistent homology, adversarial detection, and language models remains relatively unexplored despite growing interest in topological analysis of neural representations.

The taxonomy reveals neighboring work in related but distinct directions. Sibling leaves within 'Adversarial Attack Detection' include multimodal alignment disruption, geometric explanations of universal attacks, high-dimensional manifold analysis, and graph-LLM vulnerability studies—all addressing adversarial phenomena but through different analytical lenses. A parallel branch, 'Topological Characterization of Language Model Representations,' contains five leaves examining BERT representations, bias detection, embedding manifolds, layer evolution, and structural perturbation sensitivity. The scope notes clarify that the original paper's focus on adversarial-induced topological signatures distinguishes it from general representation analysis (which excludes adversarial contexts) and from defense mechanisms (which emphasize robustness improvements rather than signature detection).

Among 27 candidates examined across three contributions, none were identified as clearly refuting the work. The first contribution (persistent homology for adversarial LLM analysis) examined 7 candidates with 0 refutable; the second (topological compression signature) examined 10 with 0 refutable; the third (neuron-level phase transitions) examined 10 with 0 refutable. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no prior work appears to provide substantial overlap with the specific combination of persistent homology, adversarial influence, and topological compression in LLM latent spaces. The statistics indicate a focused literature search rather than exhaustive coverage, appropriate for assessing immediate novelty within examined candidates.

Based on the limited search of 27 candidates and the sparse taxonomy leaf (2 papers), the work appears to occupy a relatively novel position at the intersection of topological data analysis and adversarial LLM interpretability. The absence of refutable candidates across all three contributions suggests the specific methodological approach and empirical findings have not been directly anticipated in the examined literature, though the search scope does not rule out relevant work beyond top-K semantic matches or in adjacent research communities not captured by the taxonomy.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: topological analysis of adversarial influence on language model representations. This emerging field applies tools from topological data analysis, particularly persistent homology, to understand how adversarial perturbations alter the geometric and topological structure of neural representations in language models.

The taxonomy reveals several main branches. One focuses on adversarial attack detection and characterization via topological methods, examining how adversarial examples leave distinctive topological signatures that can be detected through persistent homology or manifold analysis (e.g., Universal Attacks Geometry[2], Adversarial Training Topology[13]). Another branch characterizes the intrinsic topological properties of language model representations themselves, exploring how semantic structure manifests geometrically in embedding spaces (Topological BERT[4], BERTops[12]). Additional branches address topological methods for detecting LLM-generated content (LLM Detection Survey[5]), adversarial defense strategies that correct manifold distortions (Textual Manifold Defense[10], Manifold Purification[20]), foundational surveys of topological data analysis for neural networks (TDA Neural Survey[14], Computational Topology Neural[23]), multimodal and graph-based extensions (Topological Multimodal Adversaries[1], Graph-LLM Robustness[17]), and broader systematic reviews of language model capabilities.

A particularly active line of work centers on using persistent homology to detect adversarial signatures in latent representations, where topological features such as holes or connected components reveal structural anomalies introduced by attacks. Adversarial Persistent Homology[0] sits squarely within this branch, closely aligned with Holes in Latent[16], which similarly examines topological voids in representation spaces as indicators of adversarial influence.
Compared to broader characterization efforts like Topological BERT[4] or defense-oriented approaches such as Textual Manifold Defense[10], the original work emphasizes detection and signature extraction rather than general representation analysis or manifold correction. Key open questions across these branches include how topological invariants scale to high-dimensional language embeddings, whether topological signatures generalize across different attack types, and how to integrate topological insights into practical defense mechanisms.

Claimed Contributions

Application of persistent homology to characterize adversarial influence in LLM latent spaces

The authors introduce persistent homology as a method to analyze how adversarial inputs affect the internal representation spaces of large language models. This topological approach provides a coordinate-free, multi-scale characterization of latent space geometry that is robust to noise and captures nonlinear relational structures.

7 retrieved papers
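To illustrate the kind of computation this contribution relies on, the H0 (connected-component) part of a Vietoris-Rips persistence analysis can be written in a few lines: for H0, every feature is born at scale 0, and the finite death times are exactly the edge weights of a minimum spanning tree over the pairwise distances. This is a minimal stand-in sketch, not the authors' implementation (which would presumably use a full PH library and higher homology dimensions); the toy point cloud below is invented for illustration.

```python
# Minimal sketch (not the authors' code): H0 persistent homology of an
# activation point cloud under the Vietoris-Rips filtration. The finite
# H0 death times equal the minimum-spanning-tree edge weights, computed
# here with Kruskal's algorithm over all pairwise Euclidean distances.
import math
from itertools import combinations

def h0_persistence(points):
    """Return the finite H0 death times in increasing order (births are all 0)."""
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # merging two components kills one H0 class
            parent[ri] = rj
            deaths.append(d)
    return deaths               # n - 1 finite bars; one class lives forever

# Two tight clusters: three short within-cluster bars, one long bridging bar.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(h0_persistence(cloud))
```

On this toy cloud the barcode contains three short bars (within-cluster merges near 0.1) and one long bar (the inter-cluster gap near 7.0), which is the kind of multi-scale, coordinate-free summary the contribution describes.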
Discovery of topological compression as a consistent signature of adversarial influence

The authors demonstrate that adversarial inputs consistently cause a specific geometric transformation across different models and attack types: the latent space shifts from diverse, compact structures to fewer, more dispersed topological features. This signature holds across architectures ranging from 7B to 70B parameters.

10 retrieved papers
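The "fewer, more dispersed features" claim can be made concrete with simple persistence-diagram statistics. The sketch below uses invented bar coordinates (not the paper's data) and two hypothetical summary statistics, feature count and mean persistence, to show the direction of the reported shift under topological compression.

```python
# Minimal sketch (hypothetical numbers, not the paper's measurements):
# summarizing a persistence diagram by feature count and mean bar length.
def diagram_stats(bars):
    """bars: list of (birth, death) pairs with finite death."""
    lengths = [death - birth for birth, death in bars]
    return {"n_features": len(bars),
            "mean_persistence": sum(lengths) / len(lengths)}

# Clean inputs: many short-lived, small-scale features.
clean = [(0.0, 0.2), (0.0, 0.3), (0.1, 0.3), (0.0, 0.25), (0.1, 0.4)]
# Adversarial inputs: fewer but more dominant, large-scale features.
adversarial = [(0.0, 1.8), (0.1, 2.5)]

s_clean, s_adv = diagram_stats(clean), diagram_stats(adversarial)
# Compression signature: fewer features, each more persistent.
assert s_adv["n_features"] < s_clean["n_features"]
assert s_adv["mean_persistence"] > s_clean["mean_persistence"]
```

The assertions encode the qualitative signature; the actual paper would compare such statistics across layers and models, not on two hand-picked diagrams.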
Novel neuron-level persistent homology analysis revealing phase transitions in information flow

The authors develop a local analysis method that tracks neuron-level information flow between layers using 2D embeddings of activation patterns. This approach reveals how topological complexity evolves differently for clean versus adversarial inputs, showing a phase transition in deeper layers.

10 retrieved papers
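The neuron-level pipeline (embed each neuron's activation pattern into 2D, then track topological complexity layer by layer) can be sketched as follows. The random Gaussian projection and the fixed-scale component count are stand-ins for the paper's embedding and persistence steps, which are not specified here, and all activations are toy data.

```python
# Minimal sketch of a neuron-level pipeline: project per-neuron activation
# vectors to 2D, then count connected components at a fixed distance scale
# as a crude proxy for H0 complexity. Stand-in for the paper's method.
import math
import random

def embed_neurons_2d(activations, seed=0):
    """activations: list of per-neuron activation vectors -> list of 2D points."""
    rng = random.Random(seed)
    dim = len(activations[0])
    proj = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    return [(sum(a * w for a, w in zip(vec, proj[0])),
             sum(a * w for a, w in zip(vec, proj[1]))) for vec in activations]

def n_components(points, scale):
    """Connected components of the graph linking points closer than `scale`."""
    n = len(points)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Track complexity layer by layer (toy activations, for illustration only).
layers = [[[random.Random(layer * 97 + k).gauss(0, 1) for _ in range(8)]
           for k in range(20)] for layer in range(4)]
profile = [n_components(embed_neurons_2d(layer), scale=1.0) for layer in layers]
print(profile)  # one complexity value per layer
```

In the paper's setting, a layerwise profile like this is computed separately for clean and adversarial inputs; the claimed phase transition would appear as the two profiles diverging in deeper layers.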
