Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: automated interpretability, LLM features, structured languages
Abstract:

Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Their inherent structure affords new types of analyses, including quantifying feature complexity across layers and scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regexes help people build accurate mental models of LLM features.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces semantic regexes, a structured language for describing LLM features through compositional primitives and modifiers. It resides in the Structured Feature Representation Frameworks leaf, which contains only this single paper within the broader Feature Evaluation and Validation branch. This positioning reflects a relatively sparse research direction focused on formal, structured alternatives to natural language feature descriptions. While the parent branch addresses feature validation more broadly, this specific leaf represents a novel approach to the representation problem itself rather than evaluation metrics or automated explanation generation.

The taxonomy reveals that most neighboring work pursues different validation strategies. The sibling leaf Automated Feature Description Generation contains papers like Automatically Interpreting Features and Neuron Descriptions that generate natural language explanations at scale, prioritizing automation over structural guarantees. Another sibling, Evaluation Without Explanations, bypasses linguistic descriptions entirely. The broader Feature Extraction branch focuses on discovery methods like sparse autoencoders, while Feature Application explores downstream uses of extracted features. Semantic regexes occupy a distinct niche by providing formal compositional structure for feature descriptions, bridging the gap between automated generation and rigorous representation.

Among twenty-nine candidates examined across three contributions, none clearly refute the core claims. The semantic regex language itself (ten candidates examined, zero refutable) appears novel as a structured formalism combining linguistic and semantic primitives. The primitives and modifiers framework (nine candidates, zero refutable) shows no direct prior work on this specific compositional approach. Model-wide complexity analysis using regex structure (ten candidates, zero refutable) likewise lacks clear precedent. Within this limited search scope, the structured regex formalism appears to be a genuinely unexplored direction, though the analysis cannot rule out relevant work outside the top-K semantic matches or the citation network examined.

Based on this constrained literature search, the work appears to introduce a novel representational framework within a sparsely populated research direction. The absence of sibling papers in its taxonomy leaf and zero refutable candidates across contributions suggest substantive originality, though the twenty-nine-paper scope leaves open the possibility of overlooked related work in formal methods or program synthesis communities outside the core interpretability literature examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Automated interpretability of large language model features. The field has organized itself around several complementary branches that together address how to extract, validate, and apply interpretable features from LLMs. Feature Extraction and Decomposition Methods focus on techniques like sparse autoencoders (Sparse Autoencoders Features[6], Sparse Autoencoders Survey[18]) that decompose neural activations into more interpretable components. Feature Evaluation and Validation develops metrics and frameworks to assess whether extracted features are genuinely meaningful, while Feature Application and Analysis explores how these features can be used for tasks ranging from circuit discovery (Sparse Feature Circuits[13]) to domain-specific applications. Model Component Analysis examines specific architectural elements like attention heads or layers, and Comprehensive Interpretability Frameworks integrate multiple methods into unified systems. Theoretical Foundations provide conceptual grounding, while Interpretability for Model Control and Steering and Explainability and Downstream Applications translate insights into practical interventions and real-world uses.

A particularly active tension exists between automated feature discovery methods and structured validation approaches. Many studies pursue scalable extraction techniques that can handle the vast dimensionality of modern LLMs, yet questions remain about how to systematically verify that discovered features correspond to meaningful semantic concepts. Semantic Regexes[0] sits within the Structured Feature Representation Frameworks cluster, emphasizing formal methods for representing and validating feature semantics—a contrast to purely automated approaches like Automatically Interpreting Features[3] that prioritize scalability over structured guarantees.
This work shares common ground with efforts like SemanticLens[8] and Output-Centric Features[10] that also seek principled ways to characterize what features represent, but differs in its emphasis on regex-like compositional structures for feature descriptions. The broader challenge across these branches remains balancing automation with interpretability rigor.

Claimed Contributions

Semantic regexes: a structured language for LLM feature descriptions

The authors propose semantic regexes, a structured language that describes LLM features by combining primitives (symbols, lexemes, fields) with modifiers (context, composition, quantification) to produce precise and expressive feature descriptions that are more concise and consistent than natural language.

Retrieved candidate papers: 10
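To make the idea concrete, the structured language described above can be sketched as a small abstract syntax tree. Note this is a hypothetical illustration: the class names, the `render` notation, and the example pattern are all invented here, and the paper's actual syntax likely differs.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical AST for a semantic regex, assuming the primitives (symbols,
# lexemes, fields) and modifiers (composition, quantification) named in the
# text. All names and notation are illustrative, not the paper's own.

@dataclass
class Symbol:            # primitive: matches an exact string
    text: str

@dataclass
class Lexeme:            # primitive: matches syntactic variants of a word
    lemma: str

@dataclass
class Field:             # primitive: matches a semantic category
    category: str

@dataclass
class Quantified:        # modifier: one or more repetitions of a pattern
    inner: object

@dataclass
class Sequence:          # modifier: composition of patterns in order
    parts: List[object]

def render(node) -> str:
    """Serialize a pattern into a compact, regex-like description."""
    if isinstance(node, Sequence):
        return " ".join(render(p) for p in node.parts)
    if isinstance(node, Quantified):
        return f"({render(node.inner)})+"
    if isinstance(node, Field):
        return f"[{node.category}]"
    if isinstance(node, Lexeme):
        return f"~{node.lemma}"
    return f'"{node.text}"'  # Symbol

# A feature that fires on one or more color words followed by "car":
pattern = Sequence([Quantified(Field("COLOR")), Lexeme("car")])
print(render(pattern))  # -> ([COLOR])+ ~car
```

The rendered form suggests why such descriptions can be more concise and consistent than free-form natural language: every feature description is drawn from the same small vocabulary of primitives and modifiers.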
Primitives and modifiers for capturing linguistic and semantic patterns

The authors develop a system of human-interpretable primitives (symbols for exact strings, lexemes for syntactic variants, fields for semantic categories) and modifiers (context, composition, quantification) that enable semantic regexes to express diverse feature activation patterns from simple token detectors to complex linguistic phenomena.

Retrieved candidate papers: 9
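A toy matcher illustrates how the three primitive types described above could differ in what they accept. The lexicon and category table are invented for this sketch; the paper's primitives are assumed to behave analogously.

```python
# Invented lookup tables for illustration only.
LEMMAS = {"run": {"run", "runs", "ran", "running"}}   # syntactic variants
FIELDS = {"COLOR": {"red", "green", "blue"}}          # semantic categories

def match_symbol(text: str, token: str) -> bool:
    """A symbol matches an exact string."""
    return token == text

def match_lexeme(lemma: str, token: str) -> bool:
    """A lexeme matches any syntactic variant of a word."""
    return token in LEMMAS.get(lemma, {lemma})

def match_field(category: str, token: str) -> bool:
    """A field matches any token in a semantic category."""
    return token in FIELDS.get(category, set())

assert match_symbol("(", "(")           # exact string only
assert match_lexeme("run", "ran")       # variant of the same lemma
assert match_field("COLOR", "blue")     # member of the category
assert not match_field("COLOR", "car")  # outside the category
```

The progression from `match_symbol` to `match_field` mirrors the claimed range from simple token detectors to more abstract linguistic phenomena.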
Model-wide analysis of feature complexity using semantic regex structure

The authors demonstrate that the inherent structure of semantic regexes enables new types of analyses, such as quantifying feature complexity across model layers by measuring the abstraction level and number of components in semantic regexes, thereby scaling interpretability from individual features to model-wide patterns.

Retrieved candidate papers: 10
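The complexity analysis described above can be sketched as a score over a regex's structure. The abstraction weights and the nested-list encoding below are assumptions made for illustration; the paper's actual metric may combine abstraction level and component count differently.

```python
# Hypothetical complexity score: weight each primitive by abstraction level
# (symbols < lexemes < fields) and sum over all components of the pattern.
ABSTRACTION = {"symbol": 1, "lexeme": 2, "field": 3}

def complexity(regex) -> int:
    """regex: a (kind, value) primitive, or a nested list composing them."""
    if isinstance(regex, list):                        # a composed pattern
        return sum(complexity(part) for part in regex)
    kind, _ = regex
    return ABSTRACTION[kind]

simple = [("symbol", "(")]                             # a token detector
abstract = [("field", "COLOR"), ("lexeme", "car")]     # a more abstract pattern
assert complexity(simple) < complexity(abstract)
```

Aggregating such scores per layer is one way the regex structure could support the layer-wise complexity trends mentioned above, without requiring any manual inspection of individual features.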

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Per-contribution comparison summary (full descriptions appear under Claimed Contributions above):

Contribution 1: Semantic regexes, a structured language for LLM feature descriptions (10 candidate papers examined, none refutable).

Contribution 2: Primitives and modifiers for capturing linguistic and semantic patterns (9 candidate papers examined, none refutable).

Contribution 3: Model-wide analysis of feature complexity using semantic regex structure (10 candidate papers examined, none refutable).