Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: automated interpretability, LLM features, structured languages
Abstract:

Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Their inherent structure affords new types of analyses, including quantifying feature complexity across layers and scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regexes help people build accurate mental models of LLM features.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces semantic regexes, a structured language for describing LLM features through compositional primitives and modifiers. It resides in the Structured Feature Representation Frameworks leaf, which contains only this single paper within the broader Feature Evaluation and Validation branch. This positioning reflects a relatively sparse research direction focused on formal, structured alternatives to natural language feature descriptions. While the parent branch addresses feature validation more broadly, this specific leaf represents a novel approach to the representation problem itself rather than evaluation metrics or automated explanation generation.

The taxonomy reveals that most neighboring work pursues different validation strategies. The sibling leaf Automated Feature Description Generation contains papers like Automatically Interpreting Features and Neuron Descriptions that generate natural language explanations at scale, prioritizing automation over structural guarantees. Another sibling, Evaluation Without Explanations, bypasses linguistic descriptions entirely. The broader Feature Extraction branch focuses on discovery methods like sparse autoencoders, while Feature Application explores downstream uses of extracted features. Semantic regexes occupy a distinct niche by providing formal compositional structure for feature descriptions, bridging the gap between automated generation and rigorous representation.

Among twenty-nine candidates examined across three contributions, none clearly refute the core claims. The semantic regex language itself (ten candidates examined, zero refutable) appears novel as a structured formalism combining linguistic and semantic primitives. The primitives and modifiers framework (nine candidates, zero refutable) shows no direct prior work on this specific compositional approach. Model-wide complexity analysis using regex structure (ten candidates, zero refutable) likewise lacks clear precedent. Within this limited search scope, the structured regex formalism appears to be a genuinely unexplored direction, though the analysis cannot rule out relevant work outside the top-K semantic matches or the citation network examined.

Based on this constrained literature search, the work appears to introduce a novel representational framework within a sparsely populated research direction. The absence of sibling papers in its taxonomy leaf and zero refutable candidates across contributions suggest substantive originality, though the twenty-nine-paper scope leaves open the possibility of overlooked related work in formal methods or program synthesis communities outside the core interpretability literature examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Automated interpretability of large language model features. The field has organized itself around several complementary branches that together address how to extract, validate, and apply interpretable features from LLMs. Feature Extraction and Decomposition Methods focus on techniques like sparse autoencoders (Sparse Autoencoders Features[6], Sparse Autoencoders Survey[18]) that decompose neural activations into more interpretable components. Feature Evaluation and Validation develops metrics and frameworks to assess whether extracted features are genuinely meaningful, while Feature Application and Analysis explores how these features can be used for tasks ranging from circuit discovery (Sparse Feature Circuits[13]) to domain-specific applications. Model Component Analysis examines specific architectural elements like attention heads or layers, and Comprehensive Interpretability Frameworks integrate multiple methods into unified systems. Theoretical Foundations provide conceptual grounding, while Interpretability for Model Control and Steering and Explainability and Downstream Applications translate insights into practical interventions and real-world uses.

A particularly active tension exists between automated feature discovery methods and structured validation approaches. Many studies pursue scalable extraction techniques that can handle the vast dimensionality of modern LLMs, yet questions remain about how to systematically verify that discovered features correspond to meaningful semantic concepts. Semantic Regexes[0] sits within the Structured Feature Representation Frameworks cluster, emphasizing formal methods for representing and validating feature semantics—a contrast to purely automated approaches like Automatically Interpreting Features[3] that prioritize scalability over structured guarantees.
This work shares common ground with efforts like SemanticLens[8] and Output-Centric Features[10] that also seek principled ways to characterize what features represent, but differs in its emphasis on regex-like compositional structures for feature descriptions. The broader challenge across these branches remains balancing automation with interpretability rigor.

Claimed Contributions

Semantic regexes: a structured language for LLM feature descriptions

The authors propose semantic regexes, a structured language that describes LLM features by combining primitives (symbols, lexemes, fields) with modifiers (context, composition, quantification) to produce precise and expressive feature descriptions that are more concise and consistent than natural language.

Retrieved candidate papers: 10
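To make the idea concrete, the structured language described above can be sketched as a small abstract syntax tree. Note this is a hypothetical illustration: the class names, the `render` notation, and the example pattern are all invented here, and the paper's actual syntax likely differs.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical AST for a semantic regex, assuming the primitives (symbols,
# lexemes, fields) and modifiers (composition, quantification) named in the
# text. All names and notation are illustrative, not the paper's own.

@dataclass
class Symbol:            # primitive: matches an exact string
    text: str

@dataclass
class Lexeme:            # primitive: matches syntactic variants of a word
    lemma: str

@dataclass
class Field:             # primitive: matches a semantic category
    category: str

@dataclass
class Quantified:        # modifier: one or more repetitions of a pattern
    inner: object

@dataclass
class Sequence:          # modifier: composition of patterns in order
    parts: List[object]

def render(node) -> str:
    """Serialize a pattern into a compact, regex-like description."""
    if isinstance(node, Sequence):
        return " ".join(render(p) for p in node.parts)
    if isinstance(node, Quantified):
        return f"({render(node.inner)})+"
    if isinstance(node, Field):
        return f"[{node.category}]"
    if isinstance(node, Lexeme):
        return f"~{node.lemma}"
    return f'"{node.text}"'  # Symbol

# A feature that fires on one or more color words followed by "car":
pattern = Sequence([Quantified(Field("COLOR")), Lexeme("car")])
print(render(pattern))  # -> ([COLOR])+ ~car
```

The rendered form suggests why such descriptions can be more concise and consistent than free-form natural language: every feature description is drawn from the same small vocabulary of primitives and modifiers.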
Primitives and modifiers for capturing linguistic and semantic patterns

The authors develop a system of human-interpretable primitives (symbols for exact strings, lexemes for syntactic variants, fields for semantic categories) and modifiers (context, composition, quantification) that enable semantic regexes to express diverse feature activation patterns from simple token detectors to complex linguistic phenomena.

Retrieved candidate papers: 9
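A toy matcher illustrates how the three primitive types described above could differ in what they accept. The lexicon and category table are invented for this sketch; the paper's primitives are assumed to behave analogously.

```python
# Invented lookup tables for illustration only.
LEMMAS = {"run": {"run", "runs", "ran", "running"}}   # syntactic variants
FIELDS = {"COLOR": {"red", "green", "blue"}}          # semantic categories

def match_symbol(text: str, token: str) -> bool:
    """A symbol matches an exact string."""
    return token == text

def match_lexeme(lemma: str, token: str) -> bool:
    """A lexeme matches any syntactic variant of a word."""
    return token in LEMMAS.get(lemma, {lemma})

def match_field(category: str, token: str) -> bool:
    """A field matches any token in a semantic category."""
    return token in FIELDS.get(category, set())

assert match_symbol("(", "(")           # exact string only
assert match_lexeme("run", "ran")       # variant of the same lemma
assert match_field("COLOR", "blue")     # member of the category
assert not match_field("COLOR", "car")  # outside the category
```

The progression from `match_symbol` to `match_field` mirrors the claimed range from simple token detectors to more abstract linguistic phenomena.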
Model-wide analysis of feature complexity using semantic regex structure

The authors demonstrate that the inherent structure of semantic regexes enables new types of analyses, such as quantifying feature complexity across model layers by measuring the abstraction level and number of components in semantic regexes, thereby scaling interpretability from individual features to model-wide patterns.

Retrieved candidate papers: 10
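The complexity analysis described above can be sketched as a score over a regex's structure. The abstraction weights and the nested-list encoding below are assumptions made for illustration; the paper's actual metric may combine abstraction level and component count differently.

```python
# Hypothetical complexity score: weight each primitive by abstraction level
# (symbols < lexemes < fields) and sum over all components of the pattern.
ABSTRACTION = {"symbol": 1, "lexeme": 2, "field": 3}

def complexity(regex) -> int:
    """regex: a (kind, value) primitive, or a nested list composing them."""
    if isinstance(regex, list):                        # a composed pattern
        return sum(complexity(part) for part in regex)
    kind, _ = regex
    return ABSTRACTION[kind]

simple = [("symbol", "(")]                             # a token detector
abstract = [("field", "COLOR"), ("lexeme", "car")]     # a more abstract pattern
assert complexity(simple) < complexity(abstract)
```

Aggregating such scores per layer is one way the regex structure could support the layer-wise complexity trends mentioned above, without requiring any manual inspection of individual features.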

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Per-contribution comparison summary (full descriptions appear under Claimed Contributions above):

Contribution 1: Semantic regexes, a structured language for LLM feature descriptions (10 candidate papers examined, none refutable).

Contribution 2: Primitives and modifiers for capturing linguistic and semantic patterns (9 candidate papers examined, none refutable).

Contribution 3: Model-wide analysis of feature complexity using semantic regex structure (10 candidate papers examined, none refutable).