Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
Overview
Overall Novelty Assessment
The paper introduces semantic regexes, a structured language for describing LLM features through compositional primitives and modifiers. It resides in the Structured Feature Representation Frameworks leaf of the broader Feature Evaluation and Validation branch; that leaf contains only this single paper. This positioning reflects a relatively sparse research direction focused on formal, structured alternatives to natural language feature descriptions. While the parent branch addresses feature validation more broadly, this leaf tackles the representation problem itself rather than evaluation metrics or automated explanation generation.
The taxonomy reveals that most neighboring work pursues different validation strategies. The sibling leaf Automated Feature Description Generation contains papers like Automatically Interpreting Features and Neuron Descriptions that generate natural language explanations at scale, prioritizing automation over structural guarantees. Another sibling, Evaluation Without Explanations, bypasses linguistic descriptions entirely. The broader Feature Extraction branch focuses on discovery methods like sparse autoencoders, while Feature Application explores downstream uses of extracted features. Semantic regexes occupy a distinct niche by providing formal compositional structure for feature descriptions, bridging the gap between automated generation and rigorous representation.
Among the twenty-nine candidates examined across three contributions, none clearly refutes the core claims. The semantic regex language itself (ten candidates examined, zero refutable) appears novel as a structured formalism combining linguistic and semantic primitives. The primitives and modifiers framework (nine candidates, zero refutable) shows no direct prior work on this specific compositional approach. Model-wide complexity analysis using regex structure (ten candidates, zero refutable) likewise lacks clear precedent. Within this limited search scope, the structured regex formalism appears to be a genuinely unexplored direction, though the analysis cannot rule out relevant work outside the top-K semantic matches or the citation network examined.
Based on this constrained literature search, the work appears to introduce a novel representational framework within a sparsely populated research direction. The absence of sibling papers in its taxonomy leaf and zero refutable candidates across contributions suggest substantive originality, though the twenty-nine-paper scope leaves open the possibility of overlooked related work in formal methods or program synthesis communities outside the core interpretability literature examined here.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose semantic regexes, a structured language that describes LLM features by combining primitives (symbols, lexemes, fields) with modifiers (context, composition, quantification) to produce precise and expressive feature descriptions that are more concise and consistent than free-form natural language descriptions.
The authors develop a system of human-interpretable primitives (symbols for exact strings, lexemes for syntactic variants, fields for semantic categories) and modifiers (context, composition, quantification) that enable semantic regexes to express diverse feature activation patterns from simple token detectors to complex linguistic phenomena.
The authors demonstrate that the inherent structure of semantic regexes enables new types of analyses, such as quantifying feature complexity across model layers by measuring the abstraction level and number of components in semantic regexes, thereby scaling interpretability from individual features to model-wide patterns.
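The three contributions can be made concrete with a small sketch of what such a structured description language might look like. The paper's actual grammar is not reproduced in this report, so every name below (`Symbol`, `Lexeme`, `Field`, `Sequence`, `any_match`), the prefix-based stemming, and the toy lexicon are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass, field

# --- Primitives (names and semantics are illustrative assumptions) ---

@dataclass
class Symbol:
    """Matches an exact token string, e.g. '(' or 'the'."""
    text: str
    def matches(self, token: str) -> bool:
        return token == self.text

@dataclass
class Lexeme:
    """Matches syntactic variants of a word (toy stemming via prefix match)."""
    stem: str
    def matches(self, token: str) -> bool:
        return token.lower().startswith(self.stem.lower())

@dataclass
class Field:
    """Matches any token in a semantic category, backed here by a toy lexicon."""
    category: str
    lexicon: dict = field(default_factory=lambda: {
        "month": {"january", "february", "march", "april"},
        "color": {"red", "green", "blue"},
    })
    def matches(self, token: str) -> bool:
        return token.lower() in self.lexicon.get(self.category, set())

# --- Modifiers ---

@dataclass
class Sequence:
    """Composition modifier: parts must match consecutive tokens."""
    parts: list
    def matches_at(self, tokens: list, i: int) -> bool:
        if i + len(self.parts) > len(tokens):
            return False
        return all(p.matches(tokens[i + j]) for j, p in enumerate(self.parts))

def any_match(pattern: Sequence, tokens: list) -> bool:
    """Quantification modifier: does the pattern fire anywhere in the input?"""
    return any(pattern.matches_at(tokens, i) for i in range(len(tokens)))
```

Under these assumptions, a feature that activates on prepositional month mentions could be written as `Sequence([Symbol("in"), Field("month")])`, which `any_match` would fire on for the token list `["we", "met", "in", "January"]` but not for one containing no month token.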
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Semantic regexes: a structured language for LLM feature descriptions
The authors propose semantic regexes, a structured language that describes LLM features by combining primitives (symbols, lexemes, fields) with modifiers (context, composition, quantification) to produce precise and expressive feature descriptions that are more concise and consistent than free-form natural language descriptions.
[71] Natural language descriptions of deep visual features
[72] Foundations of symbolic languages for model interpretability
[73] Causal abstractions of neural networks
[74] Linguistic Interpretability of Transformer-based Language Models: a systematic review
[75] Neurons to Words: A Novel Method for Automated Neural Network Interpretability and Alignment
[76] Local interpretations for explainable natural language processing: A survey
[77] An Interpretable Dynamic Inference System Based on Fuzzy Broad Learning
[78] Enhancing Explainability and Accelerating Materials Science Design with Linguistic Summaries
[79] Weighted automata extraction and explanation of recurrent neural networks for natural language tasks
[80] Improving interpretability of deep neural networks with semantic information
Primitives and modifiers for capturing linguistic and semantic patterns
The authors develop a system of human-interpretable primitives (symbols for exact strings, lexemes for syntactic variants, fields for semantic categories) and modifiers (context, composition, quantification) that enable semantic regexes to express diverse feature activation patterns from simple token detectors to complex linguistic phenomena.
[61] Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations
[62] Structural and semantic features of adjectives across languages and registers
[63] Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation
[64] Learning Structured Natural Language Representations for Semantic Parsing
[65] Structural and semantic features of linguistic units of the English-language linguocultural scenario 'Products'
[66] Dialectal phraseological units of the Yakut language: structure and semantics
[67] A preferential, pattern-seeking, semantics for natural language inference
[69] Semantic construction in feature-based TAG
[70] Using slots and modifiers in logic grammars for natural language
Model-wide analysis of feature complexity using semantic regex structure
The authors demonstrate that the inherent structure of semantic regexes enables new types of analyses, such as quantifying feature complexity across model layers by measuring the abstraction level and number of components in semantic regexes, thereby scaling interpretability from individual features to model-wide patterns.
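As a rough illustration of how regex structure could support such a model-wide analysis, the sketch below assigns each primitive kind a hypothetical abstraction level and scores a description by its component count plus summed abstraction, then averages scores per layer. The weights, the scoring formula, and the per-layer aggregation are assumptions for illustration, not the paper's actual metric:

```python
from collections import defaultdict

# Hypothetical ordering: exact strings are least abstract, semantic fields most.
ABSTRACTION = {"symbol": 1, "lexeme": 2, "field": 3}

def complexity(components: list) -> int:
    """Toy complexity score for one semantic regex.

    `components` is a list of (kind, value) pairs,
    e.g. [("symbol", "in"), ("field", "month")].
    Score = number of components + summed abstraction level.
    """
    return len(components) + sum(ABSTRACTION[kind] for kind, _ in components)

def layerwise_complexity(features) -> dict:
    """Mean complexity per layer, given (layer, components) pairs for each feature."""
    per_layer = defaultdict(list)
    for layer, components in features:
        per_layer[layer].append(complexity(components))
    return {layer: sum(scores) / len(scores) for layer, scores in per_layer.items()}
```

Under this toy metric, a single-symbol token detector scores 2, while a two-component pattern mixing a symbol and a semantic field scores 6, so a layer whose features skew toward fields and longer compositions would show a higher mean score, which is the kind of layer-over-layer trend the contribution describes.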