Universal Approximation with Softmax Attention

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: universal approximation, attention, expressiveness, in-context learning
Abstract:

We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention can approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention, or even one-layer multi-head attention followed by a softmax function, suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating gradient descent in-context. We believe these techniques hold independent interest.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes that two-layer self-attention and one-layer self-attention followed by softmax can universally approximate continuous sequence-to-sequence functions on compact domains. It resides in the 'Universal Approximation Theory for Attention-Based Architectures' leaf, which contains seven papers total. This leaf sits within the broader 'Theoretical Foundations' branch, indicating a moderately populated research direction focused on formal expressiveness guarantees. The taxonomy shows this is a core theoretical area with active inquiry into what minimal architectural components suffice for universal approximation.

The taxonomy reveals neighboring leaves examining computational power and formal language capabilities, computational complexity analysis, and expressive power mechanisms. The paper's focus on approximation guarantees distinguishes it from complexity-oriented work and from studies of Turing-completeness or formal language recognition. Its sibling papers in the same leaf explore related universal approximation questions, suggesting a coherent research thread investigating which transformer components are theoretically necessary. The taxonomy structure indicates this theoretical branch is well-developed but not overcrowded, with clear boundaries separating approximation theory from architectural design and empirical applications.

Among twenty-two candidates examined across three contributions, none were found to clearly refute the paper's claims. The interpolation-based analysis method examined ten candidates with zero refutations; the generalized ReLU approximation result examined two candidates with zero refutations; and the two-layer sufficiency claim examined ten candidates with zero refutations. This limited search scope suggests that within the top semantic matches and citation neighborhood, no prior work directly establishes the same results. However, the modest candidate pool means the analysis cannot rule out relevant prior work outside this search radius.

Given the limited literature search covering twenty-two candidates, the paper appears to occupy a relatively novel position within its theoretical niche. The absence of refutable prior work among examined candidates, combined with the moderately populated taxonomy leaf, suggests the specific combination of techniques and results may be new. However, the search scope leaves open the possibility of related approximation results in adjacent theoretical areas not captured by semantic similarity or immediate citation links.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: universal approximation of attention mechanisms for sequence-to-sequence functions. The field structure reflects a balance between rigorous theoretical inquiry and practical deployment. The taxonomy organizes work into several main branches. Theoretical Foundations of Attention and Transformer Expressiveness examines the representational capacity and computational limits of attention-based architectures, often drawing on universal approximation results (e.g., Transformers Universal Approximators[4], Computational Power Transformers[40]). Architectural Variants and Enhancements explores modifications such as sparse patterns (Sparse Transformers[24]), alternative attention formulations (Sigmoid Self-Attention[33]), and efficiency improvements (Nystromformer[20]). Training Methodologies and Optimization addresses learning dynamics and convergence. Analysis and Interpretability of Attention Mechanisms investigates what attention weights reveal about model behavior (Analyzing Attention[18]). Application Domains spans speech (Speech Transformer[6]), vision (Semantic Segmentation Transformers[1]), forecasting (Solar Power Prediction[35]), and text generation tasks.

A particularly active line of work centers on understanding the expressive power of transformers and their components, with studies probing whether single-layer or simplified architectures suffice for certain function classes (Single-Layer Transformer[31], Transformer Expressive Power[13]) and whether prompting alone can achieve universal approximation (Prompting Universal Approximator[41], Prompt Tuning Limits[42]). Universal Approximation Softmax[0] sits squarely within this theoretical branch, contributing formal guarantees on the capacity of softmax-based attention to approximate sequence-to-sequence mappings. Its emphasis on foundational approximation properties aligns closely with Transformers Universal Approximators[4] and Multi-Head Self-Attention Theory[50], which similarly establish representational bounds, yet it contrasts with works like Hard Attention Transformers[9] or Variational Attention[3] that modify the attention mechanism itself. By anchoring its analysis in classical approximation theory, Universal Approximation Softmax[0] helps clarify which architectural features are essential for expressiveness and which are primarily optimization or efficiency concerns.

Claimed Contributions

Interpolation-based method for analyzing attention's internal mechanism

The authors introduce a novel interpolation selection technique that partitions the target function's output range into uniform anchors, embeds them into attention's key-query-value transformations, and uses softmax to approximate argmax-style selection. This method demonstrates that attention can simulate piecewise linear behavior without relying on auxiliary feed-forward layers.

10 retrieved papers
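The anchor-selection idea can be sketched numerically. Below is a minimal, self-contained illustration (our own simplification, not the paper's construction: it places uniform anchors in the input domain rather than in the target function's output range, and omits the key-query-value embedding). A sharply scaled softmax over negative squared distances acts as a soft argmax, so the output snaps to the value stored at the nearest anchor:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def anchor_select(x, anchors, values, beta=5000.0):
    # Scores peak at the anchor nearest to x; as beta grows, the
    # softmax approaches a hard argmax over the anchor set.
    weights = softmax([-beta * (x - a) ** 2 for a in anchors])
    return sum(w * v for w, v in zip(weights, values))

# Target: f(x) = sin(pi * x) on [0, 1], with n uniform anchors.
f = lambda x: math.sin(math.pi * x)
n = 64
anchors = [i / (n - 1) for i in range(n)]
values = [f(a) for a in anchors]

xs = [i / 499 for i in range(500)]
err = max(abs(anchor_select(x, anchors, values) - f(x)) for x in xs)
print(f"max error with n={n} anchors: {err:.4f}")
```

Shrinking the anchor spacing (larger n) and sharpening the softmax (larger beta) both reduce the error, mirroring the report's description of softmax approximating argmax-style selection over interpolation anchors.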
Single-head and multi-head attention approximate generalized ReLUs

The authors prove that single-head attention approximates n generalized ReLU functions (truncated linear models) with O(1/n) precision, and that H-head attention improves this to O(1/(nH)) precision. This establishes that attention mechanisms can replicate the behavior of known universal approximators like ReLU networks.

2 retrieved papers
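The simplest instance of this claim is already visible with two "tokens": a softmax selection between a slot carrying value x and a slot carrying value 0 yields a smooth approximation of ReLU. The sketch below is our own one-unit illustration under that assumption; the paper's single-head construction and its O(1/n) rate are not reproduced here:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attn_relu(x, beta=200.0):
    # Slot 0 carries value x with score beta*x; slot 1 carries value 0
    # with score 0. The softmax favors the larger score, so the output
    # approaches max(x, 0) as beta grows (it equals x * sigmoid(beta*x)).
    w = softmax([beta * x, 0.0])
    return w[0] * x + w[1] * 0.0

xs = [i / 100 - 1.0 for i in range(201)]  # grid on [-1, 1]
err = max(abs(attn_relu(x) - max(x, 0.0)) for x in xs)
print(f"max deviation from ReLU on [-1, 1]: {err:.5f}")
```

The deviation shrinks like O(1/beta), which is the mechanism by which softmax selection can trade sharpness for precision.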
Two-layer attention suffices for sequence-to-sequence universal approximation

The authors demonstrate that either two stacked attention layers or one attention layer followed by softmax can universally approximate any continuous sequence-to-sequence function on compact domains. This result shows that attention alone provides the core expressiveness without requiring feed-forward networks or deep attention stacks.

10 retrieved papers
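A toy one-dimensional version of this sufficiency claim combines the two ingredients above: softmax-attention units realize a (generalized-)ReLU basis, and a fixed linear readout combines them into a piecewise-linear approximant. This is a hedged sketch under those simplifications, not the paper's two-layer sequence-to-sequence construction:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attn_relu(x, beta=500.0):
    # One softmax-attention unit approximating max(x, 0).
    w = softmax([beta * x, 0.0])
    return w[0] * x

f = lambda x: math.sin(2 * math.pi * x)
m = 50                       # number of breakpoints on [0, 1]
h = 1.0 / m
knots = [i * h for i in range(m)]
slopes = [(f(t + h) - f(t)) / h for t in knots]
coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, m)]

def two_stage(x):
    # Stage 1: attention units provide the ReLU-like basis functions;
    # Stage 2: a linear readout with precomputed slope-change
    # coefficients reconstructs the piecewise-linear interpolant.
    return f(0.0) + sum(c * attn_relu(x - t) for c, t in zip(coeffs, knots))

xs = [i / 400 for i in range(401)]
err = max(abs(two_stage(x) - f(x)) for x in xs)
print(f"max error with {m} attention units: {err:.4f}")
```

More breakpoints and a sharper softmax drive the error down, illustrating why subsuming a known universal approximator (ReLU networks) lets attention-only layers inherit universality.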

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Interpolation-based method for analyzing attention's internal mechanism

Contribution

Single-head and multi-head attention approximate generalized ReLUs

Contribution

Two-layer attention suffices for sequence-to-sequence universal approximation