Universal Approximation with Softmax Attention

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: universal approximation, attention, expressiveness, in-context learning
Abstract:

We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention can approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these results, we show that two-layer multi-head attention, or even one-layer multi-head attention followed by a softmax function, suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating gradient descent in-context. We believe these techniques hold independent interest.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes that two-layer self-attention and one-layer self-attention followed by softmax can universally approximate continuous sequence-to-sequence functions on compact domains. It resides in the 'Universal Approximation Theory for Attention-Based Architectures' leaf, which contains seven papers total. This leaf sits within the broader 'Theoretical Foundations' branch, indicating a moderately populated research direction focused on formal expressiveness guarantees. The taxonomy shows this is a core theoretical area with active inquiry into what minimal architectural components suffice for universal approximation.

The taxonomy reveals neighboring leaves examining computational power and formal language capabilities, computational complexity analysis, and expressive power mechanisms. The paper's focus on approximation guarantees distinguishes it from complexity-oriented work and from studies of Turing-completeness or formal language recognition. Its sibling papers in the same leaf explore related universal approximation questions, suggesting a coherent research thread investigating which transformer components are theoretically necessary. The taxonomy structure indicates this theoretical branch is well-developed but not overcrowded, with clear boundaries separating approximation theory from architectural design and empirical applications.

Among twenty-two candidates examined across three contributions, none were found to clearly refute the paper's claims. The interpolation-based analysis method examined ten candidates with zero refutations; the generalized ReLU approximation result examined two candidates with zero refutations; and the two-layer sufficiency claim examined ten candidates with zero refutations. This limited search scope suggests that within the top semantic matches and citation neighborhood, no prior work directly establishes the same results. However, the modest candidate pool means the analysis cannot rule out relevant prior work outside this search radius.

Given the limited literature search covering twenty-two candidates, the paper appears to occupy a relatively novel position within its theoretical niche. The absence of refutable prior work among examined candidates, combined with the moderately populated taxonomy leaf, suggests the specific combination of techniques and results may be new. However, the search scope leaves open the possibility of related approximation results in adjacent theoretical areas not captured by semantic similarity or immediate citation links.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: universal approximation of attention mechanisms for sequence-to-sequence functions. The field structure reflects a balance between rigorous theoretical inquiry and practical deployment. The taxonomy organizes work into several main branches. Theoretical Foundations of Attention and Transformer Expressiveness examines the representational capacity and computational limits of attention-based architectures, often drawing on universal approximation results (e.g., Transformers Universal Approximators[4], Computational Power Transformers[40]). Architectural Variants and Enhancements explores modifications such as sparse patterns (Sparse Transformers[24]), alternative attention formulations (Sigmoid Self-Attention[33]), and efficiency improvements (Nystromformer[20]). Training Methodologies and Optimization addresses learning dynamics and convergence. Analysis and Interpretability of Attention Mechanisms investigates what attention weights reveal about model behavior (Analyzing Attention[18]). Application Domains spans speech (Speech Transformer[6]), vision (Semantic Segmentation Transformers[1]), forecasting (Solar Power Prediction[35]), and text generation tasks.

A particularly active line of work centers on understanding the expressive power of transformers and their components, with studies probing whether single-layer or simplified architectures suffice for certain function classes (Single-Layer Transformer[31], Transformer Expressive Power[13]) and whether prompting alone can achieve universal approximation (Prompting Universal Approximator[41], Prompt Tuning Limits[42]). Universal Approximation Softmax[0] sits squarely within this theoretical branch, contributing formal guarantees on the capacity of softmax-based attention to approximate sequence-to-sequence mappings. Its emphasis on foundational approximation properties aligns closely with Transformers Universal Approximators[4] and Multi-Head Self-Attention Theory[50], which similarly establish representational bounds, yet it contrasts with works like Hard Attention Transformers[9] or Variational Attention[3] that modify the attention mechanism itself. By anchoring its analysis in classical approximation theory, Universal Approximation Softmax[0] helps clarify which architectural features are essential for expressiveness and which are primarily optimization or efficiency concerns.

Claimed Contributions

Interpolation-based method for analyzing attention's internal mechanism

The authors introduce a novel interpolation selection technique that partitions the target function's output range into uniform anchors, embeds them into attention's key-query-value transformations, and uses softmax to approximate argmax-style selection. This method demonstrates that attention can simulate piecewise linear behavior without relying on auxiliary feed-forward layers.

10 retrieved papers
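The anchor-selection idea can be sketched numerically. Below is a minimal, self-contained illustration (our own simplification, not the paper's construction: it places uniform anchors in the input domain rather than in the target function's output range, and omits the key-query-value embedding). A sharply scaled softmax over negative squared distances acts as a soft argmax, so the output snaps to the value stored at the nearest anchor:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def anchor_select(x, anchors, values, beta=5000.0):
    # Scores peak at the anchor nearest to x; as beta grows, the
    # softmax approaches a hard argmax over the anchor set.
    weights = softmax([-beta * (x - a) ** 2 for a in anchors])
    return sum(w * v for w, v in zip(weights, values))

# Target: f(x) = sin(pi * x) on [0, 1], with n uniform anchors.
f = lambda x: math.sin(math.pi * x)
n = 64
anchors = [i / (n - 1) for i in range(n)]
values = [f(a) for a in anchors]

xs = [i / 499 for i in range(500)]
err = max(abs(anchor_select(x, anchors, values) - f(x)) for x in xs)
print(f"max error with n={n} anchors: {err:.4f}")
```

Shrinking the anchor spacing (larger n) and sharpening the softmax (larger beta) both reduce the error, mirroring the report's description of softmax approximating argmax-style selection over interpolation anchors.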
Single-head and multi-head attention approximate generalized ReLUs

The authors prove that single-head attention approximates n generalized ReLU functions (truncated linear models) with O(1/n) precision, and that H-head attention improves this to O(1/(nH)) precision. This establishes that attention mechanisms can replicate the behavior of known universal approximators like ReLU networks.

2 retrieved papers
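The simplest instance of this claim is already visible with two "tokens": a softmax selection between a slot carrying value x and a slot carrying value 0 yields a smooth approximation of ReLU. The sketch below is our own one-unit illustration under that assumption; the paper's single-head construction and its O(1/n) rate are not reproduced here:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attn_relu(x, beta=200.0):
    # Slot 0 carries value x with score beta*x; slot 1 carries value 0
    # with score 0. The softmax favors the larger score, so the output
    # approaches max(x, 0) as beta grows (it equals x * sigmoid(beta*x)).
    w = softmax([beta * x, 0.0])
    return w[0] * x + w[1] * 0.0

xs = [i / 100 - 1.0 for i in range(201)]  # grid on [-1, 1]
err = max(abs(attn_relu(x) - max(x, 0.0)) for x in xs)
print(f"max deviation from ReLU on [-1, 1]: {err:.5f}")
```

The deviation shrinks like O(1/beta), which is the mechanism by which softmax selection can trade sharpness for precision.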
Two-layer attention suffices for sequence-to-sequence universal approximation

The authors demonstrate that either two stacked attention layers or one attention layer followed by softmax can universally approximate any continuous sequence-to-sequence function on compact domains. This result shows that attention alone provides the core expressiveness without requiring feed-forward networks or deep attention stacks.

10 retrieved papers
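A toy one-dimensional version of this sufficiency claim combines the two ingredients above: softmax-attention units realize a (generalized-)ReLU basis, and a fixed linear readout combines them into a piecewise-linear approximant. This is a hedged sketch under those simplifications, not the paper's two-layer sequence-to-sequence construction:

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attn_relu(x, beta=500.0):
    # One softmax-attention unit approximating max(x, 0).
    w = softmax([beta * x, 0.0])
    return w[0] * x

f = lambda x: math.sin(2 * math.pi * x)
m = 50                       # number of breakpoints on [0, 1]
h = 1.0 / m
knots = [i * h for i in range(m)]
slopes = [(f(t + h) - f(t)) / h for t in knots]
coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, m)]

def two_stage(x):
    # Stage 1: attention units provide the ReLU-like basis functions;
    # Stage 2: a linear readout with precomputed slope-change
    # coefficients reconstructs the piecewise-linear interpolant.
    return f(0.0) + sum(c * attn_relu(x - t) for c, t in zip(coeffs, knots))

xs = [i / 400 for i in range(401)]
err = max(abs(two_stage(x) - f(x)) for x in xs)
print(f"max error with {m} attention units: {err:.4f}")
```

More breakpoints and a sharper softmax drive the error down, illustrating why subsuming a known universal approximator (ReLU networks) lets attention-only layers inherit universality.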

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Interpolation-based method for analyzing attention's internal mechanism

Contribution

Single-head and multi-head attention approximate generalized ReLUs

Contribution

Two-layer attention suffices for sequence-to-sequence universal approximation