Adversarially Pretrained Transformers may be Universally Robust In-Context Learners

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Adversarial Robustness, Transformer, In-Context Learning
Abstract:

Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models: models that robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without additional adversarial training or adversarial examples). This universal robustness stems from the model's ability to adaptively focus on robust features within a given task. We also identify two open challenges in attaining this robustness: an accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models: while their training is expensive, the investment would prove worthwhile because downstream tasks enjoy adversarial robustness for free.
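The abstract's setting, a single-layer linear transformer performing in-context classification from clean demonstrations, can be sketched in a toy form. The snippet below is a hypothetical illustration, not the paper's construction: it uses an identity matrix `W` as a stand-in for the pretrained attention weights and a label-weighted attention vote as the in-context predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of in-context learning with a single-layer linear-attention
# model (illustrative only). The model predicts a query label as an
# attention-weighted vote over clean demonstrations:
#   f(x_q) = sign( sum_i y_i * (x_i @ W @ x_q) )

d, n = 8, 64                       # feature dimension, number of demonstrations
w_star = rng.standard_normal(d)    # unseen task: labels y = sign(<w*, x>)

X = rng.standard_normal((n, d))    # clean demonstrations (no adversarial examples)
y = np.sign(X @ w_star)

W = np.eye(d)                      # stand-in for pretrained attention weights

def predict(x_query):
    scores = X @ W @ x_query       # attention logits between demos and the query
    return np.sign(y @ scores)     # label-weighted vote

x_q = rng.standard_normal(d)
assert predict(x_q) in (-1.0, 1.0)
```

With `W = I`, the vote reduces to `sign(x_q @ X.T @ y)`, i.e., a one-step estimate of the task direction from the demonstrations alone; the pretrained `W` in the paper's analysis plays the role of reweighting which feature directions that estimate trusts.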

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes that adversarially pretrained transformers can serve as universally robust foundation models, enabling robust adaptation to downstream tasks through in-context learning without additional adversarial training. It resides in the 'Adversarial Robustness Theory' leaf under 'Theoretical Foundations of In-Context Learning', a leaf containing five papers in total. This leaf focuses specifically on theoretical analysis of robustness mechanisms and defense properties in in-context learning, representing a moderately populated research direction within a taxonomy of forty papers across the broader field of adversarial robustness in pretrained transformers.

The paper's leaf sits alongside 'Learning Dynamics and Generalization Theory', which examines non-adversarial properties of in-context learning such as algorithm implementation and distributional generalization. Neighboring branches include 'Defense Mechanisms and Robustness Enhancement', particularly 'Adversarial Training and Pretraining Strategies', which contains four papers on training-time defenses. The taxonomy's scope note clarifies that this leaf focuses on theoretical analysis rather than empirical evaluation or attack methods, positioning the work at the intersection of foundational theory and proactive defense design through pretraining strategies.

Among thirty candidates examined, contribution analysis reveals mixed novelty signals. The core theoretical analysis of universally robust pretrained transformers examined ten candidates with zero refutations, suggesting this framing may be relatively unexplored. However, the condition for robust adaptation based on robust versus non-robust features examined ten candidates and found one refutable match, indicating some overlap with existing frameworks. The identification of accuracy-robustness trade-offs and sample complexity challenges examined ten candidates with no refutations, though these are well-known phenomena in adversarial learning. The limited search scope means substantial relevant work may exist outside the top-thirty semantic matches examined.

Based on the examined literature, the universal robustness framing for pretrained transformers appears less explored than the underlying trade-offs and feature frameworks. The analysis covers top-thirty semantic matches plus citation expansion, providing reasonable coverage of closely related theoretical work but not exhaustive field-wide search. The taxonomy structure suggests this theoretical robustness direction, while moderately populated, remains less saturated than empirical attack-defense cycles or domain-specific applications.

Taxonomy

- Core-task Taxonomy Papers: 40
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: adversarial robustness of pretrained transformers through in-context learning. The field has organized itself around six main branches that collectively address how transformers learn from demonstrations and how that learning can be attacked or defended.

Theoretical Foundations of In-Context Learning explores the mathematical underpinnings, examining how transformers implement algorithms like linear regression (Transformers Learn Linear Models[1]) and the role of architectural choices such as positional encodings (Positional Encoding Complexity[23]). Adversarial Attacks on In-Context Learning investigates vulnerabilities ranging from context hijacking (Context Hijacking Robustness[6], Hijacking via Adversarial ICL[8]) to data poisoning (Data Poisoning ICL[4]) and retrieval manipulation (Neural Ranking Attacks[11]). Defense Mechanisms and Robustness Enhancement develops protective strategies, including robust retrieval methods (Robust Retrieval Augmented Learning[2], Safeguarding Retrieval ICL[12]) and specialized shields (ICLShield[16]). Empirical Robustness Evaluation systematically tests model behavior under adversarial conditions (Retrieval Robustness Evaluation[5]), while Domain-Specific Applications adapts these insights to translation (Robust Translation ICL[13]), reinforcement learning (Robust ICL Reinforcement[26]), and other tasks. Reliability and Trustworthiness Frameworks addresses broader concerns about securing foundation models (Securing Foundation Models[24]) and ensuring safe deployment.

A particularly active tension exists between understanding in-context learning as implicit optimization and studying its failure modes under adversarial pressure. Works like Adversarial Robustness Linear Regression[3] and Linear Models Adversarial Lens[9] bridge theory and robustness by analyzing how adversarial perturbations affect the linear models that transformers approximate during in-context learning.
Adversarially Pretrained Transformers[0] sits squarely within this theoretical robustness cluster, examining how pretraining strategies can build inherent resilience into the learning process itself. Compared to Adversarial Robustness Linear Regression[3], which focuses on the mathematical properties of robust regression in the ICL setting, the original work emphasizes pretraining as a proactive defense mechanism. Meanwhile, Linear Models Adversarial Lens[9] provides complementary analysis of how adversarial examples interact with the implicit models learned in-context, offering a diagnostic perspective that complements the constructive approach of adversarial pretraining.

Claimed Contributions

Theoretical analysis of universally robust adversarially pretrained transformers

The authors provide the first theoretical analysis suggesting that single-layer linear transformers, after adversarial pretraining on multiple classification tasks, can robustly generalize to unseen tasks through in-context learning from clean demonstrations alone, without requiring additional adversarial training or examples.

10 retrieved papers
Condition for robust adaptation based on robust and non-robust features framework

The authors derive theoretical conditions under which adversarially pretrained transformers achieve universal robustness by demonstrating that these models adaptively prioritize robust features over non-robust features in downstream tasks, using the conceptual framework of robust versus non-robust features.

10 retrieved papers
Can Refute
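The feature dichotomy underlying this contribution can be made concrete with a toy linear example. Everything below is an assumed, simplified setup (the feature margins, the l-infinity attack model, and the two hand-set classifiers are all illustrative, not the paper's formal construction): a non-robust feature is perfectly predictive on clean data but flips under a small worst-case perturbation, while a robust feature's margin exceeds the perturbation budget.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy illustration of robust vs. non-robust features.
# Feature 0 (robust):  large margin, survives eps-perturbations.
# Feature 1 (fragile): perfectly predictive on clean data, margin < eps.

n, eps = 500, 0.5
y = rng.choice([-1.0, 1.0], size=n)
robust = 2.0 * y + 0.3 * rng.standard_normal(n)   # margin ~2  > eps
fragile = 0.1 * y                                  # margin 0.1 < eps
X = np.stack([robust, fragile], axis=1)

def accuracy(w, X, y, eps=0.0):
    # Worst-case l-infinity attack on a linear classifier:
    # each feature shifts by eps against the label, costing eps * ||w||_1.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return (margins > 0).mean()

w_fragile = np.array([0.0, 1.0])   # relies on the non-robust feature
w_robust = np.array([1.0, 0.0])    # relies on the robust feature

assert accuracy(w_fragile, X, y) == 1.0        # perfect clean accuracy...
assert accuracy(w_fragile, X, y, eps) == 0.0   # ...destroyed by the eps-attack
assert accuracy(w_robust, X, y, eps) > 0.95    # robust feature survives
```

The contribution's claimed condition is, in this picture, that adversarial pretraining teaches the attention weights to put their mass on the first kind of feature in whatever downstream task the demonstrations describe.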
Identification of accuracy-robustness trade-off and sample-hungry in-context learning as open problems

The authors formally show that adversarially pretrained single-layer linear transformers exhibit two persistent challenges: lower clean accuracy compared to standard models and the requirement for more in-context demonstrations to achieve comparable performance.

10 retrieved papers
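The first of these challenges, the accuracy-robustness trade-off, can likewise be illustrated in a toy linear setup (all numbers and features below are illustrative assumptions, not the paper's transformer analysis): a predictor that leans on a noisy robust feature gives up clean accuracy relative to one that exploits a fragile but perfectly predictive feature, yet fares far better under attack.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy illustration of the accuracy-robustness trade-off.
# The fragile feature yields perfect clean accuracy; the robust feature
# is noisy, so a robust predictor sacrifices some clean accuracy.

n, eps = 2000, 0.5
y = rng.choice([-1.0, 1.0], size=n)
robust = 0.8 * y + rng.standard_normal(n)   # noisy, but margin can exceed eps
fragile = 0.1 * y                            # clean-perfect, margin < eps
X = np.stack([robust, fragile], axis=1)

def accuracy(w, eps=0.0):
    # Worst-case l-infinity attack costs eps * ||w||_1 of margin.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return (margins > 0).mean()

clean_std = accuracy(np.array([0.0, 1.0]))        # standard model, clean
clean_rob = accuracy(np.array([1.0, 0.0]))        # robust model, clean
adv_std = accuracy(np.array([0.0, 1.0]), eps)     # standard model, attacked
adv_rob = accuracy(np.array([1.0, 0.0]), eps)     # robust model, attacked

assert clean_std > clean_rob   # trade-off: robust model loses clean accuracy
assert adv_rob > adv_std       # ...but wins under attack
```

The second challenge, sample-hungry in-context learning, does not appear in this two-feature sketch; it concerns how many demonstrations the adversarially pretrained model needs before its in-context estimate of the task matches a standard model's.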

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
