Adversarially Pretrained Transformers may be Universally Robust In-Context Learners

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Adversarial Robustness, Transformer, In-Context Learning
Abstract:

Adversarial training is one of the most effective adversarial defenses, but it incurs a high computational cost. In this study, we present the first theoretical analysis suggesting that adversarially pretrained transformers can serve as universally robust foundation models: models that robustly adapt to diverse downstream tasks with only lightweight tuning. Specifically, we demonstrate that single-layer linear transformers, after adversarial pretraining across a variety of classification tasks, can robustly generalize to unseen classification tasks through in-context learning from clean demonstrations (i.e., without additional adversarial training or adversarial examples). This universal robustness stems from the model's ability to adaptively focus on robust features within a given task. We also identify two open challenges in attaining this robustness: an accuracy-robustness trade-off and sample-hungry training. This study initiates the discussion on the utility of universally robust foundation models: while their training is expensive, the investment would prove worthwhile because downstream tasks enjoy adversarial robustness for free.
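The abstract's setting, a single-layer linear transformer performing in-context classification from clean demonstrations, can be sketched in a toy form. The snippet below is a hypothetical illustration, not the paper's construction: it uses an identity matrix `W` as a stand-in for the pretrained attention weights and a label-weighted attention vote as the in-context predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of in-context learning with a single-layer linear-attention
# model (illustrative only). The model predicts a query label as an
# attention-weighted vote over clean demonstrations:
#   f(x_q) = sign( sum_i y_i * (x_i @ W @ x_q) )

d, n = 8, 64                       # feature dimension, number of demonstrations
w_star = rng.standard_normal(d)    # unseen task: labels y = sign(<w*, x>)

X = rng.standard_normal((n, d))    # clean demonstrations (no adversarial examples)
y = np.sign(X @ w_star)

W = np.eye(d)                      # stand-in for pretrained attention weights

def predict(x_query):
    scores = X @ W @ x_query       # attention logits between demos and the query
    return np.sign(y @ scores)     # label-weighted vote

x_q = rng.standard_normal(d)
assert predict(x_q) in (-1.0, 1.0)
```

With `W = I`, the vote reduces to `sign(x_q @ X.T @ y)`, i.e., a one-step estimate of the task direction from the demonstrations alone; the pretrained `W` in the paper's analysis plays the role of reweighting which feature directions that estimate trusts.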

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes that adversarially pretrained transformers can serve as universally robust foundation models, enabling robust adaptation to downstream tasks through in-context learning without additional adversarial training. It resides in the 'Adversarial Robustness Theory' leaf under 'Theoretical Foundations of In-Context Learning', a leaf containing five papers in total. This leaf focuses specifically on theoretical analysis of robustness mechanisms and defense properties in in-context learning, representing a moderately populated research direction within a taxonomy of forty papers across the broader field of adversarial robustness in pretrained transformers.

The paper's leaf sits alongside 'Learning Dynamics and Generalization Theory', which examines non-adversarial properties of in-context learning such as algorithm implementation and distributional generalization. Neighboring branches include 'Defense Mechanisms and Robustness Enhancement', particularly 'Adversarial Training and Pretraining Strategies', which contains four papers on training-time defenses. The taxonomy's scope note clarifies that this leaf focuses on theoretical analysis rather than empirical evaluation or attack methods, positioning the work at the intersection of foundational theory and proactive defense design through pretraining strategies.

Among thirty candidates examined, contribution analysis reveals mixed novelty signals. The core theoretical analysis of universally robust pretrained transformers examined ten candidates with zero refutations, suggesting this framing may be relatively unexplored. However, the condition for robust adaptation based on robust versus non-robust features examined ten candidates and found one refutable match, indicating some overlap with existing frameworks. The identification of accuracy-robustness trade-offs and sample complexity challenges examined ten candidates with no refutations, though these are well-known phenomena in adversarial learning. The limited search scope means substantial relevant work may exist outside the top-thirty semantic matches examined.

Based on the examined literature, the universal robustness framing for pretrained transformers appears less explored than the underlying trade-offs and feature frameworks. The analysis covers top-thirty semantic matches plus citation expansion, providing reasonable coverage of closely related theoretical work but not exhaustive field-wide search. The taxonomy structure suggests this theoretical robustness direction, while moderately populated, remains less saturated than empirical attack-defense cycles or domain-specific applications.

Taxonomy

- Core-task Taxonomy Papers: 40
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: adversarial robustness of pretrained transformers through in-context learning. The field has organized itself around six main branches that collectively address how transformers learn from demonstrations and how that learning can be attacked or defended.

Theoretical Foundations of In-Context Learning explores the mathematical underpinnings, examining how transformers implement algorithms like linear regression (Transformers Learn Linear Models[1]) and the role of architectural choices such as positional encodings (Positional Encoding Complexity[23]). Adversarial Attacks on In-Context Learning investigates vulnerabilities ranging from context hijacking (Context Hijacking Robustness[6], Hijacking via Adversarial ICL[8]) to data poisoning (Data Poisoning ICL[4]) and retrieval manipulation (Neural Ranking Attacks[11]). Defense Mechanisms and Robustness Enhancement develops protective strategies, including robust retrieval methods (Robust Retrieval Augmented Learning[2], Safeguarding Retrieval ICL[12]) and specialized shields (ICLShield[16]). Empirical Robustness Evaluation systematically tests model behavior under adversarial conditions (Retrieval Robustness Evaluation[5]), while Domain-Specific Applications adapts these insights to translation (Robust Translation ICL[13]), reinforcement learning (Robust ICL Reinforcement[26]), and other tasks. Reliability and Trustworthiness Frameworks addresses broader concerns about securing foundation models (Securing Foundation Models[24]) and ensuring safe deployment.

A particularly active tension exists between understanding in-context learning as implicit optimization and studying its failure modes under adversarial pressure. Works like Adversarial Robustness Linear Regression[3] and Linear Models Adversarial Lens[9] bridge theory and robustness by analyzing how adversarial perturbations affect the linear models that transformers approximate during in-context learning.
Adversarially Pretrained Transformers[0] sits squarely within this theoretical robustness cluster, examining how pretraining strategies can build inherent resilience into the learning process itself. Compared to Adversarial Robustness Linear Regression[3], which focuses on the mathematical properties of robust regression in the ICL setting, the original work emphasizes pretraining as a proactive defense mechanism. Meanwhile, Linear Models Adversarial Lens[9] provides complementary analysis of how adversarial examples interact with the implicit models learned in-context, offering a diagnostic perspective that complements the constructive approach of adversarial pretraining.

Claimed Contributions

Theoretical analysis of universally robust adversarially pretrained transformers

The authors provide the first theoretical analysis suggesting that single-layer linear transformers, after adversarial pretraining on multiple classification tasks, can robustly generalize to unseen tasks through in-context learning from clean demonstrations alone, without requiring additional adversarial training or examples.

10 retrieved papers
Condition for robust adaptation based on robust and non-robust features framework

The authors derive theoretical conditions under which adversarially pretrained transformers achieve universal robustness by demonstrating that these models adaptively prioritize robust features over non-robust features in downstream tasks, using the conceptual framework of robust versus non-robust features.

10 retrieved papers
Can Refute
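The feature dichotomy underlying this contribution can be made concrete with a toy linear example. Everything below is an assumed, simplified setup (the feature margins, the l-infinity attack model, and the two hand-set classifiers are all illustrative, not the paper's formal construction): a non-robust feature is perfectly predictive on clean data but flips under a small worst-case perturbation, while a robust feature's margin exceeds the perturbation budget.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy illustration of robust vs. non-robust features.
# Feature 0 (robust):  large margin, survives eps-perturbations.
# Feature 1 (fragile): perfectly predictive on clean data, margin < eps.

n, eps = 500, 0.5
y = rng.choice([-1.0, 1.0], size=n)
robust = 2.0 * y + 0.3 * rng.standard_normal(n)   # margin ~2  > eps
fragile = 0.1 * y                                  # margin 0.1 < eps
X = np.stack([robust, fragile], axis=1)

def accuracy(w, X, y, eps=0.0):
    # Worst-case l-infinity attack on a linear classifier:
    # each feature shifts by eps against the label, costing eps * ||w||_1.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return (margins > 0).mean()

w_fragile = np.array([0.0, 1.0])   # relies on the non-robust feature
w_robust = np.array([1.0, 0.0])    # relies on the robust feature

assert accuracy(w_fragile, X, y) == 1.0        # perfect clean accuracy...
assert accuracy(w_fragile, X, y, eps) == 0.0   # ...destroyed by the eps-attack
assert accuracy(w_robust, X, y, eps) > 0.95    # robust feature survives
```

The contribution's claimed condition is, in this picture, that adversarial pretraining teaches the attention weights to put their mass on the first kind of feature in whatever downstream task the demonstrations describe.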
Identification of accuracy-robustness trade-off and sample-hungry in-context learning as open problems

The authors formally show that adversarially pretrained single-layer linear transformers exhibit two persistent challenges: lower clean accuracy compared to standard models and the requirement for more in-context demonstrations to achieve comparable performance.

10 retrieved papers
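The first of these challenges, the accuracy-robustness trade-off, can likewise be illustrated in a toy linear setup (all numbers and features below are illustrative assumptions, not the paper's transformer analysis): a predictor that leans on a noisy robust feature gives up clean accuracy relative to one that exploits a fragile but perfectly predictive feature, yet fares far better under attack.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy illustration of the accuracy-robustness trade-off.
# The fragile feature yields perfect clean accuracy; the robust feature
# is noisy, so a robust predictor sacrifices some clean accuracy.

n, eps = 2000, 0.5
y = rng.choice([-1.0, 1.0], size=n)
robust = 0.8 * y + rng.standard_normal(n)   # noisy, but margin can exceed eps
fragile = 0.1 * y                            # clean-perfect, margin < eps
X = np.stack([robust, fragile], axis=1)

def accuracy(w, eps=0.0):
    # Worst-case l-infinity attack costs eps * ||w||_1 of margin.
    margins = y * (X @ w) - eps * np.abs(w).sum()
    return (margins > 0).mean()

clean_std = accuracy(np.array([0.0, 1.0]))        # standard model, clean
clean_rob = accuracy(np.array([1.0, 0.0]))        # robust model, clean
adv_std = accuracy(np.array([0.0, 1.0]), eps)     # standard model, attacked
adv_rob = accuracy(np.array([1.0, 0.0]), eps)     # robust model, attacked

assert clean_std > clean_rob   # trade-off: robust model loses clean accuracy
assert adv_rob > adv_std       # ...but wins under attack
```

The second challenge, sample-hungry in-context learning, does not appear in this two-feature sketch; it concerns how many demonstrations the adversarially pretrained model needs before its in-context estimate of the task matches a standard model's.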

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
