AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Vision-Language Models, Adversarial Training
Abstract:

Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from significant performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, enhancing the model's preference for generating normal outputs on clean inputs while rejecting potentially misleading outputs on adversarial examples. Notably, AdPO achieves this by modifying only the image encoder, e.g., a CLIP ViT, yielding superior clean and adversarial performance on a variety of downstream tasks. Given the computational cost of training large language models, we show that training on smaller LVLMs and transferring the result to larger ones achieves state-of-the-art performance with efficiency comparable to previous methods. Our comprehensive experiments confirm the effectiveness of AdPO and highlight the potential of preference-based learning for adversarially robust multimodal systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Enhancing adversarial robustness of large vision-language models. The field is organized around five main branches that collectively address how to make multimodal systems more resilient to adversarial perturbations. Defense Mechanisms and Robustness Enhancement encompasses a wide range of techniques, from prompt-based methods like PromptSmooth[8] and R-TPT[7] to architectural modifications such as ArmorCLIP[24] and training-time interventions including Robust-LLaVA[28], all aimed at hardening models against attacks. Attack Strategies and Vulnerability Analysis explores how adversaries can exploit weaknesses in vision-language models, examining both image-level perturbations and text-based jailbreaking approaches. Evaluation and Analysis Frameworks provide systematic ways to measure robustness across diverse settings, while Surveys and Comprehensive Studies offer broad perspectives on safety and adversarial challenges in multimodal systems. Related Topics and Auxiliary Studies connect robustness research to broader concerns such as missing modality handling and domain-specific applications.

Within Defense Mechanisms, a particularly active line of work focuses on specialized techniques that adapt models at inference or training time without full retraining. Preference optimization methods, prompt tuning strategies like Adversarial Prompt Tuning[10] and Few-shot Adversarial Prompt[5], and ensemble-based defenses represent contrasting trade-offs between computational overhead and robustness gains.

AdPO[0] sits within this specialized defense cluster, emphasizing preference optimization to align model behavior under adversarial conditions. Compared to prompt-smoothing approaches such as PromptSmooth[8] that aggregate predictions over perturbed prompts, or test-time adaptation methods like Tapt[2] that refine representations dynamically, AdPO[0] leverages preference signals to guide the model toward more robust decision boundaries. This positions it alongside works like Alignment Perturbation[11] that also explore alignment-based defenses, yet AdPO[0] distinctively integrates preference learning into the robustness enhancement pipeline, offering a complementary angle to purely prompt-based or architectural defenses.

Claimed Contributions

AdPO: Adversarial defense strategy based on preference optimization

The authors propose AdPO, a novel adversarial defense method that reframes adversarial training as a preference optimization problem. This approach enhances LVLMs' preference for generating correct outputs on clean inputs while rejecting misleading outputs on adversarial examples, representing the first application of preference optimization techniques to adversarial training.

10 retrieved papers
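The report does not reproduce the paper's loss formulation. As a rough illustration of the preference-optimization framing described above, a DPO-style objective over a pair of (preferred output on the clean image, rejected output elicited by the adversarial image) could look like the following sketch; the function name, argument layout, and the beta scale are assumptions for illustration, not the paper's actual objective:

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adpo_preference_loss(logp_preferred: float, logp_rejected: float,
                         ref_logp_preferred: float, ref_logp_rejected: float,
                         beta: float = 0.1) -> float:
    """DPO-style preference loss: raise the policy's log-probability of the
    normal output (clean image) and lower that of the misleading output
    (adversarial image), both measured relative to a frozen reference model."""
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(_sigmoid(margin))
```

At a zero margin the loss equals log 2 (about 0.693) and it shrinks as the preferred completion gains probability mass relative to the rejected one, which matches the stated goal of preferring correct outputs on clean inputs over misleading outputs on adversarial ones.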
Dual optimization strategy combining PIO and AIO

The authors introduce two complementary optimization components: Preferred Image Optimization (PIO) increases the probability of correct outputs under clean inputs while decreasing the probability of erroneous outputs under adversarial images, and Adversarial Image Optimization (AIO) explicitly optimizes for correct responses under adversarial inputs. This dual approach serves as a general adversarial training framework applicable beyond specific algorithms or models.

10 retrieved papers
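The combination of the two components described above could be sketched as a single training objective: a preference term (PIO) plus a likelihood term on the adversarial input (AIO). The decomposition below, including the lambda weighting between the two terms, is a hypothetical reading of the description, not the paper's exact formulation:

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dual_objective(logp_clean_correct: float, logp_adv_wrong: float,
                   logp_adv_correct: float, ref_clean_correct: float,
                   ref_adv_wrong: float, beta: float = 0.1,
                   lam: float = 1.0) -> float:
    """Hypothetical PIO + AIO combination.

    PIO: a DPO-style preference term favoring the correct output on the
    clean image over the erroneous output produced under the adversarial
    image, relative to a frozen reference model.
    AIO: a likelihood term directly rewarding the correct response when
    the input image is adversarial.
    """
    pio = -math.log(_sigmoid(beta * ((logp_clean_correct - ref_clean_correct)
                                     - (logp_adv_wrong - ref_adv_wrong))))
    aio = -logp_adv_correct  # negative log-likelihood of the correct answer
    return pio + lam * aio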
Transfer learning approach from smaller to larger LVLMs

The authors demonstrate that adversarial training can be performed on smaller LVLM models (e.g., TinyLLaVA) and the resulting robust image encoder can be transferred to larger models. This strategy achieves computational efficiency comparable to previous methods while reducing overfitting risks and enabling fair comparison with prior CLIP-based approaches.

10 retrieved papers
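Mechanically, the transfer described above amounts to swapping the robust image-encoder weights into a larger model's checkpoint while leaving the language model untouched. A minimal sketch over plain state-dict-style mappings, assuming both models wrap the same CLIP ViT and store its weights under a shared key prefix (the `vision_tower.` prefix here is a hypothetical layout):

```python
def transfer_vision_encoder(small_ckpt: dict, large_ckpt: dict,
                            prefix: str = "vision_tower.") -> dict:
    """Copy adversarially fine-tuned vision-encoder weights from a small
    LVLM checkpoint into a larger one, leaving all non-encoder weights
    (language model, projector) from the large checkpoint intact."""
    merged = dict(large_ckpt)
    for key, value in small_ckpt.items():
        if key.startswith(prefix):
            if key not in large_ckpt:
                raise KeyError(f"encoder key {key!r} missing from target model")
            merged[key] = value
    return merged
```

This only works when the two models share an identical encoder architecture, which is why the comparison with prior CLIP-based defenses (which likewise swap in a robust CLIP) is described as fair.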

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AdPO: Adversarial defense strategy based on preference optimization

The authors propose AdPO, a novel adversarial defense method that reframes adversarial training as a preference optimization problem. This approach enhances LVLMs' preference for generating correct outputs on clean inputs while rejecting misleading outputs on adversarial examples, representing the first application of preference optimization techniques to adversarial training.

Contribution

Dual optimization strategy combining PIO and AIO

The authors introduce two complementary optimization components: Preferred Image Optimization (PIO) increases the probability of correct outputs under clean inputs while decreasing the probability of erroneous outputs under adversarial images, and Adversarial Image Optimization (AIO) explicitly optimizes for correct responses under adversarial inputs. This dual approach serves as a general adversarial training framework applicable beyond specific algorithms or models.

Contribution

Transfer learning approach from smaller to larger LVLMs

The authors demonstrate that adversarial training can be performed on smaller LVLM models (e.g., TinyLLaVA) and the resulting robust image encoder can be transferred to larger models. This strategy achieves computational efficiency comparable to previous methods while reducing overfitting risks and enabling fair comparison with prior CLIP-based approaches.