AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Large Vision-Language Models, Adversarial Training
Abstract:

Large Vision-Language Models (LVLMs), such as GPT-4o and LLaVA, have recently witnessed remarkable advancements and are increasingly being deployed in real-world applications. However, inheriting the sensitivity of visual neural networks, LVLMs remain vulnerable to adversarial attacks, which can result in erroneous or malicious outputs. While existing efforts utilize adversarial fine-tuning to enhance robustness, they often suffer from significant performance degradation on clean inputs. In this paper, we propose AdPO, a novel adversarial defense strategy for LVLMs based on preference optimization. For the first time, we reframe adversarial training as a preference optimization problem, enhancing the model's preference for generating normal outputs on clean inputs while rejecting potentially misleading outputs on adversarial examples. Notably, AdPO achieves this by modifying only the image encoder, e.g., a CLIP ViT, yielding superior clean and adversarial performance on a variety of downstream tasks. Given the computational cost of training large language models, we show that training on smaller LVLMs and transferring the result to larger ones achieves state-of-the-art performance with efficiency comparable to previous methods. Our comprehensive experiments confirm the effectiveness of AdPO and highlight the potential of preference-based learning for adversarially robust multimodal systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Enhancing adversarial robustness of large vision-language models. The field is organized around five main branches that collectively address how to make multimodal systems more resilient to adversarial perturbations. Defense Mechanisms and Robustness Enhancement encompasses a wide range of techniques, from prompt-based methods like PromptSmooth[8] and R-TPT[7] to architectural modifications such as ArmorCLIP[24] and training-time interventions including Robust-LLaVA[28], all aimed at hardening models against attacks. Attack Strategies and Vulnerability Analysis explores how adversaries can exploit weaknesses in vision-language models, examining both image-level perturbations and text-based jailbreaking approaches. Evaluation and Analysis Frameworks provide systematic ways to measure robustness across diverse settings, while Surveys and Comprehensive Studies offer broad perspectives on safety and adversarial challenges in multimodal systems. Related Topics and Auxiliary Studies connect robustness research to broader concerns such as missing modality handling and domain-specific applications.

Within Defense Mechanisms, a particularly active line of work focuses on specialized techniques that adapt models at inference or training time without full retraining. Preference optimization methods, prompt tuning strategies like Adversarial Prompt Tuning[10] and Few-shot Adversarial Prompt[5], and ensemble-based defenses represent contrasting trade-offs between computational overhead and robustness gains.

AdPO[0] sits within this specialized defense cluster, emphasizing preference optimization to align model behavior under adversarial conditions. Compared to prompt-smoothing approaches such as PromptSmooth[8] that aggregate predictions over perturbed prompts, or test-time adaptation methods like Tapt[2] that refine representations dynamically, AdPO[0] leverages preference signals to guide the model toward more robust decision boundaries. This positions it alongside works like Alignment Perturbation[11] that also explore alignment-based defenses, yet AdPO[0] distinctively integrates preference learning into the robustness enhancement pipeline, offering a complementary angle to purely prompt-based or architectural defenses.

Claimed Contributions

AdPO: Adversarial defense strategy based on preference optimization

The authors propose AdPO, a novel adversarial defense method that reframes adversarial training as a preference optimization problem. This approach enhances LVLMs' preference for generating correct outputs on clean inputs while rejecting misleading outputs on adversarial examples, representing the first application of preference optimization techniques to adversarial training.

10 retrieved papers
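The report does not reproduce the paper's loss formulation. As a rough illustration of the preference-optimization framing described above, a DPO-style objective over a pair of (preferred output on the clean image, rejected output elicited by the adversarial image) could look like the following sketch; the function name, argument layout, and the beta scale are assumptions for illustration, not the paper's actual objective:

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def adpo_preference_loss(logp_preferred: float, logp_rejected: float,
                         ref_logp_preferred: float, ref_logp_rejected: float,
                         beta: float = 0.1) -> float:
    """DPO-style preference loss: raise the policy's log-probability of the
    normal output (clean image) and lower that of the misleading output
    (adversarial image), both measured relative to a frozen reference model."""
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(_sigmoid(margin))
```

At a zero margin the loss equals log 2 (about 0.693) and it shrinks as the preferred completion gains probability mass relative to the rejected one, which matches the stated goal of preferring correct outputs on clean inputs over misleading outputs on adversarial ones.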
Dual optimization strategy combining PIO and AIO

The authors introduce two complementary optimization components: Preferred Image Optimization (PIO) increases the probability of correct outputs under clean inputs while decreasing the probability of erroneous outputs under adversarial images, and Adversarial Image Optimization (AIO) explicitly optimizes for correct responses under adversarial inputs. This dual approach serves as a general adversarial training framework applicable beyond specific algorithms or models.

10 retrieved papers
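The combination of the two components described above could be sketched as a single training objective: a preference term (PIO) plus a likelihood term on the adversarial input (AIO). The decomposition below, including the lambda weighting between the two terms, is a hypothetical reading of the description, not the paper's exact formulation:

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dual_objective(logp_clean_correct: float, logp_adv_wrong: float,
                   logp_adv_correct: float, ref_clean_correct: float,
                   ref_adv_wrong: float, beta: float = 0.1,
                   lam: float = 1.0) -> float:
    """Hypothetical PIO + AIO combination.

    PIO: a DPO-style preference term favoring the correct output on the
    clean image over the erroneous output produced under the adversarial
    image, relative to a frozen reference model.
    AIO: a likelihood term directly rewarding the correct response when
    the input image is adversarial.
    """
    pio = -math.log(_sigmoid(beta * ((logp_clean_correct - ref_clean_correct)
                                     - (logp_adv_wrong - ref_adv_wrong))))
    aio = -logp_adv_correct  # negative log-likelihood of the correct answer
    return pio + lam * aio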
Transfer learning approach from smaller to larger LVLMs

The authors demonstrate that adversarial training can be performed on smaller LVLM models (e.g., TinyLLaVA) and the resulting robust image encoder can be transferred to larger models. This strategy achieves computational efficiency comparable to previous methods while reducing overfitting risks and enabling fair comparison with prior CLIP-based approaches.

10 retrieved papers
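Mechanically, the transfer described above amounts to swapping the robust image-encoder weights into a larger model's checkpoint while leaving the language model untouched. A minimal sketch over plain state-dict-style mappings, assuming both models wrap the same CLIP ViT and store its weights under a shared key prefix (the `vision_tower.` prefix here is a hypothetical layout):

```python
def transfer_vision_encoder(small_ckpt: dict, large_ckpt: dict,
                            prefix: str = "vision_tower.") -> dict:
    """Copy adversarially fine-tuned vision-encoder weights from a small
    LVLM checkpoint into a larger one, leaving all non-encoder weights
    (language model, projector) from the large checkpoint intact."""
    merged = dict(large_ckpt)
    for key, value in small_ckpt.items():
        if key.startswith(prefix):
            if key not in large_ckpt:
                raise KeyError(f"encoder key {key!r} missing from target model")
            merged[key] = value
    return merged
```

This only works when the two models share an identical encoder architecture, which is why the comparison with prior CLIP-based defenses (which likewise swap in a robust CLIP) is described as fair.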

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AdPO: Adversarial defense strategy based on preference optimization

The authors propose AdPO, a novel adversarial defense method that reframes adversarial training as a preference optimization problem. This approach enhances LVLMs' preference for generating correct outputs on clean inputs while rejecting misleading outputs on adversarial examples, representing the first application of preference optimization techniques to adversarial training.

Contribution

Dual optimization strategy combining PIO and AIO

The authors introduce two complementary optimization components: Preferred Image Optimization (PIO) increases the probability of correct outputs under clean inputs while decreasing the probability of erroneous outputs under adversarial images, and Adversarial Image Optimization (AIO) explicitly optimizes for correct responses under adversarial inputs. This dual approach serves as a general adversarial training framework applicable beyond specific algorithms or models.

Contribution

Transfer learning approach from smaller to larger LVLMs

The authors demonstrate that adversarial training can be performed on smaller LVLM models (e.g., TinyLLaVA) and the resulting robust image encoder can be transferred to larger models. This strategy achieves computational efficiency comparable to previous methods while reducing overfitting risks and enabling fair comparison with prior CLIP-based approaches.