What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging
Overview
Overall Novelty Assessment
The paper contributes a dataset pipeline (CoVAND) and a token merging module (NegToMe) for negation understanding in described object detection. It resides in the Token-Level Negation Processing leaf under Architecture and Loss Modifications. This leaf currently contains only this paper, indicating a sparse research direction within the broader negation understanding field. The taxonomy shows 45 papers across the entire field, with most work concentrated in evaluation benchmarks and test-time adaptation rather than architectural token-level interventions during training.
The taxonomy reveals neighboring approaches in sibling branches. Contrastive Learning with Negation-Aware Objectives modifies loss functions rather than token processing, while Test-Time Adaptation methods like Embedding Space Manipulation operate at inference without architectural changes. The parent branch Architecture and Loss Modifications encompasses training-time interventions, distinguishing this work from the larger cluster of evaluation-focused studies under Negation-Specific Benchmarks. The scope note clarifies that token-level architectural modifications during training belong here, separating this from inference-only embedding manipulations or prompt engineering strategies.
Among the 15 candidates examined, none clearly refuted a contribution: the CoVAND dataset was checked against 1 candidate, NegToMe against 4, and the parameter-efficient adaptation recipe against 10, all with 0 refutations. Within this limited search scope, the specific combination of token merging for negation and instance-grounded dataset generation appears novel. The absence of refutations across all three contributions indicates that, among the top-15 semantic matches, no prior work directly overlaps with binding negation tokens into coherent semantic phrases at the architectural level.
Based on the top-15 semantic search results, the work appears to occupy a relatively unexplored niche combining architectural token processing with negation-aware data generation. The sparse population of the Token-Level Negation Processing leaf and the absence of refutations suggest novelty, though the limited search scope means potentially relevant work outside the top-15 matches may exist. The analysis covers architectural and dataset contributions but does not exhaustively survey all parameter-efficient adaptation methods in vision-language models.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present CoVAND, a negation-focused dataset generated through a three-step chain-of-thought reasoning process followed by VQA-based caption alignment. This pipeline produces region-grounded positive and negative caption pairs with significantly higher negation word frequency (9.29%) than existing datasets, addressing the scarcity of negation expressions in vision-language training data.
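The three-step generation logic can be sketched in miniature. The function name, attribute inputs, and caption templates below are illustrative assumptions, not the authors' actual prompts or pipeline:

```python
def covand_style_pair(noun, present_attrs, candidate_attrs):
    """Sketch of the reasoning steps behind a CoVAND-style caption pair.
    Step 1: take the attributes actually present in the image region.
    Step 2: choose a candidate attribute absent from the region to negate.
    Step 3: compose an aligned positive/negative caption pair.
    (In the real pipeline, a VQA model then verifies each caption
    against the region before the pair is accepted.)"""
    present = set(present_attrs)
    absent = [a for a in candidate_attrs if a not in present]  # step 2
    if not absent:
        raise ValueError("no absent attribute available to negate")
    positive = f"a {present_attrs[0]} {noun}"                  # step 3
    negative = f"a {noun} that is not {absent[0]}"
    return positive, negative

pos, neg = covand_style_pair("ball", ["red"], ["red", "blue"])
```

Because the negative caption explicitly names an absent attribute, every generated pair contributes at least one negation word, which is how a pipeline of this shape can drive negation-word frequency far above that of ordinary caption corpora.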
The authors introduce NegToMe, a text token merging module that groups negation cues with their modified attributes into coherent semantic phrases and applies a negation-aware boost to preserve semantic polarity. This is the first work to employ a boosted token merging strategy for preserving semantic polarity in VLM-based detection.
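A minimal sketch of the merging idea, assuming a fixed cue list, mean-pooled merging of adjacent tokens, and a scalar boost. The actual module operates on learned representations inside the text encoder, so everything here is illustrative:

```python
NEGATION_CUES = {"no", "not", "without", "never"}

def merge_negation_tokens(tokens, embeddings, boost=1.5):
    """Group each negation cue with the token it modifies, merge their
    embeddings into a single phrase embedding, and up-weight the merged
    embedding by `boost` so the negated phrase keeps its polarity rather
    than being diluted among the other tokens."""
    merged_tokens, merged_embs = [], []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
            merged_tokens.append(f"{tokens[i]}_{tokens[i + 1]}")
            merged_embs.append([boost * (a + b) / 2
                                for a, b in zip(embeddings[i],
                                                embeddings[i + 1])])
            i += 2  # cue and its modified token consumed together
        else:
            merged_tokens.append(tokens[i])
            merged_embs.append(embeddings[i])
            i += 1
    return merged_tokens, merged_embs

tokens = ["a", "dog", "not", "wearing", "a", "collar"]
merged_tokens, merged_embs = merge_negation_tokens(tokens, [[1.0, 1.0]] * 6)
```

The design point the sketch illustrates: once "not" and "wearing" share one boosted embedding, downstream cross-attention cannot attend to the attribute while ignoring the negation cue.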
The authors propose a lightweight adaptation method that combines NegToMe with targeted Low-Rank Adaptation (LoRA) applied to deep cross-attention layers. This strategy modifies less than 0.1% of model parameters while achieving significant improvements in negation comprehension across multiple VLM architectures.
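The parameter budget of such a recipe is easy to sanity-check. The sketch below assumes LoRA is applied only to the query and value projections of the deepest few cross-attention layers; the layer count, rank, hidden size, and model size are illustrative, not the paper's exact configuration:

```python
def lora_param_fraction(d_model, n_adapted_layers, r, total_params):
    """Fraction of model parameters added by rank-r LoRA on the query
    and value projections of `n_adapted_layers` cross-attention layers.
    Each adapted (d_model x d_model) projection gains two low-rank
    factors of shapes (d_model x r) and (r x d_model)."""
    per_projection = 2 * d_model * r               # A and B factors
    added = n_adapted_layers * 2 * per_projection  # q and v per layer
    return added / total_params

# Illustrative numbers: a ~200M-parameter detector, rank 4,
# three deep cross-attention layers adapted.
frac = lora_param_fraction(d_model=768, n_adapted_layers=3, r=4,
                           total_params=200_000_000)
```

With these assumed values the fraction is roughly 0.02%, comfortably inside the under-0.1% budget the paper reports, which is why restricting LoRA to a few deep cross-attention layers keeps the adaptation so cheap.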
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
CoVAND dataset with CoT and VQA-based pipeline
The authors present CoVAND, a negation-focused dataset generated through a three-step chain-of-thought reasoning process followed by VQA-based caption alignment. This pipeline produces region-grounded positive and negative caption pairs with significantly higher negation word frequency (9.29%) than existing datasets, addressing the scarcity of negation expressions in vision-language training data.
[56] Reasoning and question answering about image-text multi-modal contexts
NegToMe text token merging module with negation-aware boost
The authors introduce NegToMe, a text token merging module that groups negation cues with their modified attributes into coherent semantic phrases and applies a negation-aware boost to preserve semantic polarity. This is the first work to employ a boosted token merging strategy for preserving semantic polarity in VLM-based detection.
[7] From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
[28] SpaceVLM: Sub-Space Modeling of Negation in Vision-Language Models
[37] Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment
[42] Safe Vision-Language Models via Unsafe Weights Manipulation
Parameter-efficient adaptation recipe combining NegToMe with strategic LoRA
The authors propose a lightweight adaptation method that combines NegToMe with targeted Low-Rank Adaptation (LoRA) applied to deep cross-attention layers. This strategy modifies less than 0.1% of model parameters while achieving significant improvements in negation comprehension across multiple VLM architectures.