What "Not" to Detect: Negation-Aware VLMs via Structured Reasoning and Token Merging

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Vision Language Model, Negation Understanding, Affirmative Bias, Described Object Detection, Chain-of-Thought Reasoning, Token Merging
Abstract:

State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a novel, lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe fundamentally addresses the structural loss of negation cues in tokenization, grouping them with attributes into coherent semantic phrases. It maintains correct polarity at the input level, enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient and strategic LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks with a lowered false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and demonstrating generalization to SoTA VLMs. This work marks a crucial step forward in addressing negation understanding for real-world detection applications.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a dataset pipeline (CoVAND) and a token merging module (NegToMe) for negation understanding in described object detection. It resides in the Token-Level Negation Processing leaf under Architecture and Loss Modifications. This leaf currently contains only this paper, indicating a sparse research direction within the broader negation understanding field. The taxonomy shows 45 papers across the entire field, with most work concentrated in evaluation benchmarks and test-time adaptation rather than architectural token-level interventions during training.

The taxonomy reveals neighboring approaches in sibling branches. Contrastive Learning with Negation-Aware Objectives modifies loss functions rather than token processing, while Test-Time Adaptation methods like Embedding Space Manipulation operate at inference without architectural changes. The parent branch Architecture and Loss Modifications encompasses training-time interventions, distinguishing this work from the larger cluster of evaluation-focused studies under Negation-Specific Benchmarks. The scope note clarifies that token-level architectural modifications during training belong here, separating this from inference-only embedding manipulations or prompt engineering strategies.

Among 15 candidates examined, no contributions were clearly refuted: 1 candidate was examined for the CoVAND dataset, 4 for NegToMe, and 10 for the parameter-efficient adaptation recipe, each with 0 refutations. This limited search scope suggests the specific combination of token merging for negation and instance-grounded dataset generation appears novel within the examined literature. The absence of refutations across all three contributions indicates that, among the top-15 semantic matches, no prior work directly overlaps with binding negation tokens into coherent semantic phrases at the architectural level.

Based on the top-15 semantic search results, the work appears to occupy a relatively unexplored niche combining architectural token processing with negation-aware data generation. The sparse population of the Token-Level Negation Processing leaf and the absence of refutations suggest novelty, though the limited search scope means potentially relevant work outside the top-15 matches may exist. The analysis covers architectural and dataset contributions but does not exhaustively survey all parameter-efficient adaptation methods in vision-language models.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: negation understanding in vision-language models. The field addresses a critical weakness in modern VLMs: their tendency to misinterpret or ignore negation cues such as 'not' or 'no' when matching images to text.

The taxonomy organizes research into five main branches. Negation-Aware Model Architectures and Training Methods explores modifications to model design and learning objectives, including token-level processing strategies and specialized loss functions. Evaluation and Analysis of Negation Understanding develops benchmarks and diagnostic tools to measure how well models handle negated descriptions. Domain-Specific Applications and Adaptations tailors negation handling to particular settings like medical imaging or spatial reasoning. General Robustness and Continual Learning examines broader reliability concerns that intersect with negation challenges. Auxiliary Methods and Tools provides supporting techniques such as data augmentation and prompt engineering that indirectly improve negation sensitivity.

Recent work reveals contrasting approaches to the negation problem. Some studies focus on architectural interventions: Negation Aware VLMs[0] proposes token-level negation processing within the architecture branch, while Text Encoders Bottleneck[5] and NOPE[2] diagnose representational limitations in existing encoders. Others emphasize training strategies: CREPE[6] and Meaning Representation Negative[3] develop specialized contrastive objectives, whereas Dual Path Adapter[4] explores parameter-efficient adaptation. Evaluation efforts like NegVQA[29] and VLMs Not Understand Negation[21] systematically document failure modes across models. Negation Aware VLMs[0] sits within the token-level processing cluster, sharing architectural motivations with works like Text Encoders Bottleneck[5] that identify where negation information is lost, but differing from adapter-based methods like Dual Path Adapter[4] that modify frozen pretrained models.
The central tension remains whether negation understanding requires fundamental architectural changes or can emerge from improved training data and objectives.

Claimed Contributions

CoVAND dataset with CoT and VQA-based pipeline

The authors present CoVAND, a negation-focused dataset generated through a three-step chain-of-thought reasoning process followed by VQA-based caption alignment. This pipeline produces region-grounded positive and negative caption pairs with a substantially higher negation-word frequency (9.29%) than existing datasets, addressing the scarcity of negation expressions in vision-language training data.

1 retrieved paper
NegToMe text token merging module with negation-aware boost

The authors introduce NegToMe, a text token merging module that groups negation cues with their modified attributes into coherent semantic phrases and applies a negation-aware boost to preserve semantic polarity. To the authors' knowledge, this is the first boosted token merging strategy for preserving semantic polarity in VLM-based detection.

4 retrieved papers
Parameter-efficient adaptation recipe combining NegToMe with strategic LoRA

The authors propose a lightweight adaptation method that combines NegToMe with targeted Low-Rank Adaptation (LoRA) applied to deep cross-attention layers. This strategy modifies less than 0.1% of model parameters while achieving significant improvements in negation comprehension across multiple VLM architectures.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

CoVAND dataset with CoT and VQA-based pipeline

The authors present CoVAND, a negation-focused dataset generated through a three-step chain-of-thought reasoning process followed by VQA-based caption alignment. This pipeline produces region-grounded positive and negative caption pairs with a substantially higher negation-word frequency (9.29%) than existing datasets, addressing the scarcity of negation expressions in vision-language training data.
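The three-step generation-and-filtering flow described above can be sketched in miniature. Everything here is illustrative: the toy rule-based negation and the function names stand in for the paper's LLM-driven CoT steps and its VQA-based alignment check, which this report does not detail.

```python
# Hypothetical sketch of a CoT + VQA-style negation caption pipeline.
# The heuristics below are placeholders, not the authors' implementation.

def cot_identify_attribute(caption):
    """Step 1 (CoT): choose the attribute to negate (toy heuristic: first word)."""
    words = caption.split()
    return words[0] if words else None

def build_negative_caption(caption, attribute):
    """Step 2 (CoT): rewrite the caption so the chosen attribute is negated."""
    return caption.replace(attribute, f"not {attribute}", 1)

def vqa_align(positive, negative, attribute):
    """Step 3 (VQA-style filter): keep only pairs whose polarity actually differs."""
    return attribute in positive and f"not {attribute}" in negative

def generate_pair(caption):
    attribute = cot_identify_attribute(caption)
    if attribute is None:
        return None
    negative = build_negative_caption(caption, attribute)
    return (caption, negative) if vqa_align(caption, negative, attribute) else None

print(generate_pair("red car parked outside"))
# ('red car parked outside', 'not red car parked outside')
```

In the actual pipeline each step would be performed by a VLM over a grounded image region; the filter step is what keeps only caption pairs whose polarity genuinely flips.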

Contribution

NegToMe text token merging module with negation-aware boost

The authors introduce NegToMe, a text token merging module that groups negation cues with their modified attributes into coherent semantic phrases and applies a negation-aware boost to preserve semantic polarity. To the authors' knowledge, this is the first boosted token merging strategy for preserving semantic polarity in VLM-based detection.
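A minimal sketch of the token-merging idea, assuming a whitespace tokenizer; the NEGATORS set, the one-token merge window, and the boost value are illustrative choices, not the paper's actual module:

```python
# Toy negation-aware token merging: bind a negation cue to the token it
# modifies so polarity survives downstream merging, and up-weight the
# resulting phrase token. NEGATORS, the window, and BOOST are assumptions.

NEGATORS = {"not", "no", "without"}
BOOST = 2.0  # hypothetical up-weighting of merged negation phrases

def merge_negation_tokens(tokens):
    """Return (token, weight) pairs with negation cues bound to the next token."""
    merged, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in NEGATORS and i + 1 < len(tokens):
            # Bind e.g. "no" + "girl" into one token so the model cannot
            # collapse the fragmented pair back to plain "girl".
            merged.append((f"{tok}_{tokens[i + 1]}", BOOST))
            i += 2
        else:
            merged.append((tok, 1.0))
            i += 1
    return merged

print(merge_negation_tokens("a photo with no girl".split()))
# [('a', 1.0), ('photo', 1.0), ('with', 1.0), ('no_girl', 2.0)]
```

The real module operates on subword embeddings and merges whole attribute phrases rather than single following words, but the invariant is the same: the negation cue and its target enter the model as one semantic unit with boosted weight.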

Contribution

Parameter-efficient adaptation recipe combining NegToMe with strategic LoRA

The authors propose a lightweight adaptation method that combines NegToMe with targeted Low-Rank Adaptation (LoRA) applied to deep cross-attention layers. This strategy modifies less than 0.1% of model parameters while achieving significant improvements in negation comprehension across multiple VLM architectures.
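A back-of-the-envelope sketch shows why rank-r LoRA on only a few cross-attention layers stays within such a small parameter budget. The layer width, depth, and rank below are invented for illustration; only the less-than-0.1% trainable-fraction claim comes from the text.

```python
# Parameter budget for rank-r LoRA: each adapted weight matrix gains two
# low-rank factors, A (r x d) and B (d x r). All sizes here are assumptions.

def lora_params(d_model, rank, n_layers, matrices_per_layer=2):
    """Trainable params added by LoRA across the targeted layers."""
    return n_layers * matrices_per_layer * 2 * d_model * rank

def trainable_fraction(total_params, d_model, rank, n_layers):
    """Fraction of the full model that LoRA makes trainable."""
    return lora_params(d_model, rank, n_layers) / total_params

# e.g. a hypothetical ~233M-parameter detector, rank-4 LoRA on the
# query/value projections of the last 3 cross-attention layers
frac = trainable_fraction(233_000_000, d_model=256, rank=4, n_layers=3)
print(f"{frac:.5%}")  # well under the 0.1% budget stated above
```

Because the added parameters scale with d_model * rank rather than d_model squared, restricting LoRA to a handful of deep cross-attention layers keeps the trainable fraction orders of magnitude below full fine-tuning.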