Fantastic Tractor-Dogs and How Not to Find Them With Open-Vocabulary Detectors
Overview
Overall Novelty Assessment
The paper proposes a training-free method using attention sink tokens to mitigate background false positives in early-fusion open-vocabulary detectors. It resides in the 'Attention Sink Tokens for Background Suppression' leaf under 'Attention and Feature Fusion Mechanisms', which currently contains only this work. This indicates a relatively sparse research direction within the broader taxonomy of 26 papers across multiple branches. The taxonomy shows that most prior efforts concentrate on training-based refinement, post-processing calibration, or objectness modeling, leaving architectural attention mechanisms less explored.
The paper's leaf sits alongside 'Disentangled Representation Learning' within the same parent branch, suggesting that attention and fusion mechanisms are an emerging but not yet crowded area. Neighboring branches include 'Training-Based Refinement' (four leaves covering pseudo-label filtering, retrieval augmentation, negative prompt learning, and background sample handling) and 'Post-Processing and Inference-Time Calibration' (three leaves for temperature scaling, confidence aggregation, and linear probing). The taxonomy's scope note clarifies that this branch excludes post-processing and training-based methods, positioning the work as an architectural intervention distinct from calibration or retraining strategies.
Across the 21 candidates examined, no contribution was clearly refuted. The first contribution (identifying background false positives) was checked against 4 candidates, the second (fusion layer analysis) against 10, and the third (attention sink method) against 7, with none refutable. This suggests that within the limited search scope, no prior work directly overlaps with the specific combination of problem identification, mechanistic explanation, and training-free sink token solution. However, the search scale is modest, and the absence of refutation reflects the examined sample rather than exhaustive coverage.
Given the limited search scope of 21 candidates, the work appears novel in its specific approach to background false positives through attention sinks. The taxonomy structure confirms that attention-based architectural interventions are less populated than training or calibration methods. While the analysis does not guarantee no prior work exists beyond the examined candidates, the combination of problem framing, mechanistic insight, and training-free solution appears distinct within the surveyed literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors identify and quantify a critical flaw in early-fusion open-vocabulary detectors: high rates of confident false positive predictions on background-only images (images without target objects). They demonstrate that standard benchmarks like COCO and LVIS mask this issue because they rarely contain images without target classes, and propose an adaptation to existing benchmarks to measure background false positive rates.
The authors establish that cross-modal attention operations in vision-language fusion layers of early-fusion models cause high background false positive rates. They show that these layers distribute irrelevant class information across image features when no prompted object is present, unlike late-interaction models which do not exhibit this behavior.
The authors propose a simple, training-free solution that appends attention sink tokens to input prompts, which redirect spurious attention and dramatically reduce background false positives. This approach significantly improves performance across all six tested early-fusion models (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models) with minimal impact on positive sample detection.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Identification and quantification of background false positive problem in early-fusion OVDs
The authors identify and quantify a critical flaw in early-fusion open-vocabulary detectors: high rates of confident false positive predictions on background-only images (images without target objects). They demonstrate that standard benchmarks like COCO and LVIS mask this issue because they rarely contain images without target classes, and propose an adaptation to existing benchmarks to measure background false positive rates.
[11] Fine-Grained Open-Vocabulary Object Detection with Fine-Grained Prompts: Task, Dataset and Benchmark
[37] Marvelovd: Marrying object recognition and vision-language models for robust open-vocabulary object detection
[38] Integration of Foundation Models into Cognitive Architectures: Perception and Planning in Dynamic and Unstructured Environments
[39] From COCO to COCO-FP: A Deep Dive into Background False Positives for COCO Detectors
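The benchmark adaptation described above can be sketched as a background false-positive metric: run the detector over images known to contain none of the prompted classes, and count how often it still fires a confident box. This is a minimal illustration; the `detector` callable and its return signature are assumptions for the sketch, not the paper's actual evaluation API.

```python
def background_fp_rate(detector, background_images, prompts, conf_thresh=0.3):
    """Fraction of background-only images (containing none of the prompted
    classes) on which the detector still produces a confident detection.

    `detector` is a hypothetical callable returning (boxes, scores, labels);
    real open-vocabulary detectors expose different interfaces.
    """
    fired = 0
    for img in background_images:
        boxes, scores, labels = detector(img, prompts)
        if len(scores) > 0 and max(scores) >= conf_thresh:
            fired += 1  # a confident false positive on a background image
    return fired / max(len(background_images), 1)
```

Because standard COCO/LVIS splits rarely include such images, this rate is near-invisible under conventional AP evaluation, which is the gap the authors' benchmark adaptation targets.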
Explanation of false positives through vision-language fusion layer analysis
The authors establish that cross-modal attention operations in vision-language fusion layers of early-fusion models cause high background false positive rates. They show that these layers distribute irrelevant class information across image features when no prompted object is present, unlike late-interaction models which do not exhibit this behavior.
[27] Hallucination of multimodal large language models: A survey
[28] Vmad: Visual-enhanced multimodal large language model for zero-shot anomaly detection
[29] Cross-modal Causal Relation Alignment for Video Question Grounding
[30] A Review of Multi-Sensor Fusion in Autonomous Driving
[31] A Dual-state Based Surface Anomaly Detection Model for Rail Transit Trains Using Vision-language Model
[32] Entity-Aware Cross-Modal Fusion Network for Fine-Grained Entity Consistency Verification in Multimodal News Misinformation Detection
[33] Cross-Modal Vision Representation Learning for Real-World Visual Understanding
[34] Multimodal Vision-Language Modeling for Advanced Quantitative Analysis of Positron Emission Tomography Imaging
[35] Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models
[36] Cross-modal Mitigation of Spurious Correlation for Prompt-tuning in VLMs with Causally Motivated Logic Alignment
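The mechanism blamed in this contribution can be made concrete with a toy image-to-text cross-attention step. Because the softmax rows sum to one, every image token must absorb a full convex combination of the prompt embeddings even when all similarities are weak, i.e., on a background image with no matching object. This is a single-head sketch under simplifying assumptions (no learned projections, no multi-head structure), not the fusion layer of any specific detector.

```python
import numpy as np

def cross_attention(img_feats, txt_feats):
    """Toy single-head image-to-text cross-attention.

    Illustrative only: real early-fusion detectors use learned Q/K/V
    projections and many heads, but the softmax normalization shown
    here is the property at issue.
    """
    d = img_feats.shape[-1]
    logits = (img_feats @ txt_feats.T) / np.sqrt(d)   # (n_img, n_txt)
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)          # each row sums to 1
    fused = attn @ txt_feats                          # text content mixed into image tokens
    return fused, attn
```

With near-zero similarities the attention becomes near-uniform, so class information from the prompts is distributed across all image features regardless of whether any prompted object is present; late-interaction models, which compare features only after independent encoding, avoid this injection.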
Training-free attention sink method for mitigating background false positives
The authors propose a simple, training-free solution that appends attention sink tokens to input prompts, which redirect spurious attention and dramatically reduce background false positives. This approach significantly improves performance across all six tested early-fusion models (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models) with minimal impact on positive sample detection.
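The proposed intervention can be sketched at the prompt level: append extra sink prompts that give spurious attention somewhere harmless to go, then discard any detection assigned to a sink at inference time. The helper names, the placeholder `sink_text`, and the boolean keep-mask are illustrative assumptions; the paper's exact token construction may differ.

```python
def add_sink_tokens(class_prompts, n_sinks=1, sink_text="background"):
    """Append attention-sink prompts to the class prompt list.

    `sink_text` is an illustrative placeholder, not necessarily the
    token the authors use. Returns the augmented prompt list and a
    keep-mask marking which prompt indices are real classes.
    """
    sinks = [sink_text] * n_sinks
    prompts = list(class_prompts) + sinks
    keep = [True] * len(class_prompts) + [False] * n_sinks
    return prompts, keep

def filter_sink_detections(labels, scores, keep):
    """Drop any detection whose predicted label index points at a sink."""
    return [(l, s) for l, s in zip(labels, scores) if keep[l]]
```

Because the change is confined to the prompt list and a post-hoc filter, no retraining or weight modification is required, which is what makes the method applicable across all six tested early-fusion models.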