Fantastic Tractor-Dogs and How Not to Find Them With Open-Vocabulary Detectors

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: open-vocabulary, object detection, vision-language, false positives
Abstract:

Open-Vocabulary Detectors (OVDs) excel in zero-shot benchmarks, but we observe a critical flaw in real-world deployment: a high rate of confident false positive predictions on images that do not contain any target objects (e.g., detecting a tractor in an image of a dog). This issue is masked by standard benchmarks like COCO and LVIS, as they rarely contain images without any of the target classes present. We identify vision-language fusion layers in early-fusion OVD architectures (e.g., Grounding DINO or LLMDet) as the root cause, and show how they distribute irrelevant class information across image features when no prompted object is present. To mitigate background false positives without costly retraining, we propose a simple, training-free method: appending attention sink tokens to the input prompt. We show that such sinks can redirect spurious attention and dramatically reduce background false positives. Our approach significantly improves the performance of all six early-fusion models tested (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models), making them practical for real-world applications where images without the object of interest are much more prevalent.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a training-free method using attention sink tokens to mitigate background false positives in early-fusion open-vocabulary detectors. It resides in the 'Attention Sink Tokens for Background Suppression' leaf under 'Attention and Feature Fusion Mechanisms', which currently contains only this work. This indicates a relatively sparse research direction within the broader taxonomy of 26 papers across multiple branches. The taxonomy shows that most prior efforts concentrate on training-based refinement, post-processing calibration, or objectness modeling, leaving architectural attention mechanisms less explored.

The paper's leaf sits alongside 'Disentangled Representation Learning' within the same parent branch, suggesting that attention and fusion mechanisms are an emerging but not yet crowded area. Neighboring branches include 'Training-Based Refinement' (four leaves covering pseudo-label filtering, retrieval augmentation, negative prompt learning, and background sample handling) and 'Post-Processing and Inference-Time Calibration' (three leaves for temperature scaling, confidence aggregation, and linear probing). The taxonomy's scope note clarifies that this branch excludes post-processing and training-based methods, positioning the work as an architectural intervention distinct from calibration or retraining strategies.

Among 21 candidates examined, no contributions were clearly refuted. The first contribution (identifying background false positives) examined 4 candidates with 0 refutable; the second (fusion layer analysis) examined 10 candidates with 0 refutable; the third (attention sink method) examined 7 candidates with 0 refutable. This suggests that within the limited search scope, no prior work directly overlaps with the specific combination of problem identification, mechanistic explanation, and training-free sink token solution. However, the search scale is modest, and the absence of refutation reflects the examined sample rather than exhaustive coverage.

Given the limited search scope of 21 candidates, the work appears novel in its specific approach to background false positives through attention sinks. The taxonomy structure confirms that attention-based architectural interventions are less populated than training or calibration methods. While the analysis does not guarantee no prior work exists beyond the examined candidates, the combination of problem framing, mechanistic insight, and training-free solution appears distinct within the surveyed literature.

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: Mitigating false positive predictions in open-vocabulary object detection. The field addresses the challenge of reducing spurious detections when models must recognize objects beyond their training vocabularies. The taxonomy reveals several complementary strategies: some branches focus on attention and feature fusion mechanisms that refine how visual and textual representations interact, while others emphasize training-based refinement through improved pseudo-labeling or post-processing calibration at inference time. A distinct line of work tackles objectness modeling and open-world scenarios, and additional branches explore few-shot settings, unified architectures, domain-specific applications, and rigorous benchmarking frameworks. Representative efforts include Taming Self-Training[1] for pseudo-label quality, Retrieval-Augmented OVOD[2] for feature enhancement, and Transferable Negative Prompts[3] for background suppression, illustrating the diversity of technical approaches.

Particularly active themes emerge around inference-time calibration versus training-time refinement, with works like Optimal Temperature Scaling[4] and Training-free Confidence Aggregation[5] exploring lightweight post-hoc adjustments, while Open-World Objectness[6] and Linear Probing OVOD[7] investigate deeper architectural or learning modifications.

Fantastic Tractor-Dogs[0] situates itself within the attention and feature fusion branch, specifically targeting background suppression through attention sink tokens, a mechanism closely related to the negative prompt strategies in Transferable Negative Prompts[3] but differing in its focus on internal attention dynamics rather than external prompt engineering. Compared to calibration-focused methods like Training-free Confidence Aggregation[5], Fantastic Tractor-Dogs[0] intervenes earlier in the feature extraction pipeline, aiming to prevent false positives at the representation level rather than correcting scores afterward.
This positioning highlights an ongoing tension between architectural interventions and post-processing solutions in managing open-vocabulary detection reliability.

Claimed Contributions

Identification and quantification of background false positive problem in early-fusion OVDs

The authors identify and quantify a critical flaw in early-fusion open-vocabulary detectors: high rates of confident false positive predictions on background-only images (images without target objects). They demonstrate that standard benchmarks like COCO and LVIS mask this issue because they rarely contain images without target classes, and propose an adaptation to existing benchmarks to measure background false positive rates.

4 retrieved papers
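
To make the metric implied by such a benchmark adaptation concrete, here is a minimal sketch (our own illustration, not the paper's evaluation code; `detections` and `background_ids` are hypothetical names): given detections on images known to contain none of the prompted classes, the background false positive rate is the fraction of those images that still receive at least one confident detection.

```python
# Hypothetical sketch of a background false positive rate metric.
# `detections` maps image id -> list of (label, score) predictions;
# `background_ids` lists images known to contain none of the prompted classes.

def background_fpr(detections, background_ids, score_thresh=0.3):
    """Fraction of background-only images with at least one confident detection."""
    if not background_ids:
        return 0.0
    false_positives = sum(
        1
        for img_id in background_ids
        if any(score >= score_thresh for _, score in detections.get(img_id, []))
    )
    return false_positives / len(background_ids)


dets = {"a": [("tractor", 0.9)], "b": [("dog", 0.1)], "c": []}
rate = background_fpr(dets, ["a", "b", "c"], score_thresh=0.3)  # one of three images fires
```

Sweeping `score_thresh` yields the operating points at which AP-versus-false-positive-rate trade-offs (such as the AP at a false positive rate of 0.01 cited above) can be reported.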
Explanation of false positives through vision-language fusion layer analysis

The authors establish that cross-modal attention operations in vision-language fusion layers of early-fusion models cause high background false positive rates. They show that these layers distribute irrelevant class information across image features when no prompted object is present, unlike late-interaction models which do not exhibit this behavior.

10 retrieved papers
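
The mechanism described above can be illustrated with a toy cross-attention computation (a sketch under our own assumptions, not the authors' analysis code): because softmax attention weights over the text tokens always sum to one, background image queries still receive a full mixture of class embeddings even when every image-to-text similarity is low.

```python
import numpy as np

# Toy illustration: softmax cross-attention must place a total weight of 1
# on the text tokens, so irrelevant class embeddings are mixed into image
# features even when no prompted object is present.

def cross_attend(image_queries, text_keys, text_values):
    scores = image_queries @ text_keys.T                 # (n_img, n_txt)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # each row sums to 1
    return weights @ text_values, weights


rng = np.random.default_rng(0)
img = rng.normal(scale=0.1, size=(4, 8))   # background patches: weak match to all classes
txt_k = rng.normal(size=(3, 8))            # e.g., "tractor", "dog", "car" keys
txt_v = rng.normal(size=(3, 8))

fused, w = cross_attend(img, txt_k, txt_v)
# Every row of `w` sums to 1: class information is injected regardless of match quality.
```

Late-interaction models avoid this by comparing image and text features only at the scoring stage, which is consistent with the contrast drawn above.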
Training-free attention sink method for mitigating background false positives

The authors propose a simple, training-free solution that appends attention sink tokens to input prompts, which redirect spurious attention and dramatically reduce background false positives. This approach significantly improves performance across all six tested early-fusion models (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models) with minimal impact on positive sample detection.

7 retrieved papers
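
A minimal sketch of the sink-token idea follows (our own toy construction, not the paper's implementation; the sink initialization, scale, and function names are assumptions). Extra sink keys give spurious attention somewhere harmless to go; pairing them with zero-valued values means the mass they absorb contributes nothing to the fused image features.

```python
import numpy as np

# Sketch: append "attention sink" tokens to the text side of cross-attention.
# Sinks have learnable-free random keys and zero values here, so any attention
# they soak up removes class information from background image features.

def cross_attend_with_sinks(image_queries, text_keys, text_values,
                            n_sinks=4, sink_scale=2.0, seed=0):
    rng = np.random.default_rng(seed)
    dim = text_keys.shape[1]
    sink_keys = sink_scale * rng.normal(size=(n_sinks, dim))
    sink_values = np.zeros((n_sinks, text_values.shape[1]))  # contribute nothing
    keys = np.concatenate([text_keys, sink_keys])
    values = np.concatenate([text_values, sink_values])
    scores = image_queries @ keys.T                          # (n_img, n_txt + n_sinks)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights
```

Because the sink columns always receive some positive weight, the mass landing on the real class tokens is strictly below one, which is the redirection effect described above; being purely an input-side change, it requires no retraining.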

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf containing no other papers, so no direct same-category comparisons were available. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, but that signal remains constrained by search coverage and taxonomy granularity.
