Bootstrapping MLLM for Weakly‑Supervised Class‑Agnostic Object Counting
Claimed Contributions
The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.
The divide-and-discern dialogue tuning strategy reformulates count prediction as a series of range-judgment tasks posed through multi-round dialogue. The range narrows progressively from coarse to fine, so the model learns counting in a curriculum manner, from easy to hard scenarios.
The compare-and-rank count optimization strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts. This addresses the modality gap between visual features and discrete count values through a comparison task that is more visually accessible.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[39] A unified approach to count-based weakly supervised learning
[41] Learning from counting: Leveraging temporal classification for weakly supervised object localization and detection
Contribution Analysis
Detailed comparisons for each claimed contribution
WS-COC: first MLLM-driven weakly-supervised framework for class-agnostic object counting
The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.
[3] TransCrowd: weakly-supervised crowd counting with transformers
[5] Weakly supervised crowd counting with joint CNN and transformer network
[6] Learning to Count Anything: Reference-less Class-agnostic Counting with Weak Supervision
[30] Hypergraph Association Weakly Supervised Crowd Counting
[60] CrowdFormer: Weakly-supervised crowd counting with improved generalizability
[61] Counting Fish with Temporal Representations of Sonar Video
[62] Weakly Supervised Crowd Counting via Depth and Density Perception with Dispersed Attention in Smart Surveillance of HMI
[63] Rethinking the route towards weakly supervised object localization
[64] CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
[65] Learning Object Detection with Weak Supervision
Divide-and-discern dialogue tuning strategy
This strategy reformulates count prediction as a series of range judgment tasks through multi-round dialogue, progressively narrowing the range from coarse to fine, enabling the model to learn counting in a curriculum manner from easy to hard scenarios.
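The coarse-to-fine narrowing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_range_dialogue`, the halving rule, the initial range, and the prompt wording are all assumptions chosen for illustration. The sketch only shows how a single image-level count could be turned into a sequence of range-judgment turns, with early turns being coarse (easy) and later turns fine (hard).

```python
from typing import List, Tuple


def build_range_dialogue(
    true_count: int, low: int = 0, high: int = 1024, min_width: int = 1
) -> List[Tuple[str, str]]:
    """Turn one image-level count into (question, answer) dialogue turns.

    Hypothetical helper: each turn asks which half of the current range
    [low, high) contains the count, then narrows to that half.
    """
    if not (low <= true_count < high):
        raise ValueError("true_count must lie in [low, high)")

    turns: List[Tuple[str, str]] = []
    while high - low > min_width:
        mid = (low + high) // 2
        question = (
            f"Is the number of objects in the image in "
            f"[{low}, {mid}) or [{mid}, {high})?"
        )
        # Keep the half that contains the ground-truth count.
        if true_count < mid:
            high = mid
        else:
            low = mid
        turns.append((question, f"[{low}, {high})"))
    return turns


if __name__ == "__main__":
    for question, answer in build_range_dialogue(37, low=0, high=64):
        print(question, "->", answer)
```

Under these assumptions, a count of 37 in the range [0, 64) yields six turns, ending with the single-value range [37, 38). Halving is only one possible schedule; the source does not specify how the ranges are chosen.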
[66] Moments of Connection: Exploring How Caregivers Shape Reciprocal Interactions With Non-Speaking Autistic Children
[67] Improving Language-Focused Comprehension Instruction in Primary-Grade Classrooms: Impacts of the Let's Know! Experimental Curriculum
Compare-and-rank count optimization strategy
This strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts, addressing the modality gap between visual features and discrete count values through a more visually accessible comparison task.
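As a rough illustration of the ranking task described above, the sketch below builds a ranking sample from image-level counts and scores a predicted ordering by pairwise agreement. The names `build_ranking_sample` and `pairwise_agreement`, the prompt wording, and the agreement metric are assumptions introduced here, not details from the paper; the actual training objective may differ.

```python
from itertools import combinations
from typing import Dict, List


def build_ranking_sample(counts: Dict[str, int]) -> Dict[str, object]:
    """Build a hypothetical ranking prompt and its target ordering.

    `counts` maps an image identifier to its image-level object count.
    """
    image_ids = list(counts)
    # Ascending by count; sorted() is stable, so ties keep input order.
    target = sorted(image_ids, key=lambda image_id: counts[image_id])
    prompt = (
        "Rank these images from fewest to most objects: "
        + ", ".join(image_ids)
    )
    return {"prompt": prompt, "target": target}


def pairwise_agreement(predicted: List[str], counts: Dict[str, int]) -> float:
    """Fraction of image pairs (with different counts) ordered correctly."""
    pairs = [
        (a, b)
        for a, b in combinations(predicted, 2)
        if counts[a] != counts[b]
    ]
    if not pairs:
        return 1.0
    # In `predicted`, a comes before b, so a should have the smaller count.
    correct = sum(counts[a] < counts[b] for a, b in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    counts = {"img_a": 12, "img_b": 3, "img_c": 40}
    sample = build_ranking_sample(counts)
    print(sample["prompt"])
    print(sample["target"])
    print(pairwise_agreement(["img_a", "img_b", "img_c"], counts))
```

The point of the sketch is that the supervision signal comes only from the relative order of counts, which needs no point-level annotation.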