Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: object counting, MLLMs, weakly-supervised, class-agnostic counting
Abstract:

Object counting is a fundamental task in computer vision with broad applicability in real-world scenarios. Fully-supervised counting methods require costly point-level annotations for every object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results; they are, however, often limited to counting a single category, e.g., persons. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which is challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing. First, a divide-and-discern dialogue tuning strategy guides the MLLM to determine whether the object count falls within a specific range and progressively narrows that range through multi-round dialogue. Second, a compare-and-rank count optimization strategy trains the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
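As a rough illustration of the third strategy, a global count estimate and aggregated tile-level estimates can be combined as sketched below. Here `count_fn` is a hypothetical stand-in for one MLLM counting query on an image crop, and the simple averaging used for fusion is an assumption; the paper's actual fusion rule may differ.

```python
def global_local_count(count_fn, image, grid=2):
    """Fuse a global count with aggregated tile-level counts.

    count_fn(img) stands in for one counting query on an image crop;
    `image` is any 2-D nested list that supports slicing.
    In dense scenes the tile-level sum tends to be more reliable than a
    single global query, so the two estimates are averaged here.
    """
    h, w = len(image), len(image[0])
    global_count = count_fn(image)
    local_sum = 0
    for r in range(grid):
        for c in range(grid):
            # Crop one tile of the grid x grid partition.
            tile = [row[c * w // grid:(c + 1) * w // grid]
                    for row in image[r * h // grid:(r + 1) * h // grid]]
            local_sum += count_fn(tile)
    return 0.5 * (global_count + local_sum)
```

With an exact `count_fn`, the fused estimate equals the true count, since both the global query and the tile sum recover it.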

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 3

Research Landscape Overview

Core task: weakly-supervised class-agnostic object counting. The field is organized along several key dimensions. The first branch, Supervision Signal and Annotation Strategy, explores how to reduce labeling costs by moving from dense point annotations to weaker signals such as count-level supervision (Object Counts Labels[7], Count-level Supervision[18]) or even annotation-free methods (Annotation-Free Counting[43]). The second branch, Core Counting Approach and Architecture, encompasses the algorithmic backbone, ranging from density-map regression (TransCrowd[3], CrowdMLP[8]) to transformer-based frameworks (CountFormer[38], CountingDINO[37]) and segmentation-driven pipelines (Counting-by-Segmentation[16]). A third branch, Class-Agnostic Generalization and Multi-Class Handling, addresses the challenge of counting arbitrary object categories without retraining, often leveraging few-shot or zero-shot paradigms (Class-agnostic Few-shot Counting[9], Zero-shot Exemplars[33]). Application Domain and Task Specialization captures domain-specific adaptations (Apple Orchard Counting[4], Microorganism Enumeration[14]), while Surveys, Benchmarks, and Theoretical Frameworks (Class-Agnostic Counting Survey[12]) provide overarching perspectives and evaluation protocols. Recent work has intensified around two contrasting themes: reducing supervision overhead versus improving cross-category generalization. On one hand, methods like Unified Count-based Learning[39] and Temporal Classification Counting[41] push count-level supervision to handle multi-class scenarios with minimal annotation; on the other, few-shot and zero-shot approaches (Few-shot Occlusion Counting[2], Training-free Baseline[47]) aim for rapid adaptation to novel object types.
Bootstrapping MLLM Counting[0] sits within the count-level supervision cluster, closely aligned with Unified Count-based Learning[39] and Temporal Classification Counting[41], yet it distinguishes itself by leveraging multimodal large language models to bootstrap weak count signals into richer representations. Compared to Unified Count-based Learning[39], which unifies multiple count-based tasks under a single framework, Bootstrapping MLLM Counting[0] emphasizes the role of pretrained vision-language models in bridging the gap between minimal supervision and robust class-agnostic performance, reflecting a broader trend toward foundation-model-driven counting solutions.

Claimed Contributions

WS-COC: first MLLM-driven weakly-supervised framework for class-agnostic object counting

The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.

10 retrieved papers
Divide-and-discern dialogue tuning strategy

This strategy reformulates count prediction as a series of range judgment tasks through multi-round dialogue, progressively narrowing the range from coarse to fine, enabling the model to learn counting in a curriculum manner from easy to hard scenarios.

2 retrieved papers
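The range-narrowing procedure described in this contribution resembles a binary search driven by yes/no dialogue rounds. The sketch below illustrates that reading; `ask_in_range` is a hypothetical stand-in for one MLLM dialogue round, and the paper's actual range boundaries and narrowing schedule may differ.

```python
def divide_and_discern(ask_in_range, lo=0, hi=512):
    """Narrow a count range via multi-round yes/no range queries.

    ask_in_range(lo, hi) stands in for one dialogue round with the MLLM:
    it returns True if the model judges the object count to lie in
    the half-open interval [lo, hi).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # One dialogue round: "Is the count between lo and mid?"
        if ask_in_range(lo, mid):
            hi = mid  # count confirmed in the lower half
        else:
            lo = mid  # otherwise it lies in the upper half
    return lo
```

With a perfect range oracle, the procedure recovers the exact count in O(log(hi - lo)) dialogue rounds, which matches the coarse-to-fine, easy-to-hard curriculum the contribution describes.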
Compare-and-rank count optimization strategy

This strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts, addressing the modality gap between visual features and discrete count values through a more visually accessible comparison task.

10 retrieved papers
Can Refute
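One common way to realize count-based ranking supervision of the kind this contribution describes is a pairwise margin ranking loss over a batch of images. The sketch below is a minimal illustration under that assumption, not the paper's exact objective; `scores` and `counts` are hypothetical names for per-image model outputs and image-level ground-truth counts.

```python
def pairwise_ranking_loss(scores, counts, margin=1.0):
    """Hinge-style pairwise ranking loss over a batch of images.

    scores: model-predicted count scores, one per image.
    counts: ground-truth image-level counts (the only supervision used).
    For every ordered pair (i, j) with counts[i] > counts[j], the
    predicted score s_i should exceed s_j by at least `margin`.
    """
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if counts[i] > counts[j]:
                # Penalize pairs whose predicted ordering violates the margin.
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

Because the loss only depends on the relative order of predictions, it never asks the model to emit an exact count, which is one way to sidestep the modality gap between visual features and discrete count values.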

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WS-COC: first MLLM-driven weakly-supervised framework for class-agnostic object counting

The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.

Contribution

Divide-and-discern dialogue tuning strategy

This strategy reformulates count prediction as a series of range judgment tasks through multi-round dialogue, progressively narrowing the range from coarse to fine, enabling the model to learn counting in a curriculum manner from easy to hard scenarios.

Contribution

Compare-and-rank count optimization strategy

This strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts, addressing the modality gap between visual features and discrete count values through a more visually accessible comparison task.
