Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: object counting, MLLMs, weakly-supervised, class-agnostic counting
Abstract:

Object counting is a fundamental task in computer vision with broad applicability in real-world scenarios. Fully-supervised counting methods require costly point-level annotations for every object. A few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results; they are, however, often limited to counting a single category, e.g., persons. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which is challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing. First, a divide-and-discern dialogue tuning strategy guides the MLLM to determine whether the object count falls within a specific range and progressively narrows that range through multi-round dialogue. Second, a compare-and-rank count optimization strategy trains the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
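As a rough illustration of the third strategy, a global count estimate and aggregated tile-level estimates can be combined as sketched below. Here `count_fn` is a hypothetical stand-in for one MLLM counting query on an image crop, and the simple averaging used for fusion is an assumption; the paper's actual fusion rule may differ.

```python
def global_local_count(count_fn, image, grid=2):
    """Fuse a global count with aggregated tile-level counts.

    count_fn(img) stands in for one counting query on an image crop;
    `image` is any 2-D nested list that supports slicing.
    In dense scenes the tile-level sum tends to be more reliable than a
    single global query, so the two estimates are averaged here.
    """
    h, w = len(image), len(image[0])
    global_count = count_fn(image)
    local_sum = 0
    for r in range(grid):
        for c in range(grid):
            # Crop one tile of the grid x grid partition.
            tile = [row[c * w // grid:(c + 1) * w // grid]
                    for row in image[r * h // grid:(r + 1) * h // grid]]
            local_sum += count_fn(tile)
    return 0.5 * (global_count + local_sum)
```

With an exact `count_fn`, the fused estimate equals the true count, since both the global query and the tile sum recover it.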

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 3

Research Landscape Overview

Core task: weakly-supervised class-agnostic object counting. The field is organized along several key dimensions. The first branch, Supervision Signal and Annotation Strategy, explores how to reduce labeling costs by moving from dense point annotations to weaker signals such as count-level supervision (Object Counts Labels[7], Count-level Supervision[18]) or even annotation-free methods (Annotation-Free Counting[43]). The second branch, Core Counting Approach and Architecture, encompasses the algorithmic backbone, ranging from density-map regression (TransCrowd[3], CrowdMLP[8]) to transformer-based frameworks (CountFormer[38], CountingDINO[37]) and segmentation-driven pipelines (Counting-by-Segmentation[16]). A third branch, Class-Agnostic Generalization and Multi-Class Handling, addresses the challenge of counting arbitrary object categories without retraining, often leveraging few-shot or zero-shot paradigms (Class-agnostic Few-shot Counting[9], Zero-shot Exemplars[33]). Application Domain and Task Specialization captures domain-specific adaptations (Apple Orchard Counting[4], Microorganism Enumeration[14]), while Surveys, Benchmarks, and Theoretical Frameworks (Class-Agnostic Counting Survey[12]) provide overarching perspectives and evaluation protocols. Recent work has intensified around two contrasting themes: reducing supervision overhead versus improving cross-category generalization. On one hand, methods like Unified Count-based Learning[39] and Temporal Classification Counting[41] push count-level supervision to handle multi-class scenarios with minimal annotation; on the other, few-shot and zero-shot approaches (Few-shot Occlusion Counting[2], Training-free Baseline[47]) aim for rapid adaptation to novel object types.
Bootstrapping MLLM Counting[0] sits within the count-level supervision cluster, closely aligned with Unified Count-based Learning[39] and Temporal Classification Counting[41], yet it distinguishes itself by leveraging multimodal large language models to bootstrap weak count signals into richer representations. Compared to Unified Count-based Learning[39], which unifies multiple count-based tasks under a single framework, Bootstrapping MLLM Counting[0] emphasizes the role of pretrained vision-language models in bridging the gap between minimal supervision and robust class-agnostic performance, reflecting a broader trend toward foundation-model-driven counting solutions.

Claimed Contributions

WS-COC: first MLLM-driven weakly-supervised framework for class-agnostic object counting

The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.

10 retrieved papers
Divide-and-discern dialogue tuning strategy

This strategy reformulates count prediction as a series of range judgment tasks through multi-round dialogue, progressively narrowing the range from coarse to fine, enabling the model to learn counting in a curriculum manner from easy to hard scenarios.

2 retrieved papers
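The range-narrowing procedure described in this contribution resembles a binary search driven by yes/no dialogue rounds. The sketch below illustrates that reading; `ask_in_range` is a hypothetical stand-in for one MLLM dialogue round, and the paper's actual range boundaries and narrowing schedule may differ.

```python
def divide_and_discern(ask_in_range, lo=0, hi=512):
    """Narrow a count range via multi-round yes/no range queries.

    ask_in_range(lo, hi) stands in for one dialogue round with the MLLM:
    it returns True if the model judges the object count to lie in
    the half-open interval [lo, hi).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # One dialogue round: "Is the count between lo and mid?"
        if ask_in_range(lo, mid):
            hi = mid  # count confirmed in the lower half
        else:
            lo = mid  # otherwise it lies in the upper half
    return lo
```

With a perfect range oracle, the procedure recovers the exact count in O(log(hi - lo)) dialogue rounds, which matches the coarse-to-fine, easy-to-hard curriculum the contribution describes.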
Compare-and-rank count optimization strategy

This strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts, addressing the modality gap between visual features and discrete count values through a more visually accessible comparison task.

10 retrieved papers
Can Refute
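One common way to realize count-based ranking supervision of the kind this contribution describes is a pairwise margin ranking loss over a batch of images. The sketch below is a minimal illustration under that assumption, not the paper's exact objective; `scores` and `counts` are hypothetical names for per-image model outputs and image-level ground-truth counts.

```python
def pairwise_ranking_loss(scores, counts, margin=1.0):
    """Hinge-style pairwise ranking loss over a batch of images.

    scores: model-predicted count scores, one per image.
    counts: ground-truth image-level counts (the only supervision used).
    For every ordered pair (i, j) with counts[i] > counts[j], the
    predicted score s_i should exceed s_j by at least `margin`.
    """
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if counts[i] > counts[j]:
                # Penalize pairs whose predicted ordering violates the margin.
                loss += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return loss / max(pairs, 1)
```

Because the loss only depends on the relative order of predictions, it never asks the model to emit an exact count, which is one way to sidestep the modality gap between visual features and discrete count values.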

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

WS-COC: first MLLM-driven weakly-supervised framework for class-agnostic object counting

The authors introduce WS-COC, a novel framework that leverages multimodal large language models for class-agnostic object counting using only image-level count supervision, eliminating the need for costly point-level annotations while achieving competitive performance with fully-supervised methods.

Contribution

Divide-and-discern dialogue tuning strategy

This strategy reformulates count prediction as a series of range judgment tasks through multi-round dialogue, progressively narrowing the range from coarse to fine, enabling the model to learn counting in a curriculum manner from easy to hard scenarios.

Contribution

Compare-and-rank count optimization strategy

This strategy trains the MLLM to judge relative count differences between images by ranking them according to their object counts, addressing the modality gap between visual features and discrete count values through a more visually accessible comparison task.
