Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: CAPTCHA, multimodal models, spatial reasoning, robustness, evaluation benchmark
Abstract:

Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multimodal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present Spatial CAPTCHA, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs that rely on low-level perception tasks vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation: skills intuitive for humans but difficult for current AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. A further comparison with Google reCAPTCHA confirms the effectiveness of Spatial CAPTCHA as both a security mechanism and a diagnostic tool for spatial reasoning in AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Spatial CAPTCHA, a human-verification framework exploiting spatial reasoning gaps between humans and MLLMs. It resides in the 'Spatial CAPTCHA and Verification Systems' leaf under 'Human-Machine Differentiation and Security Applications', where it is currently the sole paper. This isolation suggests the work occupies a sparse, emerging research direction within the broader spatial reasoning landscape, which comprises 36 papers across diverse benchmarks, methods, and application domains. The taxonomy reveals that while spatial reasoning evaluation is well-populated, security-oriented applications leveraging these gaps remain underexplored.

The taxonomy tree shows neighboring leaves include 'Human-Machine Performance Comparison Studies' (2 papers) and 'Object Authenticity and Visual Discrimination' (1 paper), both focused on empirical performance analysis rather than security applications. Broader sibling branches address 'Spatial Reasoning Benchmarks' (13 papers across 4 leaves) and 'Methods for Enhancing Spatial Reasoning' (5 papers across 3 leaves). The original paper diverges from these directions by applying observed spatial reasoning deficits to practical verification tasks, rather than benchmarking capabilities or improving model performance. Its scope note explicitly excludes general performance comparisons and object classification, positioning it as a security-focused application distinct from adjacent evaluation-centric work.

Among the 23 candidates examined, none clearly refutes the three core contributions. For the 'Spatial CAPTCHA framework' contribution, 10 candidates were examined with zero refutable overlaps; for the 'Procedural generation pipeline', 3 candidates, with similar results; and for 'Spatial-CAPTCHA-Bench', 10 candidates, none of which provides an overlapping benchmark. Within the top-K semantic matches and citation expansions analyzed, then, no existing work combines spatial reasoning challenges with CAPTCHA-style verification and automated generation pipelines. However, the analysis does not claim exhaustive coverage of all possible prior art in the security or spatial reasoning domains.

Given the limited search scope of 23 candidates, the work appears novel in its specific application of spatial reasoning to human verification. The taxonomy context reinforces this impression: the leaf contains only this paper, and adjacent leaves focus on performance analysis rather than security. However, the analysis cannot rule out relevant prior work in broader CAPTCHA literature, adversarial robustness, or cognitive security domains not captured by the semantic search. The novelty assessment is thus conditional on the examined candidate set and the taxonomy's coverage of spatial reasoning research.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: spatial reasoning for human-machine differentiation. The field encompasses diverse branches that collectively address how AI systems understand and manipulate spatial information, and how these capabilities can distinguish human from machine cognition. The taxonomy reveals several major directions: benchmarks and evaluation frameworks that systematically test spatial abilities across modalities; methods for enhancing spatial reasoning through architectural innovations, training strategies, and cognitive-inspired approaches; human-machine differentiation and security applications that exploit spatial reasoning gaps; application domains spanning robotics, autonomous driving, and interactive systems; cognitive foundations drawing from developmental psychology and neuroscience; and modality-specific challenges in vision, language, and embodied settings.

Representative works such as Visual Spatial Reasoning[1] and What's Up VLMs[2] illustrate benchmark development, while Picture Worth Spatial[3] and TopViewRS[4] explore domain-specific spatial understanding. Recent activity highlights contrasting themes: some lines pursue comprehensive benchmarking of vision-language models on spatial tasks (e.g., 11plus bench[5], Sphere Blind Spots[6]), revealing persistent weaknesses in rotation, orientation, and relational reasoning, while others investigate cognitive and theoretical underpinnings (Reframing Spatial Reasoning[9], Mind Meets Space[10]) to inform better architectures.

The original paper, Spatial CAPTCHA[0], sits squarely within the human-machine differentiation branch alongside Vision CAPTCHA Reasoning[22], leveraging spatial reasoning challenges as verification mechanisms. Unlike broader benchmarks that assess general spatial competence, this work focuses on exploiting the gap between human intuitive spatial processing and current AI limitations for security purposes. This positions it as a practical application of observed weaknesses, contrasting with efforts such as CoT Spatial Reasoning[16] and Multimodal Spatial Struggle[25] that aim to close the performance gap through improved reasoning strategies.

Claimed Contributions

Spatial CAPTCHA framework for human-machine differentiation

The authors introduce a new CAPTCHA system that exploits the gap between human and machine spatial reasoning capabilities. The framework generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation—skills intuitive for humans but challenging for current AI systems.

10 retrieved papers

Procedural generation pipeline with constraint-based difficulty control

The authors develop an autonomous pipeline that can generate unlimited CAPTCHA instances with controlled difficulty levels. The system includes mechanisms for automated correctness verification and human validation to ensure the generated challenges are both scalable and robust for real-world deployment.

3 retrieved papers

Spatial-CAPTCHA-Bench benchmark dataset

The authors create a benchmark comprising 1050 instances across seven task formulations and four spatial-ability categories, stratified into three difficulty levels. This benchmark enables standardized offline evaluation of both human and machine spatial reasoning capabilities.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Spatial CAPTCHA framework for human-machine differentiation

The authors introduce a new CAPTCHA system that exploits the gap between human and machine spatial reasoning capabilities. The framework generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation—skills intuitive for humans but challenging for current AI systems.
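To make the "mental rotation" claim concrete, the following is a minimal, self-contained sketch of how such a challenge could be generated and machine-verified. It is our own toy construction, not the paper's implementation: the target is a chiral tetracube (a 3-D polycube whose mirror image cannot be reached by rotation), the correct option is a random rotation of it, and the distractors are rotations of its mirror image, so exactly one option is correct by construction.

```python
import itertools
import random

def proper_rotations():
    """All 24 rotation matrices of the cube, as 3x3 tuples (determinant +1)."""
    mats = []
    even = ((0, 1, 2), (1, 2, 0), (2, 0, 1))  # even permutations of the axes
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1, -1), repeat=3):
            # det of a signed permutation matrix = parity * product of signs
            parity = 1 if perm in even else -1
            if parity * signs[0] * signs[1] * signs[2] != 1:
                continue  # skip reflections
            m = [[0] * 3 for _ in range(3)]
            for row in range(3):
                m[row][perm[row]] = signs[row]
            mats.append(tuple(map(tuple, m)))
    return mats

def apply(r, cell):
    """Apply rotation matrix r to an integer coordinate triple."""
    return tuple(sum(r[i][j] * cell[j] for j in range(3)) for i in range(3))

def canon(cells):
    """Translate a polycube so its minimum corner sits at the origin."""
    cells = list(cells)
    mins = tuple(min(c[i] for c in cells) for i in range(3))
    return frozenset(tuple(c[i] - mins[i] for i in range(3)) for c in cells)

def is_rotation_of(a, b):
    """True if polycube b is a proper rotation (plus translation) of a."""
    cb = canon(b)
    return any(canon(apply(r, c) for c in a) == cb for r in proper_rotations())

def make_item(rng):
    """One multiple-choice item: exactly one option is a rotation of target."""
    target = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}  # chiral tetracube
    mirror = {(-x, y, z) for x, y, z in target}             # its mirror image
    rots = proper_rotations()
    def spin(cells):
        r = rng.choice(rots)
        return canon(apply(r, c) for c in cells)
    options = [spin(mirror), spin(mirror), spin(target)]
    rng.shuffle(options)
    answer = next(i for i, o in enumerate(options) if is_rotation_of(target, o))
    return target, options, answer
```

A deployed system would render the target and options as 3-D scenes for the user; here they remain coordinate sets so correctness can be checked automatically, which is the property the framework's automated verification step relies on.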

Contribution

Procedural generation pipeline with constraint-based difficulty control

The authors develop an autonomous pipeline that can generate unlimited CAPTCHA instances with controlled difficulty levels. The system includes mechanisms for automated correctness verification and human validation to ensure the generated challenges are both scalable and robust for real-world deployment.
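The generate-and-verify loop described above can be illustrated with a small sketch. Everything below (the perspective-taking toy task, the difficulty parameters, and the function names) is our own assumption, not the paper's pipeline: item parameters are rejection-sampled until they satisfy a difficulty-indexed constraint set, the ground-truth answer is derived analytically, and a verification pass re-checks the answer before the item is emitted.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Difficulty:
    grid: int      # side length of the layout grid
    min_dist: int  # minimum observer-object distance (larger = harder)

# Hypothetical difficulty levels; the paper's actual constraints are not given.
LEVELS = {"easy": Difficulty(grid=5, min_dist=1),
          "medium": Difficulty(grid=9, min_dist=3),
          "hard": Difficulty(grid=15, min_dist=6)}

HEADINGS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def relative_side(observer, heading, obj):
    """Classify obj as front/back/left/right of an axis-aligned observer."""
    hx, hy = HEADINGS[heading]
    dx, dy = obj[0] - observer[0], obj[1] - observer[1]
    forward = dx * hx + dy * hy   # projection onto the heading
    sideways = dx * hy - dy * hx  # signed cross product (positive = right)
    if abs(forward) > abs(sideways):
        return "front" if forward > 0 else "back"
    return "right" if sideways > 0 else "left"

def make_item(level, rng):
    """Rejection-sample until all constraints for this difficulty hold."""
    cfg = LEVELS[level]
    while True:
        obs = (rng.randrange(cfg.grid), rng.randrange(cfg.grid))
        obj = (rng.randrange(cfg.grid), rng.randrange(cfg.grid))
        dx, dy = obj[0] - obs[0], obj[1] - obs[1]
        if abs(dx) + abs(dy) < cfg.min_dist:
            continue  # too close for this difficulty level
        if abs(dx) == abs(dy):
            continue  # diagonal placements make the answer ambiguous
        heading = rng.choice(sorted(HEADINGS))
        return {"observer": obs, "heading": heading, "object": obj,
                "answer": relative_side(obs, heading, obj)}

def verify(item):
    """Automated correctness check run before an item is ever served."""
    return item["answer"] == relative_side(
        item["observer"], item["heading"], item["object"])
```

In a full pipeline, items passing `verify` would still be sampled into a human-validation queue, matching the human-in-the-loop step the report attributes to the system; that stage is omitted here.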

Contribution

Spatial-CAPTCHA-Bench benchmark dataset

The authors create a benchmark comprising 1050 instances across seven task formulations and four spatial-ability categories, stratified into three difficulty levels. This benchmark enables standardized offline evaluation of both human and machine spatial reasoning capabilities.
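The benchmark's stated structure (1050 instances, seven task formulations, four spatial-ability categories, three difficulty levels) and the Pass@1 metric from the abstract can be sketched as follows. The allocation across strata is not specified in the report, so this sketch simply spreads items round-robin over the 7 × 4 × 3 = 84 strata; the category names are taken from the abstract, while the formulation labels and function names are placeholders of ours.

```python
from collections import defaultdict
from itertools import cycle, product

FORMULATIONS = [f"task_{i}" for i in range(7)]   # placeholder labels
ABILITIES = ["geometric", "perspective", "occlusion", "mental_rotation"]
DIFFICULTY = ["easy", "medium", "hard"]

def build_manifest(n_items=1050):
    """Spread n_items round-robin over all formulation/ability/difficulty strata."""
    strata = cycle(product(FORMULATIONS, ABILITIES, DIFFICULTY))
    return [{"id": i, "formulation": f, "ability": a, "difficulty": d}
            for i, (f, a, d) in zip(range(n_items), strata)]

def pass_at_1(results, key="difficulty"):
    """Pass@1 per group: fraction of items answered correctly on the first try.

    results: iterable of (item, first_attempt_correct: bool) pairs.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for item, correct in results:
        totals[item[key]] += 1
        hits[item[key]] += int(correct)
    return {k: hits[k] / totals[k] for k in totals}
```

Grouping by `"ability"` or `"formulation"` instead of `"difficulty"` gives the other breakdowns a benchmark report would typically include; the headline figure (e.g., the 31.0% best-model Pass@1 cited in the abstract) corresponds to the ungrouped average.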