Spatial CAPTCHA: Generatively Benchmarking Spatial Reasoning for Human-Machine Differentiation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: CAPTCHA, multimodal models, spatial reasoning, robustness, evaluation benchmark
Abstract:

Online services rely on CAPTCHAs as a first line of defense against automated abuse, yet recent advances in multimodal large language models (MLLMs) have eroded the effectiveness of conventional designs that focus on text recognition or 2D image understanding. To address this challenge, we present Spatial CAPTCHA, a novel human-verification framework that leverages fundamental differences in spatial reasoning between humans and MLLMs. Unlike existing CAPTCHAs that rely on low-level perception tasks vulnerable to modern AI, Spatial CAPTCHA generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation: skills intuitive for humans but difficult for current AI systems. The system employs a procedural generation pipeline with constraint-based difficulty control, automated correctness verification, and human-in-the-loop validation to ensure scalability, robustness, and adaptability. Evaluation on a corresponding benchmark, Spatial-CAPTCHA-Bench, demonstrates that humans vastly outperform 10 state-of-the-art MLLMs, with the best model achieving only 31.0% Pass@1 accuracy. A further comparison with Google reCAPTCHA confirms the effectiveness of Spatial CAPTCHA as both a security mechanism and a diagnostic tool for spatial reasoning in AI.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Spatial CAPTCHA, a human-verification framework exploiting spatial reasoning gaps between humans and MLLMs. It resides in the 'Spatial CAPTCHA and Verification Systems' leaf under 'Human-Machine Differentiation and Security Applications', where it is currently the sole paper. This isolation suggests the work occupies a sparse, emerging research direction within the broader spatial reasoning landscape, which comprises 36 papers across diverse benchmarks, methods, and application domains. The taxonomy reveals that while spatial reasoning evaluation is well-populated, security-oriented applications leveraging these gaps remain underexplored.

The taxonomy tree shows neighboring leaves include 'Human-Machine Performance Comparison Studies' (2 papers) and 'Object Authenticity and Visual Discrimination' (1 paper), both focused on empirical performance analysis rather than security applications. Broader sibling branches address 'Spatial Reasoning Benchmarks' (13 papers across 4 leaves) and 'Methods for Enhancing Spatial Reasoning' (5 papers across 3 leaves). The original paper diverges from these directions by applying observed spatial reasoning deficits to practical verification tasks, rather than benchmarking capabilities or improving model performance. Its scope note explicitly excludes general performance comparisons and object classification, positioning it as a security-focused application distinct from adjacent evaluation-centric work.

Among the 23 candidates examined, none clearly refutes the three core contributions. For the 'Spatial CAPTCHA framework' contribution, 10 candidates were examined with zero refutable overlaps; for the 'Procedural generation pipeline', 3 candidates, with similar results; and for 'Spatial-CAPTCHA-Bench', 10 candidates, none of which provides an overlapping benchmark. Within the top-K semantic matches and citation expansions analyzed, then, no existing work combines spatial reasoning challenges with CAPTCHA-style verification and automated generation pipelines. However, the analysis does not claim exhaustive coverage of all possible prior art in the security or spatial reasoning domains.

Given the limited search scope of 23 candidates, the work appears novel in its specific application of spatial reasoning to human verification. The taxonomy context reinforces this impression: the leaf contains only this paper, and adjacent leaves focus on performance analysis rather than security. However, the analysis cannot rule out relevant prior work in broader CAPTCHA literature, adversarial robustness, or cognitive security domains not captured by the semantic search. The novelty assessment is thus conditional on the examined candidate set and the taxonomy's coverage of spatial reasoning research.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 0

Research Landscape Overview

Core task: spatial reasoning for human-machine differentiation. The field encompasses diverse branches that collectively address how AI systems understand and manipulate spatial information, and how these capabilities can distinguish human from machine cognition. The taxonomy reveals several major directions: benchmarks and evaluation frameworks that systematically test spatial abilities across modalities; methods for enhancing spatial reasoning through architectural innovations, training strategies, and cognitive-inspired approaches; human-machine differentiation and security applications that exploit spatial reasoning gaps; application domains spanning robotics, autonomous driving, and interactive systems; cognitive foundations drawing from developmental psychology and neuroscience; and modality-specific challenges in vision, language, and embodied settings.

Representative works such as Visual Spatial Reasoning[1] and What's Up VLMs[2] illustrate benchmark development, while Picture Worth Spatial[3] and TopViewRS[4] explore domain-specific spatial understanding. Recent activity highlights contrasting themes: some lines pursue comprehensive benchmarking of vision-language models on spatial tasks (e.g., 11plus bench[5], Sphere Blind Spots[6]), revealing persistent weaknesses in rotation, orientation, and relational reasoning, while others investigate cognitive and theoretical underpinnings (Reframing Spatial Reasoning[9], Mind Meets Space[10]) to inform better architectures.

The original paper, Spatial CAPTCHA[0], sits squarely within the human-machine differentiation branch alongside Vision CAPTCHA Reasoning[22], leveraging spatial reasoning challenges as verification mechanisms. Unlike broader benchmarks that assess general spatial competence, this work focuses on exploiting the gap between human intuitive spatial processing and current AI limitations for security purposes. This positions it as a practical application of observed weaknesses, contrasting with efforts such as CoT Spatial Reasoning[16] and Multimodal Spatial Struggle[25] that aim to close the performance gap through improved reasoning strategies.

Claimed Contributions

Spatial CAPTCHA framework for human-machine differentiation

The authors introduce a new CAPTCHA system that exploits the gap between human and machine spatial reasoning capabilities. The framework generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation—skills intuitive for humans but challenging for current AI systems.

10 retrieved papers

Procedural generation pipeline with constraint-based difficulty control

The authors develop an autonomous pipeline that can generate unlimited CAPTCHA instances with controlled difficulty levels. The system includes mechanisms for automated correctness verification and human validation to ensure the generated challenges are both scalable and robust for real-world deployment.

3 retrieved papers

Spatial-CAPTCHA-Bench benchmark dataset

The authors create a benchmark comprising 1050 instances across seven task formulations and four spatial-ability categories, stratified into three difficulty levels. This benchmark enables standardized offline evaluation of both human and machine spatial reasoning capabilities.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Spatial CAPTCHA framework for human-machine differentiation

The authors introduce a new CAPTCHA system that exploits the gap between human and machine spatial reasoning capabilities. The framework generates dynamic questions requiring geometric reasoning, perspective-taking, occlusion handling, and mental rotation—skills intuitive for humans but challenging for current AI systems.
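To make the "mental rotation" claim concrete, the following is a minimal, self-contained sketch of how such a challenge could be generated and machine-verified. It is our own toy construction, not the paper's implementation: the target is a chiral tetracube (a 3-D polycube whose mirror image cannot be reached by rotation), the correct option is a random rotation of it, and the distractors are rotations of its mirror image, so exactly one option is correct by construction.

```python
import itertools
import random

def proper_rotations():
    """All 24 rotation matrices of the cube, as 3x3 tuples (determinant +1)."""
    mats = []
    even = ((0, 1, 2), (1, 2, 0), (2, 0, 1))  # even permutations of the axes
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product((1, -1), repeat=3):
            # det of a signed permutation matrix = parity * product of signs
            parity = 1 if perm in even else -1
            if parity * signs[0] * signs[1] * signs[2] != 1:
                continue  # skip reflections
            m = [[0] * 3 for _ in range(3)]
            for row in range(3):
                m[row][perm[row]] = signs[row]
            mats.append(tuple(map(tuple, m)))
    return mats

def apply(r, cell):
    """Apply rotation matrix r to an integer coordinate triple."""
    return tuple(sum(r[i][j] * cell[j] for j in range(3)) for i in range(3))

def canon(cells):
    """Translate a polycube so its minimum corner sits at the origin."""
    cells = list(cells)
    mins = tuple(min(c[i] for c in cells) for i in range(3))
    return frozenset(tuple(c[i] - mins[i] for i in range(3)) for c in cells)

def is_rotation_of(a, b):
    """True if polycube b is a proper rotation (plus translation) of a."""
    cb = canon(b)
    return any(canon(apply(r, c) for c in a) == cb for r in proper_rotations())

def make_item(rng):
    """One multiple-choice item: exactly one option is a rotation of target."""
    target = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}  # chiral tetracube
    mirror = {(-x, y, z) for x, y, z in target}             # its mirror image
    rots = proper_rotations()
    def spin(cells):
        r = rng.choice(rots)
        return canon(apply(r, c) for c in cells)
    options = [spin(mirror), spin(mirror), spin(target)]
    rng.shuffle(options)
    answer = next(i for i, o in enumerate(options) if is_rotation_of(target, o))
    return target, options, answer
```

A deployed system would render the target and options as 3-D scenes for the user; here they remain coordinate sets so correctness can be checked automatically, which is the property the framework's automated verification step relies on.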

Contribution

Procedural generation pipeline with constraint-based difficulty control

The authors develop an autonomous pipeline that can generate unlimited CAPTCHA instances with controlled difficulty levels. The system includes mechanisms for automated correctness verification and human validation to ensure the generated challenges are both scalable and robust for real-world deployment.
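The generate-and-verify loop described above can be illustrated with a small sketch. Everything below (the perspective-taking toy task, the difficulty parameters, and the function names) is our own assumption, not the paper's pipeline: item parameters are rejection-sampled until they satisfy a difficulty-indexed constraint set, the ground-truth answer is derived analytically, and a verification pass re-checks the answer before the item is emitted.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Difficulty:
    grid: int      # side length of the layout grid
    min_dist: int  # minimum observer-object distance (larger = harder)

# Hypothetical difficulty levels; the paper's actual constraints are not given.
LEVELS = {"easy": Difficulty(grid=5, min_dist=1),
          "medium": Difficulty(grid=9, min_dist=3),
          "hard": Difficulty(grid=15, min_dist=6)}

HEADINGS = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

def relative_side(observer, heading, obj):
    """Classify obj as front/back/left/right of an axis-aligned observer."""
    hx, hy = HEADINGS[heading]
    dx, dy = obj[0] - observer[0], obj[1] - observer[1]
    forward = dx * hx + dy * hy   # projection onto the heading
    sideways = dx * hy - dy * hx  # signed cross product (positive = right)
    if abs(forward) > abs(sideways):
        return "front" if forward > 0 else "back"
    return "right" if sideways > 0 else "left"

def make_item(level, rng):
    """Rejection-sample until all constraints for this difficulty hold."""
    cfg = LEVELS[level]
    while True:
        obs = (rng.randrange(cfg.grid), rng.randrange(cfg.grid))
        obj = (rng.randrange(cfg.grid), rng.randrange(cfg.grid))
        dx, dy = obj[0] - obs[0], obj[1] - obs[1]
        if abs(dx) + abs(dy) < cfg.min_dist:
            continue  # too close for this difficulty level
        if abs(dx) == abs(dy):
            continue  # diagonal placements make the answer ambiguous
        heading = rng.choice(sorted(HEADINGS))
        return {"observer": obs, "heading": heading, "object": obj,
                "answer": relative_side(obs, heading, obj)}

def verify(item):
    """Automated correctness check run before an item is ever served."""
    return item["answer"] == relative_side(
        item["observer"], item["heading"], item["object"])
```

In a full pipeline, items passing `verify` would still be sampled into a human-validation queue, matching the human-in-the-loop step the report attributes to the system; that stage is omitted here.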

Contribution

Spatial-CAPTCHA-Bench benchmark dataset

The authors create a benchmark comprising 1050 instances across seven task formulations and four spatial-ability categories, stratified into three difficulty levels. This benchmark enables standardized offline evaluation of both human and machine spatial reasoning capabilities.
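The benchmark's stated structure (1050 instances, seven task formulations, four spatial-ability categories, three difficulty levels) and the Pass@1 metric from the abstract can be sketched as follows. The allocation across strata is not specified in the report, so this sketch simply spreads items round-robin over the 7 × 4 × 3 = 84 strata; the category names are taken from the abstract, while the formulation labels and function names are placeholders of ours.

```python
from collections import defaultdict
from itertools import cycle, product

FORMULATIONS = [f"task_{i}" for i in range(7)]   # placeholder labels
ABILITIES = ["geometric", "perspective", "occlusion", "mental_rotation"]
DIFFICULTY = ["easy", "medium", "hard"]

def build_manifest(n_items=1050):
    """Spread n_items round-robin over all formulation/ability/difficulty strata."""
    strata = cycle(product(FORMULATIONS, ABILITIES, DIFFICULTY))
    return [{"id": i, "formulation": f, "ability": a, "difficulty": d}
            for i, (f, a, d) in zip(range(n_items), strata)]

def pass_at_1(results, key="difficulty"):
    """Pass@1 per group: fraction of items answered correctly on the first try.

    results: iterable of (item, first_attempt_correct: bool) pairs.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for item, correct in results:
        totals[item[key]] += 1
        hits[item[key]] += int(correct)
    return {k: hits[k] / totals[k] for k in totals}
```

Grouping by `"ability"` or `"formulation"` instead of `"difficulty"` gives the other breakdowns a benchmark report would typically include; the headline figure (e.g., the 31.0% best-model Pass@1 cited in the abstract) corresponds to the ungrouped average.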