Abstract:

Recent advances in multi-modal large reasoning models (MLRMs) have shown a significant ability to interpret complex visual content. While these models possess impressive reasoning capabilities, they also introduce underexplored privacy risks. In this paper, we identify a novel category of privacy leakage in MLRMs: adversaries can infer sensitive geolocation information, such as users' home addresses or neighborhoods, from user-generated images, including selfies captured in private settings. To formalize and evaluate these risks, we propose a three-level privacy risk framework that categorizes images based on contextual sensitivity and potential for geolocation inference. We further introduce DoxBench, a curated dataset of 500 real-world images reflecting diverse privacy scenarios across six categories. Our evaluation across 13 advanced MLRMs and MLLMs demonstrates that most of these models outperform non-expert humans in geolocation inference and can effectively leak location-related private information, significantly lowering the barrier for adversaries to obtain users' sensitive geolocation information. We further analyze and identify two primary factors contributing to this vulnerability: (1) MLRMs exhibit strong geolocation reasoning capabilities by leveraging visual clues in combination with their internal world knowledge; and (2) MLRMs frequently rely on privacy-related visual clues for inference without any built-in mechanisms to suppress or avoid such usage. To better understand and demonstrate real-world attack feasibility, we propose GeoMiner, a collaborative attack framework that decomposes the prediction process into two stages, clue extraction and reasoning, to improve geolocation performance. Our findings highlight the urgent need to reassess inference-time privacy risks in MLRMs to better protect users' sensitive information.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DoxBench, a 500-image dataset with a three-level privacy risk framework, alongside the ClueMiner and GeoMiner tools for analyzing geolocation inference attacks by multi-modal models. It resides in the Privacy Risk Assessment and Mitigation leaf, which contains five papers in total, a moderately populated niche within the broader 50-paper taxonomy. This leaf focuses specifically on identifying and quantifying privacy threats from geolocation systems, distinguishing it from performance-oriented benchmarking or model-development branches. The work addresses adversarial doxing scenarios, a narrower framing than the general geo-privacy policy or mitigation strategies explored by sibling papers.

The taxonomy reveals that Privacy Risk Assessment sits under Evaluation, Benchmarking, and Privacy Analysis, adjacent to Benchmark Datasets and Comparative Evaluation and Geospatial AI and Trajectory Prediction. Neighboring branches like Multi-Modal Foundation Model Architectures focus on advancing model capabilities (e.g., Gaea, LLMGeo), while Specialized Geolocation Contexts address domain-specific challenges such as disaster response or indoor localization. The paper's emphasis on adversarial exploitation of existing models contrasts with these capability-building efforts, positioning it as a critical counterpoint that examines societal risks rather than technical performance gains. Its scope excludes mitigation mechanisms beyond analysis, per the leaf's exclude_note.

Among 28 candidates examined, none clearly refute the three core contributions. The DoxBench dataset and privacy framework (8 candidates, 0 refutable) appear novel in their focus on real-world doxing scenarios with structured risk categorization. ClueMiner (10 candidates, 0 refutable) and GeoMiner (10 candidates, 0 refutable) show no direct prior work within the limited search scope. Sibling papers like Geolocation Privacy Risks and Granular Privacy Control address related privacy concerns but do not present equivalent datasets or collaborative attack frameworks. The absence of refutable candidates suggests these specific artifacts are new, though the search scale limits certainty about exhaustive prior work.

Based on top-28 semantic matches, the work introduces concrete evaluation artifacts—dataset, risk taxonomy, and attack tools—that fill a gap in adversarial privacy analysis for geolocation models. The limited search scope means undiscovered prior work may exist, particularly in adjacent security or privacy communities outside the core geolocation literature. The contributions appear incremental in concept (privacy risks are known) but novel in execution (structured benchmarks for doxing attacks). Further investigation of broader security venues would strengthen confidence in this assessment.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: geolocation inference from user-generated images using multi-modal models. The field has evolved around several complementary branches. Multi-Modal Foundation Model Architectures and Training focuses on developing and fine-tuning large-scale vision-language models that can reason about geographic cues, as seen in works like Gaea[2] and GeoLocSFT[3]. Multi-Modal Data Fusion and Representation Learning explores how to combine visual features with textual metadata, temporal signals, or other modalities to improve localization accuracy. Specialized Geolocation Contexts and Applications addresses domain-specific challenges such as disaster response (Disaster Geolocalization[5]), news verification (News Photo Geolocation[16]), or street-level positioning (Street-Level Geolocalization[10]). Semantic Geolocation and Address Prediction targets fine-grained outputs like postal addresses or natural-language place descriptions, while Evaluation, Benchmarking, and Privacy Analysis examines both performance metrics and the societal risks of increasingly powerful geolocation systems.

Recent work highlights a tension between advancing model capabilities and mitigating privacy harms. On one hand, studies like Omnigeo[17] and LLMGeo[31] push the frontier of what multi-modal models can infer from minimal visual clues. On the other hand, privacy-focused research investigates how easily such models can be exploited for malicious purposes. Doxing via Lens[0] sits squarely within the Privacy Risk Assessment and Mitigation cluster, examining adversarial scenarios where geolocation inference threatens individual anonymity. It shares thematic concerns with Geolocation Privacy Risks[8] and Granular Privacy Control[14], which also explore safeguards and threat models, yet differs in its emphasis on real-world doxing attacks rather than broader policy frameworks. This line of inquiry underscores an urgent open question: how to balance the utility of geolocation models in legitimate applications against their potential for abuse.

Claimed Contributions

DoxBench dataset and three-level privacy risk framework

The authors introduce a three-level privacy risk framework (individual risk, household risk, or both), grounded in GDPR and CCPA regulations, to categorize the privacy risks an image poses. They also construct DoxBench, a benchmark dataset of 500 high-resolution images from California representing diverse privacy scenarios across six categories, used to evaluate location-related privacy leakage.

8 retrieved papers
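The three risk levels lend themselves to a small data model. The sketch below is illustrative only: the field names, the sample record, and the `"selfie"` category label are assumptions, since the report does not give DoxBench's exact schema or the six category names.

```python
from dataclasses import dataclass
from enum import Enum

class RiskLevel(Enum):
    """Three-level privacy risk framework described in the paper."""
    INDIVIDUAL = "individual"  # risk to a person (e.g., identity in a selfie)
    HOUSEHOLD = "household"    # risk to a home address or neighborhood
    BOTH = "both"              # image exposes both individual and household

@dataclass
class DoxBenchSample:
    image_path: str      # hypothetical field name
    category: str        # one of six scenario categories (names unspecified)
    risk_level: RiskLevel

# Hypothetical record: a selfie revealing both the person and their home area.
sample = DoxBenchSample("images/0001.jpg", "selfie", RiskLevel.BOTH)
```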
ClueMiner analysis tool

The authors develop ClueMiner, a test-time adaptation algorithm that iteratively derives unified semantic categories of visual clues from models' unstructured reasoning outputs. This tool reveals that MLRMs frequently rely on privacy-sensitive visual clues without built-in mechanisms to suppress such usage.

10 retrieved papers
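The iterative category-derivation step can be illustrated with a minimal sketch. The paper's actual algorithm is not detailed in this report and likely uses an LLM to merge clue phrases; this stand-in uses plain string similarity, and the function name and threshold are assumptions.

```python
from difflib import SequenceMatcher

def merge_into_categories(clues, categories, threshold=0.8):
    """Fold newly extracted clue phrases into a running set of semantic
    categories: attach each clue to its most similar existing category,
    or open a new category when nothing is similar enough."""
    for clue in clues:
        key = clue.lower().strip()
        best = max(categories,
                   key=lambda c: SequenceMatcher(None, key, c).ratio(),
                   default=None)
        if best and SequenceMatcher(None, key, best).ratio() >= threshold:
            categories[best].append(clue)
        else:
            categories[key] = [clue]
    return categories

# Iterate over clues extracted from successive reasoning traces.
categories = {}
for trace in [["street sign", "Street signs"], ["house number", "license plate"]]:
    categories = merge_into_categories(trace, categories)
# "Street signs" merges into "street sign"; the other clues open new categories.
```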
GeoMiner collaborative attack framework

The authors propose GeoMiner, a two-stage framework simulating realistic adversarial scenarios in which a Detector MLLM extracts visual clues and an Analyzer MLLM uses them for geolocation inference. The framework demonstrates how attackers can amplify location-related privacy leakage by providing contextual hints to MLLMs.

10 retrieved papers
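The Detector/Analyzer collaboration can be sketched as a two-call pipeline. Everything here is an illustrative assumption rather than the paper's exact design: the prompts, the function names, and the use of a text description in place of an actual image input.

```python
from typing import Callable

def geominer_pipeline(image_desc: str,
                      detector: Callable[[str], str],
                      analyzer: Callable[[str], str]) -> str:
    """Stage 1: a Detector model enumerates location-revealing visual clues.
    Stage 2: an Analyzer model reasons from those clues to a location guess."""
    clues = detector(
        "List every visual clue that could reveal where this image was "
        f"taken:\n{image_desc}"
    )
    return analyzer(
        "Given these visual clues, infer the most likely location "
        f"(city, neighborhood, or address):\n{clues}"
    )

# Stub models so the sketch runs without any API access.
mock_detector = lambda prompt: "palm trees; 'CA-1' highway sign; beachfront houses"
mock_analyzer = lambda prompt: "Likely coastal California, along Highway 1"
guess = geominer_pipeline("a selfie on a beach boardwalk",
                          mock_detector, mock_analyzer)
```

Separating extraction from reasoning mirrors the report's description of how contextual hints passed between models can amplify leakage.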

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DoxBench dataset and three-level privacy risk framework

The authors introduce a three-level privacy risk framework (individual risk, household risk, or both), grounded in GDPR and CCPA regulations, to categorize the privacy risks an image poses. They also construct DoxBench, a benchmark dataset of 500 high-resolution images from California representing diverse privacy scenarios across six categories, used to evaluate location-related privacy leakage.

Contribution

ClueMiner analysis tool

The authors develop ClueMiner, a test-time adaptation algorithm that iteratively derives unified semantic categories of visual clues from models' unstructured reasoning outputs. This tool reveals that MLRMs frequently rely on privacy-sensitive visual clues without built-in mechanisms to suppress such usage.

Contribution

GeoMiner collaborative attack framework

The authors propose GeoMiner, a two-stage framework simulating realistic adversarial scenarios in which a Detector MLLM extracts visual clues and an Analyzer MLLM uses them for geolocation inference. The framework demonstrates how attackers can amplify location-related privacy leakage by providing contextual hints to MLLMs.