Image Quality Assessment for Embodied AI

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Image Quality Assessment; Image Processing; Perceptual Quality; Embodied AI
Abstract:

Embodied AI has developed rapidly in recent years, but it is still deployed mainly in laboratories, as the various distortions encountered in the real world limit its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, no IQA method assesses the usability of an image in embodied tasks, that is, its perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic of IQA for Embodied AI. Specifically, we (1) construct a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and define a comprehensive subjective score collection process; (2) establish the Embodied-IQA database, containing over 30,000 reference/distorted image pairs with more than 5 million fine-grained annotations provided by Vision-Language Models, Vision-Language-Action models, and real-world robots; (3) train and validate mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that this evaluation can promote the application of Embodied AI under complex real-world distortions.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a perception-cognition-decision-execution pipeline for assessing image quality in embodied AI contexts, establishes the Embodied-IQA database with over 30,000 image pairs and 5 million annotations from vision-language models and real robots, and benchmarks mainstream IQA methods on this data. Within the taxonomy, it resides in the 'Embodied-Specific Quality Assessment' leaf under 'Quality Assessment Frameworks and Benchmarks', alongside three sibling papers. This leaf represents a relatively sparse research direction within a 50-paper taxonomy spanning 23 leaf nodes, suggesting the work addresses an emerging rather than saturated area.

The taxonomy reveals that quality assessment for embodied AI sits at the intersection of multiple research streams. Neighboring leaves include 'World Model and Generative Content Evaluation' (assessing scene quality and physical plausibility in generative systems) and 'General Visual Quality Assessment' (broader multimedia quality metrics). The paper's focus on robot-centric usability distinguishes it from general visual quality work, while its emphasis on task-driven metrics connects to navigation and manipulation branches. The taxonomy's scope notes clarify that embodied-specific quality assessment excludes general multimedia metrics, positioning this work as bridging perceptual quality and downstream task performance.

Among the 24 candidates examined across the three contributions, the analysis found limited overlap with prior work. The perception-cognition-decision-execution pipeline was compared against 10 candidates, yielding one potential refutation; the database construction against 4 candidates, yielding one; and the benchmark evaluation against 10 candidates, yielding two. These statistics suggest that, within the top-24 semantic matches, most contributions appear relatively novel, though the search scope is modest. The pipeline and database contributions show particularly sparse prior work, while the benchmarking component encounters slightly more existing evaluation efforts.

Based on this limited search of 24 candidates, the work appears to occupy a relatively underexplored niche at the intersection of image quality assessment and embodied task performance. The sparse sibling count and low refutation rates suggest novelty, though the analysis is not an exhaustive literature review and does not cover domain-specific venues. The taxonomy structure indicates that this is an emerging research direction rather than a mature subfield, consistent with the observed scarcity of directly comparable prior work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 4

Research Landscape Overview

Core task: Image quality assessment for embodied artificial intelligence tasks. The field spans a diverse set of challenges, from developing quality metrics and benchmarks tailored to embodied settings, to building robust visual perception systems, navigation and manipulation capabilities, scene reconstruction methods, generative world models, and comprehensive simulation platforms. At the top level, the taxonomy organizes work into eight major branches:

- Quality Assessment Frameworks and Benchmarks: metrics and evaluation protocols specific to embodied contexts (e.g., Perceptual Quality Embodied[1], Embodied Image Quality[12]);
- Visual Perception and Representation Learning: how agents encode and interpret visual input (e.g., Artificial Visual Cortex[3], Visual Embedding Distillation[6]);
- Embodied Navigation and Spatial Reasoning: goal-driven movement and spatial understanding (e.g., Objectnav Revisited[5], Omnidirectional Spatial Reasoning[18]);
- Manipulation and Interaction: physical interaction with objects (e.g., PerTouch[39], TextToucher[25]);
- Scene Reconstruction and 3D Representation: building spatial models (e.g., Lightweight Gaussian Splatting[22], OGGSplat[46]);
- Generative Models and World Simulation: predictive and generative approaches (e.g., Generative Physical AI[17], World in World[19]);
- Simulation Platforms and Datasets: testbeds and data resources (e.g., Habitat Matterport[9], Ewmbench[2]);
- System Integration and Applications: bringing these components together in real-world deployments (e.g., Multimodal Indoor Robotics[4], Embodied AI Vehicular[15]).

A particularly active line of work centers on defining and measuring perceptual quality in ways that align with embodied task performance, contrasting traditional image quality metrics with task-driven assessments.
Image Quality Embodied AI[0] sits squarely within the Quality Assessment Frameworks and Benchmarks branch, specifically under Embodied-Specific Quality Assessment, where it joins efforts like Perceptual Quality Embodied[1] and RGC-VQA[49] in developing metrics that account for agent-centric visual demands. While Perceptual Quality Embodied[1] emphasizes human-aligned perceptual measures, Image Quality Embodied AI[0] appears to focus more directly on how image degradation affects downstream embodied task success, bridging quality assessment with navigation and manipulation outcomes. This contrasts with broader embodied AI surveys (e.g., Embodied AI Survey[14]) that catalog task types without deep dives into quality metrics, and with works like Embodied Image Compression[41] that optimize compression for embodied scenarios. The central tension across these branches involves balancing perceptual fidelity, computational efficiency, and task-specific relevance—questions that remain open as embodied systems scale to more complex, real-world environments.

Claimed Contributions

Perception-cognition-decision-execution pipeline for Embodied AI quality assessment

The authors develop a theoretical framework grounded in Mertonian systems and meta-cognitive theory that structures Embodied AI evaluation into four stages: perception, cognition, decision, and execution. This pipeline defines how to collect quality scores for robotic tasks.

10 retrieved papers
Can Refute
Embodied-IQA database with multi-stage annotations

The authors create a large-scale database of reference and distorted image pairs for embodied tasks, annotated by VLMs, VLAs, and real robots. This resource provides fine-grained labels across cognition, decision, and execution stages to support quality metric development.

4 retrieved papers
Can Refute
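One way to picture the multi-stage, multi-annotator labels this contribution describes is as a record per reference/distorted pair. The field names below are hypothetical illustrations, not the actual Embodied-IQA schema:

```python
from dataclasses import dataclass

# Hypothetical record layout for one annotated image pair; field names
# are illustrative and do not reflect the Embodied-IQA database schema.
@dataclass
class EmbodiedAnnotation:
    pair_id: str             # reference/distorted image pair identifier
    distortion: str          # e.g. "motion_blur", "low_light"
    annotator: str           # "VLM", "VLA", or "robot"
    cognition_score: float   # can the agent still understand the scene?
    decision_score: float    # does the distortion change the chosen action?
    execution_success: bool  # did the real-world robot complete the task?

a = EmbodiedAnnotation("pair_00001", "motion_blur", "robot", 0.72, 0.65, True)
print(a.annotator, a.execution_success)
```

Grouping such records by stage (cognition/decision/execution) would be one way to support the fine-grained quality-metric development the report attributes to the database.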
Benchmark evaluation of IQA methods for Embodied AI

The authors evaluate 15 existing IQA methods on their Embodied-IQA database, showing that current approaches are insufficient for robotic perception tasks. They also conduct real-world robot experiments to reveal connections among cognition, decision, and execution.

10 retrieved papers
Can Refute
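Benchmarks of this kind conventionally report rank correlation (SRCC) and linear correlation (PLCC) between a method's predicted scores and the ground-truth annotations. A minimal, dependency-free sketch of those two metrics follows; the score lists are illustrative placeholders, not results from the paper:

```python
def rankdata(values):
    """Assign 1-based average ranks to values (ties share the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """PLCC: Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def srcc(pred, gt):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(rankdata(pred), rankdata(gt))

# Illustrative only: predicted quality vs. hypothetical embodied
# usability annotations for five distorted images.
pred = [0.91, 0.40, 0.77, 0.15, 0.60]
gt = [0.88, 0.35, 0.80, 0.20, 0.55]
print(f"SRCC={srcc(pred, gt):.3f}  PLCC={pearson(pred, gt):.3f}")
```

A finding that current IQA methods are "insufficient" would typically correspond to low SRCC/PLCC values against the embodied annotations, in contrast to the high values these methods achieve on human-preference databases.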

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Perception-cognition-decision-execution pipeline for Embodied AI quality assessment

The authors develop a theoretical framework grounded in Mertonian systems and meta-cognitive theory that structures Embodied AI evaluation into four stages: perception, cognition, decision, and execution. This pipeline defines how to collect quality scores for robotic tasks.

Contribution

Embodied-IQA database with multi-stage annotations

The authors create a large-scale database of reference and distorted image pairs for embodied tasks, annotated by VLMs, VLAs, and real robots. This resource provides fine-grained labels across cognition, decision, and execution stages to support quality metric development.

Contribution

Benchmark evaluation of IQA methods for Embodied AI

The authors evaluate 15 existing IQA methods on their Embodied-IQA database, showing that current approaches are insufficient for robotic perception tasks. They also conduct real-world robot experiments to reveal connections among cognition, decision, and execution.