L4Dog: Towards BEV Perception for Quadruped Robots in Complex Urban Scenes

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: quadruped robot perception; dataset and benchmark
Abstract:

Embodied intelligence in quadruped robots faces significant challenges in complex urban environments due to the limitations of traditional perception systems and the lack of comprehensive datasets for exteroceptive 3D perception. To address this, we introduce L4Dog, the first large-scale exteroceptive 3D perception dataset tailored for quadruped robots in open urban scenarios. L4Dog provides high-quality 360-degree surround-view sensor data and manual annotations, covering diverse urban scenes such as traffic-light intersections, open roads, and subway stations. By formulating perception tasks as bird’s-eye-view (BEV) space perception problems, we establish a multi-benchmark framework for BEV detection, tracking, trajectory prediction, and 3D traversable space occupancy estimation. We propose OmniBEV4D, a baseline method that unifies multi-task perception (detection, tracking, trajectory prediction, and occupancy estimation) through shared temporal BEV features, enabling efficient and robust processing of dynamic urban environments. This work bridges the gap between current research and real-world deployment needs, offering a foundational resource for advancing autonomous navigation and decision-making in complex urban settings. The dataset will be made publicly available upon acceptance of this work.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces L4Dog, a large-scale dataset for quadruped robot perception in urban environments, alongside OmniBEV4D, a unified multi-task framework for detection, tracking, prediction, and occupancy estimation. According to the taxonomy, this work resides in the 'Temporal BEV Feature-Based Multi-Task Learning' leaf under 'Multi-Task BEV Perception Systems'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific combination of quadruped-centric BEV perception with temporal multi-task learning represents a relatively sparse research direction within the examined literature.

The broader taxonomy reveals three main branches: Multi-Task BEV Perception Systems, Sensor Fusion for Dynamic Scene Understanding, and Quadruped Robot Design and Capabilities. The paper's position in the first branch places it adjacent to sensor fusion work (e.g., Camera-LiDAR Fusion with BEV Representations) and hardware-focused studies on disaster response quadrupeds. While the taxonomy includes only three papers total across these branches, the structure indicates that BEV-based multi-task learning for legged robots sits at the intersection of perception algorithms and platform-specific constraints, diverging from purely algorithmic or purely hardware-oriented research.

Among the three contributions analyzed, the comparisons for the L4Dog dataset and the multi-benchmark framework each examined two candidates with zero refutations, suggesting limited direct prior work on quadruped-specific urban BEV datasets. The comparison for the OmniBEV4D framework examined ten candidates and found one refutable match, indicating some overlap with existing multi-task BEV methods. The literature search covered fourteen candidates in total, yielding one refutable pair overall. This limited search scale means the analysis captures top semantic matches but does not exhaustively survey all BEV perception or quadruped navigation literature, leaving open the possibility of additional relevant work beyond the examined set.

Given the sparse taxonomy structure and the modest search scope, the work appears to occupy a niche intersection of quadruped robotics and temporal BEV perception. The dataset contribution shows minimal overlap among examined candidates, while the algorithmic framework has at least one prior method addressing similar multi-task objectives. The analysis reflects top-thirty semantic matches and does not claim comprehensive coverage of all related domains, such as wheeled-robot BEV systems or non-temporal multi-task architectures.

Taxonomy

Core-task Taxonomy Papers: 2
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Paper: 1

Research Landscape Overview

Core task: BEV perception for quadruped robots in complex urban environments. The field structure reflects three main branches that together address the challenge of enabling legged robots to navigate and understand dynamic urban settings. Multi-Task BEV Perception Systems focus on unified frameworks that handle multiple perception objectives, such as object detection, segmentation, and motion forecasting, within a bird's-eye-view representation, often leveraging temporal features to capture scene dynamics over time. Sensor Fusion for Dynamic Scene Understanding emphasizes integrating data from heterogeneous sensors (cameras, LiDAR, IMUs) to build robust spatial models that can handle occlusions, lighting variations, and moving agents. Quadruped Robot Design and Capabilities examines the hardware and locomotion strategies that enable these platforms to operate in unstructured terrain, including considerations of payload, stability, and real-time computational constraints. Together, these branches form a cohesive pipeline from raw sensor inputs through perception algorithms to actionable navigation decisions.

Within Multi-Task BEV Perception Systems, a particularly active line of work explores temporal BEV feature aggregation to improve prediction accuracy and consistency across frames, balancing computational efficiency with the need for real-time inference on resource-limited mobile platforms. L4Dog[0] sits squarely in this temporal multi-task cluster, emphasizing how sequential BEV features can be fused to support simultaneous detection and tracking in crowded urban scenes. This contrasts with approaches that prioritize static snapshot perception or rely heavily on offline post-processing.

Meanwhile, the Quadruped Robot Design branch highlights practical deployment challenges, such as those discussed in Robots to the Rescue[1], where mechanical agility and sensor placement directly influence which perception tasks are feasible. The interplay between algorithmic sophistication and hardware constraints remains an open question, as does the trade-off between model complexity and the latency budgets imposed by dynamic environments.
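To ground the temporal-aggregation idea in something concrete, here is a minimal sketch of one common warp-then-fuse pattern, assuming ego-motion is available as a 2D affine transform over the BEV grid. The paper publishes no code, so the module name, shapes, and fusion design below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBEVFusion(nn.Module):
    """Hypothetical temporal BEV fusion: warp the previous frame's BEV
    feature map into the current ego frame, then fuse with the current
    features. Shapes and design are illustrative assumptions only."""

    def __init__(self, channels: int = 128):
        super().__init__()
        # 1x1 conv fuses the concatenated (current, warped-previous) maps
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, bev_curr, bev_prev, ego_motion):
        # bev_curr, bev_prev: (B, C, H, W) BEV feature maps
        # ego_motion: (B, 2, 3) affine transform from the previous to the
        # current ego frame, in normalized BEV grid coordinates
        grid = F.affine_grid(ego_motion, bev_prev.shape, align_corners=False)
        bev_prev_warped = F.grid_sample(bev_prev, grid, align_corners=False)
        return self.fuse(torch.cat([bev_curr, bev_prev_warped], dim=1))

# Example: identity ego-motion (robot stationary between frames)
fusion = TemporalBEVFusion(channels=128)
bev_t = torch.randn(1, 128, 200, 200)
bev_t_minus_1 = torch.randn(1, 128, 200, 200)
identity = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
fused = fusion(bev_t, bev_t_minus_1, identity)  # (1, 128, 200, 200)
```

The warp step is what keeps aggregation consistent under ego-motion: without it, features from a moving platform would smear across BEV cells, which is exactly the failure mode temporal methods in this cluster try to avoid.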

Claimed Contributions

L4Dog dataset for quadruped BEV perception in urban scenes

The authors present L4Dog, a large-scale dataset featuring 360-degree surround-view sensor data and manual annotations for quadruped robots operating in complex urban environments such as traffic intersections and subway stations. This is the first dataset to formulate quadruped exteroceptive perception as BEV-space perception tasks.

2 retrieved papers
Multi-benchmark framework for BEV perception tasks

The authors establish comprehensive benchmarks for quadruped robots, including BEV object detection, BEV tracking, and trajectory prediction tasks. These benchmarks are designed to evaluate perception capabilities in complex urban scenarios with dense traffic and pedestrians.

2 retrieved papers
OmniBEV4D multi-task perception framework

The authors propose OmniBEV4D, a unified neural network framework that performs multiple perception tasks (detection, tracking, trajectory prediction, and occupancy estimation) simultaneously by sharing spatiotemporal feature computation. It serves as the baseline method for the L4Dog benchmark tasks; a schematic sketch of this shared-feature, multi-head pattern follows this list.

10 retrieved papers
Can Refute: 1 candidate found refutable
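As a rough illustration of the shared-feature pattern this contribution describes, the sketch below attaches lightweight task heads to a single BEV feature map. The head layouts, channel counts, and output conventions are invented for illustration and should not be read as the actual OmniBEV4D architecture.

```python
import torch
import torch.nn as nn

class MultiTaskBEVHeads(nn.Module):
    """Hypothetical multi-task heads over a shared BEV feature map.
    Head designs and output conventions are invented for illustration;
    this is not the OmniBEV4D architecture."""

    def __init__(self, channels: int = 128, num_classes: int = 10,
                 pred_horizon: int = 6):
        super().__init__()
        # Detection: per-cell class logits plus a 7-dim box regression
        self.det_head = nn.Conv2d(channels, num_classes + 7, 1)
        # Tracking: per-cell embedding for cross-frame association
        self.track_head = nn.Conv2d(channels, 64, 1)
        # Trajectory prediction: per-cell future (x, y) offsets
        self.pred_head = nn.Conv2d(channels, 2 * pred_horizon, 1)
        # Occupancy: per-cell traversable-space logit
        self.occ_head = nn.Conv2d(channels, 1, 1)

    def forward(self, bev):  # bev: (B, C, H, W) shared temporal features
        return {
            "detection": self.det_head(bev),
            "tracking": self.track_head(bev),
            "prediction": self.pred_head(bev),
            "occupancy": self.occ_head(bev),
        }

heads = MultiTaskBEVHeads()
outputs = heads(torch.randn(1, 128, 200, 200))
print({k: tuple(v.shape) for k, v in outputs.items()})
```

The appeal of this pattern, as the contribution summary suggests, is that the expensive spatiotemporal BEV backbone runs once per frame while each additional task costs only a small head, which matters on the compute-limited platforms quadrupeds carry.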

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
