Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: 3D Semantic Segmentation, Mixture of Experts, Point Cloud Understanding
Abstract:

While massively scaling both data and models has become central in NLP and 2D vision, its benefits for 3D point cloud understanding remain limited. We study the initial step of 3D point cloud scaling under a realistic regime: large-scale multi-dataset joint training for 3D semantic segmentation, with no dataset labels available at inference time. Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard backbones. We introduce Point-MoE, a Mixture-of-Experts design that expands capacity through sparsely activated expert MLPs and a lightweight top-k router, allowing tokens to select specialized experts without requiring dataset supervision. Trained jointly on a diverse mix of indoor and outdoor datasets and evaluated on seen datasets and in zero-shot settings, Point-MoE outperforms prior methods without using dataset labels for either training or inference. This outlines a scalable path for 3D perception: letting the model discover structure in heterogeneous 3D data rather than imposing it via manual curation or dataset-specific heuristics.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Point-MoE, a Mixture-of-Experts architecture for large-scale multi-dataset 3D semantic segmentation that operates without dataset labels at inference. It resides in the Cross-Dataset Label Harmonization and Taxonomy Alignment leaf, which contains five papers addressing label-space conflicts across heterogeneous datasets. This leaf sits within the broader Multi-Dataset Integration and Domain Adaptation branch, indicating a moderately populated research direction focused on reconciling inconsistent taxonomies and sensor modalities. The taxonomy shows this is an active but not overcrowded area, with sibling papers exploring unified taxonomies and hierarchical mappings.

The taxonomy reveals neighboring research directions that contextualize Point-MoE's positioning. The Unsupervised and Semi-Supervised Domain Adaptation leaf (six papers) addresses similar heterogeneity challenges through unlabeled data, while Multi-Task and Multi-Domain Unified Architectures (three papers) explore shared-parameter models across tasks. The Vision-Language and Open-Vocabulary Segmentation leaf (four papers) offers an alternative approach to label alignment via textual semantics. Point-MoE diverges from these by using sparsely activated expert routing rather than explicit taxonomy engineering or language grounding, suggesting a distinct methodological stance within the broader multi-dataset integration landscape.

Among the thirty candidates examined, none clearly refutes the three core contributions. Ten candidates were compared against each contribution (the Point-MoE architecture, the multi-dataset training protocol, and the MoE design-space exploration), with zero refutable matches in every case. Within this limited search scope, no prior work among the top thirty semantic matches combines mixture-of-experts routing with dataset-agnostic multi-dataset 3D segmentation in the same manner. However, the analysis does not cover the full literature: sibling papers like MSeg3D and Label Name Mantra address overlapping problems through different mechanisms, and the search may not capture all relevant MoE or multi-dataset work.

Based on the limited search of thirty candidates, Point-MoE appears to occupy a relatively novel position by applying sparse expert routing to multi-dataset 3D segmentation without dataset supervision. The taxonomy context shows this sits in a moderately active research area with established sibling approaches, but the specific combination of MoE architecture and dataset-agnostic inference has not been clearly anticipated in the examined literature. A more exhaustive search beyond top-thirty semantic matches would be needed to assess whether similar MoE-based multi-dataset strategies exist in adjacent communities or application domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: large-scale multi-dataset joint training for 3D semantic segmentation. The field has evolved to address the challenge of leveraging diverse 3D datasets—each with distinct sensor modalities, annotation styles, and label taxonomies—to build more robust and generalizable segmentation models.

The taxonomy reveals several complementary research directions: Multi-Dataset Integration and Domain Adaptation focuses on harmonizing heterogeneous data sources and bridging domain gaps, often through cross-modal learning (e.g., xmuda[4], Cross-Modal Contrastive Domain[5]) or label alignment strategies (MSeg[12], MSeg3D[16]). Multi-Modal Fusion explores how to combine 2D imagery with 3D point clouds (Joint 2D-3D Weakly Supervised[1], Multi-View Aggregation Wild[2]), while Large-Scale Representation Learning emphasizes pre-training on massive corpora to capture transferable features (Point Transformer V3[3]). Joint Instance and Semantic Segmentation tackles the interplay between object-level and point-level predictions (JSIS3D[22]), and Specialized Application Domains target medical imaging, autonomous driving, and other verticals with domain-specific constraints.

A central tension across these branches is how to reconcile inconsistent label spaces without expensive re-annotation: some works propose unified taxonomies or hierarchical mappings (Cross-Dataset Collaborative Learning[8], Heterogeneous Datasets Training[32]), while others exploit weak supervision or language-driven alignment (Label Name Mantra[45]). Point-MoE[0] sits within the Cross-Dataset Label Harmonization and Taxonomy Alignment cluster, emphasizing efficient mixture-of-experts architectures to handle label heterogeneity at scale. Compared to MSeg[12], which unifies 2D image datasets via a common taxonomy, Point-MoE[0] extends this philosophy to 3D point clouds with a focus on computational efficiency. Meanwhile, Label Name Mantra[45] leverages textual semantics for zero-shot transfer, offering a complementary angle on the same alignment problem. These contrasting strategies highlight an open question: whether explicit taxonomy engineering, learned routing mechanisms, or language grounding will prove most effective for truly large-scale multi-dataset 3D segmentation.

Claimed Contributions

Point-MoE architecture for multi-dataset 3D semantic segmentation

The authors propose Point-MoE, a sparse mixture-of-experts architecture built on Point Transformer V3 that replaces attention projection layers with expert MLPs and a router. This design enables dynamic expert specialization across heterogeneous 3D datasets without using dataset labels during training or inference.

10 retrieved papers
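To make the routing mechanism concrete, below is a minimal NumPy sketch of generic top-k sparse expert routing. The shapes, the single-linear "expert" transform, and the softmax-over-selected-gates convention are illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

def top_k_moe(tokens, expert_weights, router_weights, k=2):
    """Sparse mixture-of-experts forward pass (illustrative sketch).

    tokens:         (n, d) point-token features
    expert_weights: (E, d, d) one linear transform per expert (a stand-in
                    for the expert MLPs described in the text)
    router_weights: (d, E) lightweight linear router
    """
    # Router scores each token against every expert.
    logits = tokens @ router_weights                       # (n, E)
    # Keep only the top-k experts per token; softmax over the kept gates.
    top_idx = np.argsort(-logits, axis=1)[:, :k]           # (n, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)              # (n, k)
    # Each token's output is the gate-weighted sum of its selected experts;
    # unselected experts are never evaluated (the sparsity that saves compute).
    out = np.zeros_like(tokens)
    for slot in range(k):
        for e in range(expert_weights.shape[0]):
            mask = top_idx[:, slot] == e                   # tokens routed to expert e
            if mask.any():
                out[mask] += gates[mask, slot:slot + 1] * (tokens[mask] @ expert_weights[e])
    return out, top_idx
```

Because routing depends only on token features, specialization can emerge per token without any dataset identifier, which is the property the contribution emphasizes.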
Multi-dataset training protocol without dataset labels at inference

The authors establish a realistic training and evaluation regime for large-scale multi-dataset joint training in 3D semantic segmentation where no dataset labels are available at inference time. This protocol enables fair comparison across diverse indoor and outdoor datasets in both seen and zero-shot settings.

10 retrieved papers
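One common way to realize such a protocol is to train over the union of all datasets' label spaces, so the model needs no dataset identifier at test time. The sketch below uses hypothetical class lists; the paper's actual datasets and taxonomy handling may differ.

```python
# Hypothetical per-dataset class lists (not the paper's actual taxonomies).
datasets = {
    "indoor_a":  ["wall", "floor", "chair", "table"],
    "indoor_b":  ["wall", "floor", "sofa"],
    "outdoor_a": ["road", "car", "vegetation"],
}

# One shared label space: the model predicts over the union of all classes,
# so no dataset label is needed at inference time.
union = sorted({c for classes in datasets.values() for c in classes})
class_to_id = {c: i for i, c in enumerate(union)}

# During training, a per-dataset mask restricts the loss to the classes that
# a given dataset actually annotates, avoiding false penalties on the rest.
loss_masks = {
    name: [class_to_id[c] for c in classes]
    for name, classes in datasets.items()
}
```

At inference, any point cloud is scored against the full union vocabulary, which is what makes zero-shot evaluation on unseen datasets well-defined.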
Systematic exploration of MoE design space for 3D point clouds

The authors conduct comprehensive ablation studies examining key MoE design choices including number of experts, sparsity level, placement within the architecture, normalization strategies, and training configurations. These experiments reveal effective configurations and trade-offs specific to 3D point cloud understanding.

10 retrieved papers
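An ablation over these axes amounts to sweeping a Cartesian product of design choices. The grid below is a hypothetical illustration of how such a design space might be enumerated; the axis values are assumptions, not the paper's actual settings.

```python
from itertools import product

# Hypothetical design-space grid mirroring the axes named above.
design_space = {
    "num_experts":   [4, 8, 16],
    "top_k":         [1, 2],
    "placement":     ["attention_proj", "ffn"],
    "normalization": ["pre_norm", "post_norm"],
}

# Enumerate every combination as one candidate training configuration.
configs = [dict(zip(design_space, combo))
           for combo in product(*design_space.values())]
# 3 * 2 * 2 * 2 = 24 configurations to ablate.
```

In practice such sweeps are usually pruned (e.g., fixing all but one axis at a time), since the full product grows multiplicatively with each added design choice.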

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
