Point-MoE: Large-Scale Multi-Dataset Training with Mixture-of-Experts for 3D Semantic Segmentation

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: 3D Semantic Segmentation, Mixture of Experts, Point Cloud Understanding
Abstract:

While massively scaling both data and models has become central in NLP and 2D vision, its benefits for 3D point cloud understanding remain limited. We study the initial step of 3D point cloud scaling under a realistic regime: large-scale multi-dataset joint training for 3D semantic segmentation, with no dataset labels available at inference time. Point clouds arise from a wide range of sensors (e.g., depth cameras, LiDAR) and scenes (e.g., indoor, outdoor), yielding heterogeneous scanning patterns, sampling densities, and semantic biases; naively mixing such datasets degrades standard backbones. We introduce Point-MoE, a Mixture-of-Experts design that expands capacity through sparsely activated expert MLPs and a lightweight top-k router, allowing tokens to select specialized experts without requiring dataset supervision. Trained jointly on a diverse mix of indoor and outdoor datasets and evaluated on seen datasets and in zero-shot settings, Point-MoE outperforms prior methods without using dataset labels for either training or inference. This outlines a scalable path for 3D perception: letting the model discover structure in heterogeneous 3D data rather than imposing it via manual curation or dataset-specific heuristics.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Point-MoE, a Mixture-of-Experts architecture for large-scale multi-dataset 3D semantic segmentation that operates without dataset labels at inference. It resides in the Cross-Dataset Label Harmonization and Taxonomy Alignment leaf, which contains five papers addressing label-space conflicts across heterogeneous datasets. This leaf sits within the broader Multi-Dataset Integration and Domain Adaptation branch, indicating a moderately populated research direction focused on reconciling inconsistent taxonomies and sensor modalities. The taxonomy shows this is an active but not overcrowded area, with sibling papers exploring unified taxonomies and hierarchical mappings.

The taxonomy reveals neighboring research directions that contextualize Point-MoE's positioning. The Unsupervised and Semi-Supervised Domain Adaptation leaf (six papers) addresses similar heterogeneity challenges through unlabeled data, while Multi-Task and Multi-Domain Unified Architectures (three papers) explore shared-parameter models across tasks. The Vision-Language and Open-Vocabulary Segmentation leaf (four papers) offers an alternative approach to label alignment via textual semantics. Point-MoE diverges from these by using sparsely activated expert routing rather than explicit taxonomy engineering or language grounding, suggesting a distinct methodological stance within the broader multi-dataset integration landscape.

Among the thirty candidates examined, none clearly refutes the three core contributions. Ten candidates were compared against each contribution (the Point-MoE architecture, the multi-dataset training protocol, and the MoE design-space exploration), with zero refutable matches in every case. Within this limited search scope, no prior work among the top thirty semantic matches combines mixture-of-experts routing with dataset-agnostic multi-dataset 3D segmentation in the same manner. However, the analysis does not cover the full literature: sibling papers like MSeg3D and Label Name Mantra address overlapping problems through different mechanisms, and the search may not capture all relevant MoE or multi-dataset work.

Based on the limited search of thirty candidates, Point-MoE appears to occupy a relatively novel position by applying sparse expert routing to multi-dataset 3D segmentation without dataset supervision. The taxonomy context shows this sits in a moderately active research area with established sibling approaches, but the specific combination of MoE architecture and dataset-agnostic inference has not been clearly anticipated in the examined literature. A more exhaustive search beyond top-thirty semantic matches would be needed to assess whether similar MoE-based multi-dataset strategies exist in adjacent communities or application domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: large-scale multi-dataset joint training for 3D semantic segmentation. The field has evolved to address the challenge of leveraging diverse 3D datasets—each with distinct sensor modalities, annotation styles, and label taxonomies—to build more robust and generalizable segmentation models.

The taxonomy reveals several complementary research directions: Multi-Dataset Integration and Domain Adaptation focuses on harmonizing heterogeneous data sources and bridging domain gaps, often through cross-modal learning (e.g., xmuda[4], Cross-Modal Contrastive Domain[5]) or label alignment strategies (MSeg[12], MSeg3D[16]). Multi-Modal Fusion explores how to combine 2D imagery with 3D point clouds (Joint 2D-3D Weakly Supervised[1], Multi-View Aggregation Wild[2]), while Large-Scale Representation Learning emphasizes pre-training on massive corpora to capture transferable features (Point Transformer V3[3]). Joint Instance and Semantic Segmentation tackles the interplay between object-level and point-level predictions (JSIS3D[22]), and Specialized Application Domains target medical imaging, autonomous driving, and other verticals with domain-specific constraints.

A central tension across these branches is how to reconcile inconsistent label spaces without expensive re-annotation: some works propose unified taxonomies or hierarchical mappings (Cross-Dataset Collaborative Learning[8], Heterogeneous Datasets Training[32]), while others exploit weak supervision or language-driven alignment (Label Name Mantra[45]). Point-MoE[0] sits within the Cross-Dataset Label Harmonization and Taxonomy Alignment cluster, emphasizing efficient mixture-of-experts architectures to handle label heterogeneity at scale. Compared to MSeg[12], which unifies 2D image datasets via a common taxonomy, Point-MoE[0] extends this philosophy to 3D point clouds with a focus on computational efficiency. Meanwhile, Label Name Mantra[45] leverages textual semantics for zero-shot transfer, offering a complementary angle on the same alignment problem. These contrasting strategies highlight an open question: whether explicit taxonomy engineering, learned routing mechanisms, or language grounding will prove most effective for truly large-scale multi-dataset 3D segmentation.

Claimed Contributions

Point-MoE architecture for multi-dataset 3D semantic segmentation

The authors propose Point-MoE, a sparse mixture-of-experts architecture built on Point Transformer V3 that replaces attention projection layers with expert MLPs and a router. This design enables dynamic expert specialization across heterogeneous 3D datasets without using dataset labels during training or inference.

10 retrieved papers
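To make the routing mechanism concrete, below is a minimal NumPy sketch of generic top-k sparse expert routing. The shapes, the single-linear "expert" transform, and the softmax-over-selected-gates convention are illustrative assumptions; this is not the paper's implementation.

```python
import numpy as np

def top_k_moe(tokens, expert_weights, router_weights, k=2):
    """Sparse mixture-of-experts forward pass (illustrative sketch).

    tokens:         (n, d) point-token features
    expert_weights: (E, d, d) one linear transform per expert (a stand-in
                    for the expert MLPs described in the text)
    router_weights: (d, E) lightweight linear router
    """
    # Router scores each token against every expert.
    logits = tokens @ router_weights                       # (n, E)
    # Keep only the top-k experts per token; softmax over the kept gates.
    top_idx = np.argsort(-logits, axis=1)[:, :k]           # (n, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    gates = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)              # (n, k)
    # Each token's output is the gate-weighted sum of its selected experts;
    # unselected experts are never evaluated (the sparsity that saves compute).
    out = np.zeros_like(tokens)
    for slot in range(k):
        for e in range(expert_weights.shape[0]):
            mask = top_idx[:, slot] == e                   # tokens routed to expert e
            if mask.any():
                out[mask] += gates[mask, slot:slot + 1] * (tokens[mask] @ expert_weights[e])
    return out, top_idx
```

Because routing depends only on token features, specialization can emerge per token without any dataset identifier, which is the property the contribution emphasizes.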
Multi-dataset training protocol without dataset labels at inference

The authors establish a realistic training and evaluation regime for large-scale multi-dataset joint training in 3D semantic segmentation where no dataset labels are available at inference time. This protocol enables fair comparison across diverse indoor and outdoor datasets in both seen and zero-shot settings.

10 retrieved papers
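One common way to realize such a protocol is to train over the union of all datasets' label spaces, so the model needs no dataset identifier at test time. The sketch below uses hypothetical class lists; the paper's actual datasets and taxonomy handling may differ.

```python
# Hypothetical per-dataset class lists (not the paper's actual taxonomies).
datasets = {
    "indoor_a":  ["wall", "floor", "chair", "table"],
    "indoor_b":  ["wall", "floor", "sofa"],
    "outdoor_a": ["road", "car", "vegetation"],
}

# One shared label space: the model predicts over the union of all classes,
# so no dataset label is needed at inference time.
union = sorted({c for classes in datasets.values() for c in classes})
class_to_id = {c: i for i, c in enumerate(union)}

# During training, a per-dataset mask restricts the loss to the classes that
# a given dataset actually annotates, avoiding false penalties on the rest.
loss_masks = {
    name: [class_to_id[c] for c in classes]
    for name, classes in datasets.items()
}
```

At inference, any point cloud is scored against the full union vocabulary, which is what makes zero-shot evaluation on unseen datasets well-defined.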
Systematic exploration of MoE design space for 3D point clouds

The authors conduct comprehensive ablation studies examining key MoE design choices including number of experts, sparsity level, placement within the architecture, normalization strategies, and training configurations. These experiments reveal effective configurations and trade-offs specific to 3D point cloud understanding.

10 retrieved papers
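An ablation over these axes amounts to sweeping a Cartesian product of design choices. The grid below is a hypothetical illustration of how such a design space might be enumerated; the axis values are assumptions, not the paper's actual settings.

```python
from itertools import product

# Hypothetical design-space grid mirroring the axes named above.
design_space = {
    "num_experts":   [4, 8, 16],
    "top_k":         [1, 2],
    "placement":     ["attention_proj", "ffn"],
    "normalization": ["pre_norm", "post_norm"],
}

# Enumerate every combination as one candidate training configuration.
configs = [dict(zip(design_space, combo))
           for combo in product(*design_space.values())]
# 3 * 2 * 2 * 2 = 24 configurations to ablate.
```

In practice such sweeps are usually pruned (e.g., fixing all but one axis at a time), since the full product grows multiplicatively with each added design choice.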

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
