EquAct: An SE(3)-Equivariant Multi-Task Transformer for 3D Robotic Manipulation

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: SE(3) equivariance; multi-task transformer; sample efficiency
Abstract:

Multi-task manipulation policies often build on a transformer's ability to jointly process language instructions and 3D observations in a shared embedding space. However, real-world tasks frequently require robots to generalize to novel 3D object poses. Policies based on a shared embedding break geometric consistency and struggle with 3D generalization. To address this issue, we propose EquAct, which is theoretically guaranteed to generalize to novel 3D scene transformations by leveraging SE(3) equivariance shared across language, observations, and actions. EquAct makes two key contributions: (1) an efficient SE(3)-equivariant point-cloud-based U-net with spherical Fourier features for policy reasoning, and (2) SE(3)-invariant Feature-wise Linear Modulation (iFiLM) layers for language conditioning. EquAct demonstrates strong spatial generalization and achieves state-of-the-art performance across 18 RLBench tasks with both SE(3) and SE(2) scene perturbations, under varying amounts of training data, and on 4 physical tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EquAct, an SE(3)-equivariant transformer architecture for multi-task manipulation with language conditioning. It resides in the 'Transformer-Based SE(3)-Equivariant Policies' leaf, which contains only two papers including this one. This indicates a relatively sparse research direction within the broader taxonomy of 11 papers across 8 leaf nodes. The sibling paper explores open-loop variants of equivariant policies, suggesting the leaf focuses specifically on transformer-based approaches that enforce geometric consistency through equivariance rather than end-to-end learning without structural priors.

The taxonomy reveals that EquAct sits within the 'SE(3)-Equivariant Policy Architectures' branch, which also includes equivariant grasp learning and open-vocabulary manipulation methods. Neighboring branches pursue alternative philosophies: 'Vision-Language-Action Models' emphasize large-scale pre-training and multimodal fusion without explicit equivariance constraints, while 'Keypoint-Based Task Specification' uses structured symbolic representations rather than learned geometric features. The taxonomy's scope notes clarify that EquAct's transformer-based equivariant design distinguishes it from both non-equivariant VLA models and keypoint-driven reasoning approaches, positioning it at the intersection of geometric structure and language grounding.

Among 18 candidates examined, the analysis found limited prior work overlap. The first contribution (SE(3)-equivariant transformer with U-net and iFiLM) examined 4 candidates with 1 potential refutation. The second contribution (equivariant U-net with spherical Fourier features) also examined 4 candidates with 1 refutation. The third contribution (mathematical proofs of equivariance properties) examined 10 candidates with 2 refutations. These statistics suggest that while some architectural components or theoretical results may have precedents in the limited search scope, the specific combination of transformer-based equivariance with language conditioning via iFiLM layers appears less explored within the examined literature.

Based on the top-18 semantic matches examined, EquAct appears to occupy a relatively novel position combining geometric equivariance with language-conditioned multi-task learning. The sparse population of its taxonomy leaf and the limited refutations found suggest this integration is less common than either pure equivariant methods or language-driven VLA models. However, the analysis does not cover exhaustive literature search beyond these candidates, and the field's rapid evolution may mean additional relevant work exists outside this scope.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 2

Research Landscape Overview

Core task: SE(3)-equivariant multi-task robotic manipulation with language conditioning. This field addresses how robots can learn manipulation policies that respect 3D geometric symmetries while interpreting natural language instructions across diverse tasks.

The taxonomy reveals four main branches. SE(3)-Equivariant Policy Architectures focus on building neural networks that inherently respect rotational and translational symmetries in 3D space, often using specialized layers or transformer designs to ensure predictions remain consistent under coordinate-frame changes. Vision-Language-Action Models for Manipulation emphasize large-scale pre-training and multimodal fusion, leveraging vision-language models to ground instructions in visual scenes and generate action sequences. Keypoint-Based Task Specification and Reasoning takes a more structured approach, representing tasks through semantic keypoints or spatial anchors that can be manipulated symbolically. Survey and Taxonomic Studies provide meta-level perspectives on these evolving methodologies, as seen in works like Diffusion Policy Survey[2].

Recent activity highlights contrasting philosophies: some lines pursue end-to-end learning with minimal inductive bias (e.g., Self Correcting VLA[8], Object Centric VLA[6]), while others inject geometric or symbolic structure to improve sample efficiency and generalization (ReKep[4], OrbitGrasp[7]).

The original paper EquAct[0] sits squarely within the SE(3)-Equivariant Policy Architectures branch, specifically among transformer-based approaches. It shares close conceptual ties with EquAct OpenLoop[1], which explores open-loop variants of equivariant policies, but distinguishes itself by integrating language conditioning more tightly into the equivariant framework. Compared to broader vision-language-action models like Real Sim Real VLM[3] or CrayonRobo[5], EquAct[0] prioritizes geometric consistency over large-scale pre-training, trading dataset scale for stronger inductive biases that can accelerate learning on manipulation tasks with clear 3D structure.

Claimed Contributions

SE(3)-equivariant multi-task transformer with efficient U-net architecture and iFiLM layers

The authors propose EquAct, a novel multi-task keyframe policy that achieves continuous SE(3) equivariance through a point transformer U-net using spherical Fourier features and introduces invariant FiLM layers to condition the policy on language instructions while preserving geometric invariance.

3 retrieved papers
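To illustrate why invariant FiLM conditioning can preserve equivariance, here is a minimal numerical sketch (not the paper's implementation; shapes, the `ifilm` helper, and the scalar values are hypothetical). The idea: language-derived scalars scale each feature degree, and the additive shift touches only the invariant l=0 channels, so modulation commutes with rotation.

```python
import numpy as np

np.random.seed(0)

def random_rotation():
    # Random 3x3 rotation matrix (acts on type-1 / vector features).
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    return q * np.sign(np.linalg.det(q))

def ifilm(f0, f1, gamma0, beta0, gamma1):
    # Invariant FiLM sketch: language-derived scalars scale each feature
    # degree; the additive shift is applied only to the invariant l=0
    # channels. Scalar modulation commutes with rotation, so the
    # equivariance of the l=1 pathway is preserved.
    return gamma0 * f0 + beta0, gamma1 * f1

R = random_rotation()
f0 = np.random.randn(4)        # 4 invariant (l=0) channels
f1 = np.random.randn(4, 3)     # 4 equivariant (l=1) channels
g0, b0, g1 = 1.7, 0.3, -0.5    # hypothetical language-conditioned scalars

out0, out1 = ifilm(f0, f1 @ R.T, g0, b0, g1)   # rotate input, then modulate
ref0, ref1 = ifilm(f0, f1, g0, b0, g1)         # modulate first for comparison

assert np.allclose(out0, ref0)          # l=0 output is invariant
assert np.allclose(out1, ref1 @ R.T)    # l=1 output is equivariant
```

Because the shift term would break equivariance if added to l>0 channels, restricting it to the scalar channels is what makes the conditioning "invariant" in this sketch.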
Novel equivariant U-net architecture with spherical Fourier maxpooling and upsampling

The paper introduces a new SE(3)-equivariant Point Transformer U-Net that incorporates novel spherical Fourier maxpooling and upsampling layers to efficiently compress and reconstruct point cloud features while maintaining equivariance, improving computational efficiency over prior equivariant architectures.

3 retrieved papers
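The pooling claim can be sanity-checked numerically. The following is a hedged sketch (the `spherical_norm_maxpool` helper is illustrative, not the paper's layer), assuming features are organized by spherical-harmonic degree: the l2 norm of a degree-l coefficient vector is rotation-invariant, so selecting the point with the largest norm commutes with rotating the inputs.

```python
import numpy as np

np.random.seed(1)

def spherical_norm_maxpool(feats):
    # feats: (n_points, 2l+1) type-l features inside one pooling window.
    # Each feature's l2 norm is rotation-invariant, so picking the point
    # with the largest norm (and keeping its full coefficient vector)
    # commutes with any rotation applied to the inputs.
    idx = np.argmax(np.linalg.norm(feats, axis=1))
    return feats[idx]

q, _ = np.linalg.qr(np.random.randn(3, 3))
R = q * np.sign(np.linalg.det(q))    # random rotation, acting on l=1 features

x = np.random.randn(8, 3)            # 8 points with type-1 (vector) features
pooled_then_rotated = spherical_norm_maxpool(x) @ R.T
rotated_then_pooled = spherical_norm_maxpool(x @ R.T)
assert np.allclose(pooled_then_rotated, rotated_then_pooled)
```

The same argument extends to any rotation-invariant selection criterion; the norm is simply the most common choice.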
Mathematical proofs of equivariance and invariance properties

The authors provide formal mathematical proofs demonstrating that their proposed components (spherical Fourier maxpooling, upsampling, iFiLM layers, and field networks) satisfy SE(3) equivariance with respect to observations and actions, and SE(3) invariance with respect to language instructions.

10 retrieved papers
Can Refute
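The shape of these claimed properties can be stated compactly. The following is a hedged restatement in illustrative notation (not the paper's own symbols), where $\pi$ maps an observation $O$ and a language instruction $L$ to an action:

```latex
% Equivariance in observations and actions:
\pi(g \cdot O,\; L) \;=\; g \cdot \pi(O,\; L)
\qquad \forall\, g \in \mathrm{SE}(3),
% Invariance in the language pathway: any language-conditioning
% feature h (e.g. the iFiLM modulation parameters) satisfies
h(g \cdot O,\; L) \;=\; h(O,\; L).
```

The component-wise proofs would then amount to showing that each layer (maxpooling, upsampling, iFiLM, field networks) commutes with the action of $g$ on its geometric inputs while depending on $L$ only through invariant quantities.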

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one limited by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SE(3)-equivariant multi-task transformer with efficient U-net architecture and iFiLM layers

The authors propose EquAct, a novel multi-task keyframe policy that achieves continuous SE(3) equivariance through a point transformer U-net using spherical Fourier features and introduces invariant FiLM layers to condition the policy on language instructions while preserving geometric invariance.

Contribution

Novel equivariant U-net architecture with spherical Fourier maxpooling and upsampling

The paper introduces a new SE(3)-equivariant Point Transformer U-Net that incorporates novel spherical Fourier maxpooling and upsampling layers to efficiently compress and reconstruct point cloud features while maintaining equivariance, improving computational efficiency over prior equivariant architectures.

Contribution

Mathematical proofs of equivariance and invariance properties

The authors provide formal mathematical proofs demonstrating that their proposed components (spherical Fourier maxpooling, upsampling, iFiLM layers, and field networks) satisfy SE(3) equivariance with respect to observations and actions, and SE(3) invariance with respect to language instructions.