WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Text-Motion Retrieval, 3D Human Motion

Abstract:

Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of the human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, which limits precise semantic alignment. To address this, we propose WaMo, a new wavelet-based multi-frequency feature extraction framework. It captures joint-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo's superiority, achieving 17.0% and 18.2% improvements in $Rsum$ on the HumanML3D and KIT-ML datasets, respectively, outperforming existing state-of-the-art (SOTA) methods. Code will be open-sourced upon acceptance.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes WaMo, a wavelet-based multi-frequency feature extraction framework for text-motion retrieval. According to the taxonomy tree, it resides in the 'Multi-Frequency and Wavelet-Based Retrieval' leaf under 'Core Text-Motion Retrieval Methods'. Notably, this leaf contains only one paper (WaMo itself), indicating a relatively sparse research direction within the broader text-motion retrieval landscape. The taxonomy shows 50 papers across approximately 36 topics, with most retrieval work concentrated in contrastive learning, hierarchical alignment, and multi-modal frameworks rather than frequency-domain approaches.

The taxonomy reveals that neighboring leaves focus on alternative retrieval strategies: 'Contrastive Learning-Based Retrieval' (TMR, Morag), 'Hierarchical and Part-Level Semantic Alignment', and 'Multi-Instance Multi-Label Retrieval'. These approaches emphasize embedding space structure, semantic hierarchies, or atomic motion decomposition rather than frequency-domain analysis. The scope note for WaMo's leaf explicitly excludes 'single-frequency or non-wavelet feature extraction methods', positioning wavelet decomposition as a distinctive technical choice. This suggests WaMo diverges from mainstream retrieval architectures by prioritizing multi-resolution temporal analysis over purely spatial or sequential encodings.

Across the three identified contributions, the literature search examined 12 candidates in total. For the core WaMo framework (Contribution 1), 10 candidates were examined and none could refute it, suggesting limited prior work on wavelet-based retrieval architectures within the search scope. For the Trajectory Wavelet Reconstruction module (Contribution 2), no candidates were retrieved, so no direct overlap was found, but no comparison was made either. For the Disordered Motion Sequence Prediction component (Contribution 3), 2 candidates were examined and 1 was judged able to refute it, suggesting this temporal structure learning technique has more substantial prior work. The limited search scale (12 candidates, not hundreds) means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the limited search scope of 12 candidates, WaMo's wavelet-based approach appears relatively novel within the examined literature, particularly for the core framework and reconstruction module. The temporal prediction component shows more overlap with existing work. The sparse population of its taxonomy leaf (only WaMo itself) and the concentration of retrieval research in alternative paradigms suggest this frequency-domain direction remains underexplored, though the small search scale prevents definitive claims about field-wide novelty.

Taxonomy

[Taxonomy tree figure. Legend: Core-task Taxonomy Papers; Claimed Contributions; Contribution Candidate Papers Compared; Refutable Paper]

Research Landscape Overview

Core task: text-motion retrieval focuses on matching natural language descriptions to corresponding human motion sequences, a fundamental capability for applications ranging from animation to robotics. The field's taxonomy reveals several major branches. Core Text-Motion Retrieval Methods develop specialized architectures for aligning text and motion embeddings, often exploring contrastive learning and multi-modal encoders such as TMR[7] and Morag[17]. Retrieval-Augmented Motion Generation leverages retrieved examples to guide synthesis, while Text-to-Motion Generation Methods like T2M-GPT[19] and ReMoDiffuse[8] emphasize creating novel motions from scratch. The Motion Editing and Preference Alignment branches address refinement and user-driven customization, and Motion-Language Representation Learning investigates foundational encodings that capture semantic nuances. Related Cross-Modal Retrieval Tasks draw connections to video-text and other modalities, while Surveys, Benchmarks, and Overviews provide structured evaluations and taxonomies such as Motion Generation Overview[18].

Within Core Text-Motion Retrieval Methods, a distinct though sparsely populated line of work explores frequency-domain and wavelet-based representations to capture motion dynamics at multiple temporal scales, in contrast to purely spatial or sequential encodings. WaMo[0] sits in this Multi-Frequency and Wavelet-Based Retrieval cluster, emphasizing decomposition strategies that reveal fine-grained temporal patterns often missed by standard embeddings. This approach differs from works like TMR[7], which trains with a contrastive objective in a joint latent space, and Morag[17], which integrates retrieval with generative priors. The trade-off is between the computational overhead of multi-scale analysis and improved discrimination of subtle motion variations. Open questions remain about how frequency-based features generalize across diverse motion datasets and whether hybrid architectures combining wavelet decompositions with transformer-based encoders can further enhance retrieval precision without sacrificing efficiency.

Claimed Contributions

WaMo: Wavelet-based multi-frequency feature extraction framework for text-motion retrieval

10 retrieved papers

The authors propose WaMo, a framework that uses wavelet transforms to decompose motion signals into multiple frequency components, capturing both frequency-specific characteristics and their inter-dependencies. This enables fine-grained alignment between 3D motion sequences and textual descriptions by preserving local kinematic details and global motion semantics.
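To make the idea of multi-frequency decomposition concrete, here is a minimal sketch of a multi-level Haar wavelet transform applied along the time axis of a joint trajectory. It is an illustration only, not the authors' implementation: the choice of Haar filters, the function names `haar_dwt` and `multilevel_haar`, the toy motion shape (64 frames, 22 joints, 3 coordinates), and the number of levels are all assumptions.

```python
import numpy as np

def haar_dwt(x):
    """One level of the orthonormal Haar transform along axis 0 (time).

    x has shape (T, ...) with T even. Returns (approx, detail), each of
    shape (T // 2, ...): a low-frequency and a high-frequency band.
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail

def multilevel_haar(x, levels):
    """Return [detail_1, ..., detail_L, approx_L], finest detail band first."""
    bands = []
    approx = x
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        bands.append(detail)
    bands.append(approx)
    return bands

# Toy motion (hypothetical shape): 64 frames, 22 joints, 3D coordinates.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 64)[:, None, None]
motion = np.sin(2 * np.pi * 2 * t) + 0.05 * rng.standard_normal((64, 22, 3))

bands = multilevel_haar(motion, levels=3)
shapes = [b.shape for b in bands]
```

Each joint keeps its own coefficients in every band, so the coarse band summarizes the overall movement while the detail bands retain fast, local changes. Because the Haar transform is orthonormal, the total signal energy is preserved across the bands.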

Trajectory Wavelet Reconstruction module with learnable inverse wavelet transforms

0 retrieved papers

The authors introduce a reconstruction module that applies learnable inverse wavelet transforms to recover original joint trajectories from extracted motion features. This acts as a regularization mechanism to ensure that the motion encoder captures essential spatial-temporal information, improving feature quality and discrimination.
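The regularizing role of reconstruction can be sketched with a fixed inverse Haar transform and a mean-squared reconstruction error. This is a simplified stand-in, not the module described in the paper: the report does not specify how the inverse transform is parameterized, so the `synth` weights below (which a learnable variant might treat as trainable) and all function names are assumptions.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

def haar_dwt(x):
    """One-level orthonormal Haar analysis along axis 0 (time)."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / SQRT2, (even - odd) / SQRT2

def haar_idwt(approx, detail, synth=(1 / SQRT2, 1 / SQRT2)):
    """Inverse of haar_dwt.

    synth holds the two synthesis weights. The default Haar values give
    exact reconstruction; a learnable variant (an assumption here) would
    optimize them, or richer filters, against a reconstruction loss.
    """
    g0, g1 = synth
    even = g0 * approx + g1 * detail
    odd = g0 * approx - g1 * detail
    out = np.empty((2 * approx.shape[0],) + approx.shape[1:], dtype=approx.dtype)
    out[0::2], out[1::2] = even, odd
    return out

def reconstruction_loss(x, recon):
    """Mean-squared error between original and reconstructed trajectories."""
    return float(np.mean((x - recon) ** 2))

rng = np.random.default_rng(0)
motion = rng.standard_normal((64, 22, 3))  # hypothetical (frames, joints, xyz)

approx, detail = haar_dwt(motion)
recon = haar_idwt(approx, detail)
loss_exact = reconstruction_loss(motion, recon)

# Mis-specified synthesis weights no longer invert the analysis step,
# which is the kind of error a reconstruction loss would penalize.
loss_perturbed = reconstruction_loss(motion, haar_idwt(approx, detail, synth=(0.6, 0.8)))
```

With the exact Haar weights the trajectory is recovered up to floating-point error, while perturbed weights produce a nonzero loss. In a trained model the same loss would be computed from encoder features rather than from raw wavelet coefficients.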

Disordered Motion Sequence Prediction for temporal structure learning

Can Refute

2 retrieved papers

The authors propose a self-supervised task where the model learns to recover the correct temporal ordering from shuffled motion sequences. This approach explicitly enforces learning of temporal dynamics and coherence in motions, improving alignment with sequential textual descriptions.
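A shuffle-and-reorder pretext task of this kind can be sketched as follows. The segment count, the per-segment classification loss, and the names `shuffle_segments`, `restore_order`, and `order_loss` are hypothetical choices for illustration; the report does not state how the authors segment sequences or formulate the prediction target.

```python
import numpy as np

def shuffle_segments(motion, num_segments, rng):
    """Split a motion along time into equal segments and shuffle them.

    Returns (shuffled, perm), where shuffled segment i is original
    segment perm[i]. Trailing frames that do not fill a segment are dropped.
    """
    seg_len = motion.shape[0] // num_segments
    usable = motion[: seg_len * num_segments]
    segments = usable.reshape(num_segments, seg_len, *motion.shape[1:])
    perm = rng.permutation(num_segments)
    return segments[perm].reshape(usable.shape), perm

def restore_order(shuffled, perm):
    """Undo shuffle_segments given the permutation (true or predicted)."""
    num_segments = len(perm)
    seg_len = shuffled.shape[0] // num_segments
    segments = shuffled.reshape(num_segments, seg_len, *shuffled.shape[1:])
    return segments[np.argsort(perm)].reshape(shuffled.shape)

def order_loss(logits, perm):
    """Cross-entropy where row i of logits scores the original index of
    shuffled segment i and the target class is perm[i]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(perm)), perm].mean())

rng = np.random.default_rng(0)
motion = np.arange(60 * 2, dtype=float).reshape(60, 2)  # toy (frames, features)
shuffled, perm = shuffle_segments(motion, num_segments=4, rng=rng)

good_logits = 10.0 * np.eye(4)[perm]   # confident, correct predictions
uniform_logits = np.zeros((4, 4))      # uninformed predictions
```

A model that scores the correct original position for each shuffled segment obtains a lower loss than one that guesses uniformly, and applying `restore_order` with the true permutation recovers the original sequence.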

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though that signal is still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: WaMo: Wavelet-based multi-frequency feature extraction framework for text-motion retrieval

[51] Identity-Preserving Text-To-Video Generation by Frequency Decomposition (Cannot Refute)

[52] Multimodal motion conditioned diffusion model for skeleton-based video anomaly detection (Cannot Refute)

[53] Free-T2M: Robust Text-to-Motion Generation for Humanoid Robots via Frequency-Domain (Cannot Refute)

[54] ANT: Adaptive Neural Temporal-Aware Text-to-Motion Model (Cannot Refute)

[55] Signal analysis: time, frequency, scale, and structure (Cannot Refute)

[56] SpectraSpan: Zero Fine-Tuning Long Video Generation Framework and Its Frequency Domain Optimization (Cannot Refute)

[57] LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model (Cannot Refute)

[58] Automatic video superimposed text detection based on nonsubsampled contourlet transform (Cannot Refute)

[59] Detecting both superimposed and scene text with multiple languages and multiple alignments in video (Cannot Refute)

[60] Multimedia systems: content-based indexing and retrieval (Cannot Refute)

Contribution 2: Trajectory Wavelet Reconstruction module with learnable inverse wavelet transforms

(No candidate papers retrieved.)

Contribution 3: Disordered Motion Sequence Prediction for temporal structure learning

[62] Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models (Can Refute)

[61] PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding? (Cannot Refute)
