EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: EMG, Zero-shot Gesture Classification, Cross-modal Representation Learning
Abstract:

Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g., surface electromyography (sEMG), allows for continuous gesture prediction on wearable devices. In this work, we aim to enhance EMG representation quality by aligning it with embeddings obtained from structured, high-quality modalities that provide richer semantic guidance, ultimately enabling zero-shot gesture generalization. Specifically, we propose EMBridge, a cross-modal representation learning framework that bridges the modality gap between EMG and pose. EMBridge learns high-quality EMG representations by introducing a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of the embedding spaces. We evaluate EMBridge on both in-distribution and unseen gesture classification tasks and demonstrate consistent performance gains over all baselines. To the best of our knowledge, EMBridge is the first cross-modal representation learning framework to achieve zero-shot gesture classification from wearable EMG signals, showing its potential for real-world gesture recognition on wearable devices.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EMBridge, a cross-modal representation learning framework that aligns EMG signals with pose embeddings to enable zero-shot gesture classification. It resides in the 'Cross-Modal Alignment with Pose or Video' leaf, which contains only one other sibling paper (CPEP). This leaf sits within the broader 'Cross-Modal Representation Learning and Foundation Models' branch, indicating a relatively sparse but emerging research direction. The taxonomy shows that cross-modal alignment represents one of several complementary strategies in the field, alongside domain adaptation, compositional representations, and foundation models.

The taxonomy reveals neighboring research directions that pursue zero-shot generalization through different mechanisms. The sibling branch 'Foundation Models for EMG' contains three papers building large-scale pre-trained models without cross-modal supervision. Adjacent branches include 'Cross-User and Cross-Session Domain Adaptation' (ten papers across three leaves) addressing user variability through transfer learning, and 'Zero-Shot Learning with Semantic Attributes' (one paper) using class descriptions rather than visual modalities. EMBridge diverges from these by leveraging structured pose data as supervisory signal, rather than relying solely on EMG-domain techniques or semantic metadata.

Among the three contributions analyzed, the framework and contrastive objective appear relatively novel within the limited search scope. The 'EMBridge framework' contribution was checked against ten candidates with zero refutations, and the 'CASCLe objective' similarly surfaced no overlapping prior work among its ten candidates. However, the 'first zero-shot gesture classification' claim was checked against nine candidates, three of which are potentially refutable, suggesting that zero-shot EMG gesture recognition has been explored previously. The analysis is based on twenty-nine total candidates from semantic search, not an exhaustive literature review, so these findings reflect the most semantically similar work rather than complete field coverage.

The limited search scope (twenty-nine candidates) and sparse taxonomy leaf (two papers total) suggest this work occupies a relatively unexplored intersection of cross-modal learning and EMG-based gesture recognition. The potential refutation of the 'first zero-shot' claim indicates that while the specific framework may be novel, the broader goal has precedent. The analysis cannot determine whether EMBridge's technical approach (combining Q-Former, masked reconstruction, and community-aware contrastive learning) represents a significant departure from the one identified sibling paper or the three refuting candidates.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: zero-shot gesture classification from wearable EMG signals. The field addresses the challenge of recognizing hand and finger gestures from electromyographic recordings without requiring labeled training data for every gesture class.

The taxonomy reveals a rich landscape organized around several complementary strategies. Cross-user and cross-session domain adaptation methods tackle the variability that arises when models trained on one individual or recording session must generalize to new users or conditions. Cross-modal representation learning and foundation models leverage auxiliary modalities, such as video, pose, or kinematic data, to build shared embeddings that enable zero-shot transfer. Compositional and disentangled latent representations aim to factorize gesture signals into interpretable components, while real-time intent detection and segmentation focus on continuous recognition pipelines. Supervised classification architectures and unsupervised feature learning provide the backbone techniques, and specialized branches address wrist-based compact systems, robustness to electrode shift and posture variation, anomaly detection for out-of-distribution rejection, and zero-shot learning with semantic attributes. Methodological reviews and application-specific recognition round out the taxonomy, reflecting both foundational research and deployment concerns.

A particularly active line of work explores cross-modal alignment, where methods like EMBridge[0] and CPEP[20] align EMG features with visual or kinematic representations to enable zero-shot generalization. These approaches contrast with purely signal-driven strategies such as domain adaptation (Linear Domain Adaptation[3]) or unsupervised clustering (Fuzzy Clustering EMG[8]), which do not rely on auxiliary modalities. Foundation models (EMG Foundation Model[4]) represent an emerging direction that seeks to pretrain large-scale representations across diverse datasets, bridging multiple branches of the taxonomy. EMBridge[0] sits squarely within the cross-modal alignment cluster, emphasizing the use of pose or video as a supervisory signal to learn transferable EMG embeddings. Compared to CPEP[20], which also pursues cross-modal learning, EMBridge[0] may differ in the specific alignment objective or the choice of auxiliary modality, while both share the goal of enabling recognition of unseen gestures through learned correspondences rather than direct labeled supervision.

Claimed Contributions

EMBridge cross-modal representation learning framework

The authors introduce EMBridge, a framework that enhances EMG representation quality by aligning it with pose embeddings through three components: a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of embedding spaces.

10 retrieved papers
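As described, the framework combines a Q-Former that pools EMG tokens through learned queries, a masked pose reconstruction loss, and a contrastive alignment objective. A minimal NumPy sketch of how such a three-part objective could be assembled; the shapes, the single-head attention without projections, and the plain InfoNCE term are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_pool(emg_tokens, queries):
    # Learned queries cross-attend to EMG token features (single head,
    # no projections) -- a minimal stand-in for a Q-Former block.
    d = queries.shape[-1]
    attn = softmax(queries @ emg_tokens.T / np.sqrt(d))  # (n_q, n_tok)
    return attn @ emg_tokens                             # (n_q, d)

def masked_pose_recon_loss(pred, target, mask):
    # MSE restricted to the masked pose entries.
    return (((pred - target) ** 2) * mask).sum() / max(mask.sum(), 1)

def contrastive_align_loss(emg_emb, pose_emb, tau=0.1):
    # Symmetric InfoNCE over paired EMG/pose embeddings.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = e @ p.T / tau
    lp_ep = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    lp_pe = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    return -0.5 * (np.diag(lp_ep).mean() + np.diag(lp_pe).mean())

# One forward pass on toy data (all shapes are illustrative).
emg_tokens = rng.normal(size=(32, 16))     # 32 EMG time tokens, dim 16
queries = rng.normal(size=(4, 16))         # 4 learned queries
pooled = qformer_pool(emg_tokens, queries).mean(axis=0)  # EMG embedding

pose = rng.normal(size=(21 * 3,))          # flattened hand skeleton
mask = rng.random(pose.shape) < 0.4        # ~40% of entries masked
pred_pose = pose + rng.normal(scale=0.1, size=pose.shape)  # stand-in decoder output

batch_emg = rng.normal(size=(8, 16))
batch_pose = rng.normal(size=(8, 16))
total = (masked_pose_recon_loss(pred_pose, pose, mask)
         + contrastive_align_loss(batch_emg, batch_pose))
```

In a real system the reconstruction would come from a decoder conditioned on the Q-Former output and the terms would likely be weighted; this sketch only shows how the three losses compose.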
Community-aware soft contrastive learning (CASCLe) objective

The authors propose CASCLe, a novel contrastive learning objective that constructs soft targets based on community-level structural similarities in the pose embedding space rather than treating all non-matching samples as equally distant negatives, thereby capturing semantic relationships between poses.

10 retrieved papers
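The key idea, as claimed, is to replace one-hot contrastive targets with soft targets that share mass across a community of similar poses. A hedged sketch of one way such a soft-target loss could look; the `alpha` and `tau` values and the precomputed `communities` labels are assumptions, and the paper may derive communities differently (e.g., via graph community detection over pose similarities):

```python
import numpy as np

def cascle_loss(emg_emb, pose_emb, communities, tau=0.1, alpha=0.5):
    # Soft-target contrastive loss: alpha of the target mass stays on the
    # matched pair, and the remaining (1 - alpha) is spread over samples
    # whose pose embeddings belong to the same community.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = e @ p.T / tau
    n = len(logits)
    targets = np.zeros((n, n))
    for i in range(n):
        same = communities == communities[i]       # community members (incl. self)
        targets[i, same] = (1.0 - alpha) / same.sum()
        targets[i, i] += alpha                     # extra mass on the true pair
    # Cross-entropy between soft targets and the softmax over similarities.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

With `alpha=1.0` the targets collapse back to one-hot and the loss reduces to standard InfoNCE, so the community term can be read as a smoothing of the usual contrastive objective.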
First zero-shot gesture classification from wearable EMG signals

The authors claim that EMBridge is the first framework to achieve zero-shot gesture classification from wearable EMG signals, demonstrating the ability to recognize novel gestures without requiring training samples for those gestures.

9 retrieved papers
Can Refute
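Zero-shot classification in an aligned EMG/pose embedding space typically reduces to nearest-prototype decoding: an unseen gesture needs only a pose-derived prototype, not EMG training samples. A toy sketch of this protocol (illustrative only; the paper's exact evaluation setup may differ):

```python
import numpy as np

def zero_shot_classify(emg_emb, class_prototypes):
    # Assign each EMG embedding to the gesture whose pose-derived
    # prototype is nearest by cosine similarity.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    c = class_prototypes / np.linalg.norm(class_prototypes, axis=1, keepdims=True)
    return np.argmax(e @ c.T, axis=1)

# Toy check: prototypes along coordinate axes, EMG embeddings near them.
protos = np.eye(3)
emg = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.0, 0.8],
                [0.0, 0.7, 0.2]])
preds = zero_shot_classify(emg, protos)  # -> [0, 2, 1]
```

Under this protocol, "zero-shot" means new gesture classes are added only by supplying new prototypes in the shared space, which is why the quality of the cross-modal alignment directly bounds unseen-gesture accuracy.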

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
