EMBridge: Enhancing Gesture Generalization from EMG Signals Through Cross-modal Representation Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: EMG, Zero-shot Gesture Classification, Cross-modal Representation Learning
Abstract:

Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g., surface electromyography (sEMG), allows for continuous gesture prediction on wearable devices. In this work, we aim to enhance EMG representation quality by aligning it with embeddings obtained from structured, high-quality modalities that provide richer semantic guidance, ultimately enabling zero-shot gesture generalization. Specifically, we propose EMBridge, a cross-modal representation learning framework that bridges the modality gap between EMG and pose. EMBridge learns high-quality EMG representations by introducing a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of the embedding spaces. We evaluate EMBridge on both in-distribution and unseen gesture classification tasks and demonstrate consistent performance gains over all baselines. To the best of our knowledge, EMBridge is the first cross-modal representation learning framework to achieve zero-shot gesture classification from wearable EMG signals, showing its potential for real-world gesture recognition on wearable devices.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EMBridge, a cross-modal representation learning framework that aligns EMG signals with pose embeddings to enable zero-shot gesture classification. It resides in the 'Cross-Modal Alignment with Pose or Video' leaf, which contains only one other sibling paper (CPEP). This leaf sits within the broader 'Cross-Modal Representation Learning and Foundation Models' branch, indicating a relatively sparse but emerging research direction. The taxonomy shows that cross-modal alignment represents one of several complementary strategies in the field, alongside domain adaptation, compositional representations, and foundation models.

The taxonomy reveals neighboring research directions that pursue zero-shot generalization through different mechanisms. The sibling branch 'Foundation Models for EMG' contains three papers building large-scale pre-trained models without cross-modal supervision. Adjacent branches include 'Cross-User and Cross-Session Domain Adaptation' (ten papers across three leaves) addressing user variability through transfer learning, and 'Zero-Shot Learning with Semantic Attributes' (one paper) using class descriptions rather than visual modalities. EMBridge diverges from these by leveraging structured pose data as supervisory signal, rather than relying solely on EMG-domain techniques or semantic metadata.

Among the three contributions analyzed, the framework and contrastive objective appear relatively novel within the limited search scope. The 'EMBridge framework' contribution was checked against ten candidates with zero refutations, and the 'CASCLe objective' similarly surfaced no overlapping prior work among its ten candidates. However, the 'first zero-shot gesture classification' claim was checked against nine candidates, three of which are potentially refutable, suggesting that zero-shot EMG gesture recognition has been explored previously. The analysis is based on twenty-nine total candidates from semantic search, not an exhaustive literature review, so these findings reflect the most semantically similar work rather than complete field coverage.

The limited search scope (twenty-nine candidates) and sparse taxonomy leaf (two papers total) suggest this work occupies a relatively unexplored intersection of cross-modal learning and EMG-based gesture recognition. The potential refutation of the 'first zero-shot' claim indicates that while the specific framework may be novel, the broader goal has precedent. The analysis cannot determine whether EMBridge's technical approach (combining Q-Former, masked reconstruction, and community-aware contrastive learning) represents a significant departure from the one identified sibling paper or the three refuting candidates.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: zero-shot gesture classification from wearable EMG signals. The field addresses the challenge of recognizing hand and finger gestures from electromyographic recordings without requiring labeled training data for every gesture class.

The taxonomy reveals a rich landscape organized around several complementary strategies. Cross-user and cross-session domain adaptation methods tackle the variability that arises when models trained on one individual or recording session must generalize to new users or conditions. Cross-modal representation learning and foundation models leverage auxiliary modalities, such as video, pose, or kinematic data, to build shared embeddings that enable zero-shot transfer. Compositional and disentangled latent representations aim to factorize gesture signals into interpretable components, while real-time intent detection and segmentation focus on continuous recognition pipelines. Supervised classification architectures and unsupervised feature learning provide the backbone techniques, and specialized branches address wrist-based compact systems, robustness to electrode shift and posture variation, anomaly detection for out-of-distribution rejection, and zero-shot learning with semantic attributes. Methodological reviews and application-specific recognition round out the taxonomy, reflecting both foundational research and deployment concerns.

A particularly active line of work explores cross-modal alignment, where methods like EMBridge[0] and CPEP[20] align EMG features with visual or kinematic representations to enable zero-shot generalization. These approaches contrast with purely signal-driven strategies such as domain adaptation (Linear Domain Adaptation[3]) or unsupervised clustering (Fuzzy Clustering EMG[8]), which do not rely on auxiliary modalities. Foundation models (EMG Foundation Model[4]) represent an emerging direction that seeks to pretrain large-scale representations across diverse datasets, bridging multiple branches of the taxonomy. EMBridge[0] sits squarely within the cross-modal alignment cluster, emphasizing the use of pose or video as a supervisory signal to learn transferable EMG embeddings. Compared to CPEP[20], which also pursues cross-modal learning, EMBridge[0] may differ in the specific alignment objective or the choice of auxiliary modality, while both share the goal of enabling recognition of unseen gestures through learned correspondences rather than direct labeled supervision.

Claimed Contributions

EMBridge cross-modal representation learning framework

The authors introduce EMBridge, a framework that enhances EMG representation quality by aligning it with pose embeddings through three components: a Querying Transformer (Q-Former), a masked pose reconstruction loss, and a community-aware soft contrastive learning objective that aligns the relative geometry of embedding spaces.

10 retrieved papers
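As described, the framework combines a Q-Former that pools EMG tokens through learned queries, a masked pose reconstruction loss, and a contrastive alignment objective. A minimal NumPy sketch of how such a three-part objective could be assembled; the shapes, the single-head attention without projections, and the plain InfoNCE term are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qformer_pool(emg_tokens, queries):
    # Learned queries cross-attend to EMG token features (single head,
    # no projections) -- a minimal stand-in for a Q-Former block.
    d = queries.shape[-1]
    attn = softmax(queries @ emg_tokens.T / np.sqrt(d))  # (n_q, n_tok)
    return attn @ emg_tokens                             # (n_q, d)

def masked_pose_recon_loss(pred, target, mask):
    # MSE restricted to the masked pose entries.
    return (((pred - target) ** 2) * mask).sum() / max(mask.sum(), 1)

def contrastive_align_loss(emg_emb, pose_emb, tau=0.1):
    # Symmetric InfoNCE over paired EMG/pose embeddings.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = e @ p.T / tau
    lp_ep = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    lp_pe = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    return -0.5 * (np.diag(lp_ep).mean() + np.diag(lp_pe).mean())

# One forward pass on toy data (all shapes are illustrative).
emg_tokens = rng.normal(size=(32, 16))     # 32 EMG time tokens, dim 16
queries = rng.normal(size=(4, 16))         # 4 learned queries
pooled = qformer_pool(emg_tokens, queries).mean(axis=0)  # EMG embedding

pose = rng.normal(size=(21 * 3,))          # flattened hand skeleton
mask = rng.random(pose.shape) < 0.4        # ~40% of entries masked
pred_pose = pose + rng.normal(scale=0.1, size=pose.shape)  # stand-in decoder output

batch_emg = rng.normal(size=(8, 16))
batch_pose = rng.normal(size=(8, 16))
total = (masked_pose_recon_loss(pred_pose, pose, mask)
         + contrastive_align_loss(batch_emg, batch_pose))
```

In a real system the reconstruction would come from a decoder conditioned on the Q-Former output and the terms would likely be weighted; this sketch only shows how the three losses compose.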
Community-aware soft contrastive learning (CASCLe) objective

The authors propose CASCLe, a novel contrastive learning objective that constructs soft targets based on community-level structural similarities in the pose embedding space rather than treating all non-matching samples as equally distant negatives, thereby capturing semantic relationships between poses.

10 retrieved papers
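The key idea, as claimed, is to replace one-hot contrastive targets with soft targets that share mass across a community of similar poses. A hedged sketch of one way such a soft-target loss could look; the `alpha` and `tau` values and the precomputed `communities` labels are assumptions, and the paper may derive communities differently (e.g., via graph community detection over pose similarities):

```python
import numpy as np

def cascle_loss(emg_emb, pose_emb, communities, tau=0.1, alpha=0.5):
    # Soft-target contrastive loss: alpha of the target mass stays on the
    # matched pair, and the remaining (1 - alpha) is spread over samples
    # whose pose embeddings belong to the same community.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    p = pose_emb / np.linalg.norm(pose_emb, axis=1, keepdims=True)
    logits = e @ p.T / tau
    n = len(logits)
    targets = np.zeros((n, n))
    for i in range(n):
        same = communities == communities[i]       # community members (incl. self)
        targets[i, same] = (1.0 - alpha) / same.sum()
        targets[i, i] += alpha                     # extra mass on the true pair
    # Cross-entropy between soft targets and the softmax over similarities.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()
```

With `alpha=1.0` the targets collapse back to one-hot and the loss reduces to standard InfoNCE, so the community term can be read as a smoothing of the usual contrastive objective.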
First zero-shot gesture classification from wearable EMG signals

The authors claim that EMBridge is the first framework to achieve zero-shot gesture classification from wearable EMG signals, demonstrating the ability to recognize novel gestures without requiring training samples for those gestures.

9 retrieved papers
Can Refute
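Zero-shot classification in an aligned EMG/pose embedding space typically reduces to nearest-prototype decoding: an unseen gesture needs only a pose-derived prototype, not EMG training samples. A toy sketch of this protocol (illustrative only; the paper's exact evaluation setup may differ):

```python
import numpy as np

def zero_shot_classify(emg_emb, class_prototypes):
    # Assign each EMG embedding to the gesture whose pose-derived
    # prototype is nearest by cosine similarity.
    e = emg_emb / np.linalg.norm(emg_emb, axis=1, keepdims=True)
    c = class_prototypes / np.linalg.norm(class_prototypes, axis=1, keepdims=True)
    return np.argmax(e @ c.T, axis=1)

# Toy check: prototypes along coordinate axes, EMG embeddings near them.
protos = np.eye(3)
emg = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.0, 0.8],
                [0.0, 0.7, 0.2]])
preds = zero_shot_classify(emg, protos)  # -> [0, 2, 1]
```

Under this protocol, "zero-shot" means new gesture classes are added only by supplying new prototypes in the shared space, which is why the quality of the cross-modal alignment directly bounds unseen-gesture accuracy.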

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
