From Pixels to Semantics: Unified Facial Action Representation Learning for Micro-Expression Analysis

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: micro-expression recognition, micro-expression generation
Abstract:

Micro-expression recognition (MER) is highly challenging due to the subtle and rapid facial muscle movements involved and the scarcity of annotated data. Existing methods typically rely on pixel-level motion descriptors such as optical flow and frame difference, which tend to be sensitive to identity and lack generalization. In this work, we propose D-FACE, a Discrete Facial ACtion Encoding framework that leverages large-scale facial video data to pretrain an identity- and domain-invariant facial action tokenizer for MER. For the first time, MER is shifted from relying on pixel-level motion descriptors to leveraging semantic-level facial action tokens, providing compact and generalizable representations of facial dynamics. Empirical analyses reveal that these tokens exhibit position-dependent semantics, motivating sequential modeling. Building on this insight, we employ a Transformer with sparse attention pooling to selectively capture discriminative action cues. Furthermore, to explicitly bridge action tokens with human-understandable emotions, we introduce an emotion-description-guided CLIP (EDCLIP) alignment. EDCLIP leverages textual prompts as semantic anchors for representation learning, while enforcing that the "others" category, which lacks corresponding prompts due to its ambiguity, remains distant from all anchor prompts. Extensive experiments on multiple datasets demonstrate that our method achieves not only state-of-the-art recognition accuracy but also high-quality cross-identity and even cross-domain micro-expression generation, suggesting a paradigm shift from pixel-level to generalizable semantic-level facial motion analysis.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces D-FACE, a discrete facial action encoding framework that shifts micro-expression recognition from pixel-level motion descriptors to semantic-level action tokens. Within the taxonomy, it occupies the 'Discrete Facial Action Encoding' leaf under 'Unified Representation Learning', where it is currently the sole representative. This positioning suggests the paper targets a relatively sparse research direction—one that bridges low-level visual features and high-level semantic representations—while neighboring leaves like 'Prompt-Based Adaptation' contain only one other work, indicating that unified semantic encoding for micro-expressions remains an emerging area.

The taxonomy reveals that most prior work clusters around either AU-centric modeling (e.g., Transformer-Based AU Detection, Continuous AU Intensity Modeling) or motion-based extraction (e.g., Optical Flow Magnification). D-FACE diverges by pretraining an identity-invariant tokenizer rather than relying on predefined AU labels or optical flow. Its closest conceptual neighbors are methods in 'Multimodal Fusion with Language Priors', particularly CLIP-based alignment, which also seek to inject semantic structure. However, those approaches typically fuse visual and textual modalities post-hoc, whereas D-FACE embeds semantic structure directly into the action encoding stage, suggesting a distinct methodological stance.

Across all three contributions, the analysis examined ten candidates each (thirty in total), with zero refutable pairs identified in any case. For the core D-FACE framework, the absence of overlapping prior work among its ten candidates aligns with its unique position as the only paper in its taxonomy leaf. The sequential modeling contribution and the EDCLIP alignment similarly show no clear refutation, though the limited search scope (top-10 semantic matches per contribution) means that more distant or domain-specific prior work may exist outside this sample. The statistics suggest that, within the examined literature, these contributions appear relatively novel, but the analysis does not claim exhaustive coverage.

Given the sparse taxonomy leaf and the absence of refutable candidates among thirty examined papers, D-FACE appears to occupy a novel niche in micro-expression recognition. However, the limited search scope and the emerging nature of unified semantic encoding mean that broader or more specialized literature may reveal additional context. The analysis provides a snapshot of novelty within the examined sample, not a definitive field-wide assessment.

Taxonomy

14 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: Micro-expression recognition from facial action tokens. The field has evolved around several complementary strategies for capturing and representing the subtle, fleeting facial movements that characterize micro-expressions. At the highest level, one branch focuses on Facial Action Unit Detection and Modeling, where methods such as Auformer[2] and AU-LLM[10] leverage structured AU representations to parse fine-grained facial dynamics. A second branch emphasizes Motion-Based Feature Extraction, with approaches like Optical Flow Magnification[11] and Motion Prompt Tuning[6] extracting temporal cues from pixel-level motion patterns. Meanwhile, Unified Representation Learning seeks to bridge low-level visual features and high-level semantic tokens—exemplified by works such as Pixels to Semantics[0] and PTSR[3]—while Efficient Token-Based Architectures explore lightweight designs like MicroMamba[8] and SL-Swin[4]. Finally, Multimodal Fusion with Language Priors integrates textual or semantic guidance, as seen in Contrastive Language-Image[13] and Behavior Prompted Learning[1], to enrich visual encodings with contextual knowledge.

A particularly active line of inquiry revolves around how to encode facial actions in a discrete, interpretable manner without sacrificing the rich temporal information inherent in micro-expressions. Pixels to Semantics[0] sits squarely within the Unified Representation Learning branch, proposing a discrete facial action encoding that transforms raw pixel data into semantic tokens. This contrasts with motion-centric methods like Optical Flow Magnification[11], which prioritize amplifying subtle movements, and with AU-focused pipelines such as Auformer[2], which rely on predefined action unit labels. Compared to PTSR[3], which also pursues unified representations, Pixels to Semantics[0] emphasizes the transition from continuous features to discrete tokens, potentially offering greater interpretability and modularity.

Meanwhile, multimodal approaches like Behavior Prompted Learning[1] and Contrastive Language-Image[13] highlight an emerging trend of injecting language priors to guide recognition, raising questions about how best to balance data-driven token learning with external semantic constraints.

Claimed Contributions

D-FACE: Discrete Facial Action Encoding framework for micro-expression recognition

The authors propose D-FACE, a framework that shifts micro-expression recognition from pixel-level motion descriptors (like optical flow) to semantic-level facial action tokens. These tokens are pretrained on large-scale facial video data and are designed to be identity- and domain-invariant, providing a generalizable representation of facial dynamics.

10 retrieved papers
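The tokenizer itself is not specified in this report. As a rough illustration of what discrete facial action encoding could look like, the sketch below assumes a VQ-style nearest-neighbor assignment of per-frame motion features to a learned codebook of action tokens; the function name, the (K, D) codebook, and all shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous per-frame features to discrete action-token ids.

    features: (T, D) per-frame motion features for one clip
    codebook: (K, D) learned codebook of facial action prototypes
    Returns the (T,) token ids and their (T, D) quantized vectors.
    """
    # Squared L2 distance from every frame to every codebook entry -> (T, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d.argmin(axis=1)          # nearest prototype per frame
    return ids, codebook[ids]       # discrete ids + quantized embeddings
```

Because each frame collapses onto a shared prototype, identity-specific pixel detail is discarded by construction, which is one plausible route to the identity invariance the contribution claims.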
Sequential modeling of order-dependent facial action tokens using Transformer with sparse attention pooling

The authors discover through empirical analysis that facial action tokens have order-dependent semantics and can be organized as one-dimensional sequences. They leverage this insight by using a Transformer with sparse attention pooling to selectively capture discriminative action cues for micro-expression recognition.

10 retrieved papers
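The report does not define sparse attention pooling. One plausible reading, sketched below, is a learned pooling query whose attention is restricted to the top-k highest-scoring tokens, so that only a few discriminative action tokens contribute to the clip-level representation; the function name, the top-k mechanism, and all shapes are assumptions.

```python
import numpy as np

def sparse_attention_pool(tokens, query, k=2):
    """Pool a token sequence into one vector using top-k sparse attention.

    tokens: (T, D) Transformer outputs for the action-token sequence
    query:  (D,) learned pooling query
    Only the k most-attended tokens receive nonzero weight.
    """
    scores = tokens @ query / np.sqrt(tokens.shape[1])  # scaled dot-product scores
    topk = np.argsort(scores)[-k:]                      # indices of the k best tokens
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                                        # softmax over retained tokens only
    return (w[:, None] * tokens[topk]).sum(axis=0)      # weighted sum -> clip vector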
Emotion-description-guided CLIP alignment (EDCLIP) for bridging action tokens with emotions

The authors introduce EDCLIP, an alignment mechanism that uses textual prompts combining emotion names with facial action descriptions as semantic anchors. This approach explicitly connects learned action tokens to human-understandable emotions, while handling the ambiguous others category by pushing it away from emotion-specific textual embeddings rather than aligning it to an ill-defined description.

10 retrieved papers
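As a hedged illustration of the described mechanism, the sketch below combines a standard InfoNCE pull toward each sample's emotion text anchor with a margin-based push that keeps "others" samples away from every anchor. The loss form, margin, and temperature are guesses; the paper's actual objective may differ.

```python
import numpy as np

def edclip_loss(z, labels, anchors, tau=0.07, margin=0.2, others_id=-1):
    """Align visual embeddings with emotion-description text anchors.

    z:       (N, D) L2-normalized embeddings of action-token sequences
    anchors: (C, D) L2-normalized text embeddings of emotion descriptions
    labels:  (N,) emotion index in [0, C), or others_id for the "others" class
    """
    sims = z @ anchors.T                    # (N, C) cosine similarities
    total = 0.0
    for i, y in enumerate(labels):
        if y == others_id:
            # "others" has no prompt: penalize similarity above `margin` to any anchor
            total += np.maximum(sims[i] - margin, 0.0).sum()
        else:
            # labelled samples: InfoNCE pull toward the matching emotion anchor
            logits = sims[i] / tau
            total += -logits[y] + np.log(np.exp(logits).sum())
    return total / len(labels)
```

Treating "others" with a repulsion term rather than its own anchor matches the report's description of pushing the ambiguous category away from all emotion-specific textual embeddings.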

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

D-FACE: Discrete Facial Action Encoding framework for micro-expression recognition

The authors propose D-FACE, a framework that shifts micro-expression recognition from pixel-level motion descriptors (like optical flow) to semantic-level facial action tokens. These tokens are pretrained on large-scale facial video data and are designed to be identity- and domain-invariant, providing a generalizable representation of facial dynamics.

Contribution

Sequential modeling of order-dependent facial action tokens using Transformer with sparse attention pooling

The authors discover through empirical analysis that facial action tokens have order-dependent semantics and can be organized as one-dimensional sequences. They leverage this insight by using a Transformer with sparse attention pooling to selectively capture discriminative action cues for micro-expression recognition.

Contribution

Emotion-description-guided CLIP alignment (EDCLIP) for bridging action tokens with emotions

The authors introduce EDCLIP, an alignment mechanism that uses textual prompts combining emotion names with facial action descriptions as semantic anchors. This approach explicitly connects learned action tokens to human-understandable emotions, while handling the ambiguous others category by pushing it away from emotion-specific textual embeddings rather than aligning it to an ill-defined description.