From Pixels to Semantics: Unified Facial Action Representation Learning for Micro-Expression Analysis

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: micro-expression recognition, micro-expression generation
Abstract:

Micro-expression recognition (MER) is highly challenging due to the subtle and rapid facial muscle movements involved and the scarcity of annotated data. Existing methods typically rely on pixel-level motion descriptors such as optical flow and frame difference, which tend to be sensitive to identity and lack generalization. In this work, we propose D-FACE, a Discrete Facial ACtion Encoding framework that leverages large-scale facial video data to pretrain an identity- and domain-invariant facial action tokenizer for MER. For the first time, MER is shifted from relying on pixel-level motion descriptors to leveraging semantic-level facial action tokens, providing compact and generalizable representations of facial dynamics. Empirical analyses reveal that these tokens exhibit position-dependent semantics, motivating sequential modeling. Building on this insight, we employ a Transformer with sparse attention pooling to selectively capture discriminative action cues. Furthermore, to explicitly bridge action tokens with human-understandable emotions, we introduce an emotion-description-guided CLIP (EDCLIP) alignment. EDCLIP leverages textual prompts as semantic anchors for representation learning, while enforcing that the "others" category, which lacks corresponding prompts due to its ambiguity, remains distant from all anchor prompts. Extensive experiments on multiple datasets demonstrate that our method achieves not only state-of-the-art recognition accuracy but also high-quality cross-identity and even cross-domain micro-expression generation, suggesting a paradigm shift from pixel-level to generalizable semantic-level facial motion analysis.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces D-FACE, a discrete facial action encoding framework that shifts micro-expression recognition from pixel-level motion descriptors to semantic-level action tokens. Within the taxonomy, it occupies the 'Discrete Facial Action Encoding' leaf under 'Unified Representation Learning', where it is currently the sole representative. This positioning suggests the paper targets a relatively sparse research direction—one that bridges low-level visual features and high-level semantic representations—while neighboring leaves like 'Prompt-Based Adaptation' contain only one other work, indicating that unified semantic encoding for micro-expressions remains an emerging area.

The taxonomy reveals that most prior work clusters around either AU-centric modeling (e.g., Transformer-Based AU Detection, Continuous AU Intensity Modeling) or motion-based extraction (e.g., Optical Flow Magnification). D-FACE diverges by pretraining an identity-invariant tokenizer rather than relying on predefined AU labels or optical flow. Its closest conceptual neighbors are methods in 'Multimodal Fusion with Language Priors', particularly CLIP-based alignment, which also seek to inject semantic structure. However, those approaches typically fuse visual and textual modalities post-hoc, whereas D-FACE embeds semantic structure directly into the action encoding stage, suggesting a distinct methodological stance.

Across all three contributions, the analysis examined ten candidates each (thirty in total), with zero refutable pairs identified in any case. For the core D-FACE framework, the absence of overlapping prior work among its ten candidates aligns with its unique position as the only paper in its taxonomy leaf. The sequential modeling contribution and the EDCLIP alignment similarly show no clear refutation, though the limited search scope (top-10 semantic matches per contribution) means that more distant or domain-specific prior work may exist outside this sample. The statistics suggest that, within the examined literature, these contributions appear relatively novel, but the analysis does not claim exhaustive coverage.

Given the sparse taxonomy leaf and the absence of refutable candidates among thirty examined papers, D-FACE appears to occupy a novel niche in micro-expression recognition. However, the limited search scope and the emerging nature of unified semantic encoding mean that broader or more specialized literature may reveal additional context. The analysis provides a snapshot of novelty within the examined sample, not a definitive field-wide assessment.

Taxonomy

14 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: Micro-expression recognition from facial action tokens. The field has evolved around several complementary strategies for capturing and representing the subtle, fleeting facial movements that characterize micro-expressions. At the highest level, one branch focuses on Facial Action Unit Detection and Modeling, where methods such as Auformer[2] and AU-LLM[10] leverage structured AU representations to parse fine-grained facial dynamics. A second branch emphasizes Motion-Based Feature Extraction, with approaches like Optical Flow Magnification[11] and Motion Prompt Tuning[6] extracting temporal cues from pixel-level motion patterns. Meanwhile, Unified Representation Learning seeks to bridge low-level visual features and high-level semantic tokens—exemplified by works such as Pixels to Semantics[0] and PTSR[3]—while Efficient Token-Based Architectures explore lightweight designs like MicroMamba[8] and SL-Swin[4]. Finally, Multimodal Fusion with Language Priors integrates textual or semantic guidance, as seen in Contrastive Language-Image[13] and Behavior Prompted Learning[1], to enrich visual encodings with contextual knowledge.

A particularly active line of inquiry revolves around how to encode facial actions in a discrete, interpretable manner without sacrificing the rich temporal information inherent in micro-expressions. Pixels to Semantics[0] sits squarely within the Unified Representation Learning branch, proposing a discrete facial action encoding that transforms raw pixel data into semantic tokens. This contrasts with motion-centric methods like Optical Flow Magnification[11], which prioritize amplifying subtle movements, and with AU-focused pipelines such as Auformer[2], which rely on predefined action unit labels. Compared to PTSR[3], which also pursues unified representations, Pixels to Semantics[0] emphasizes the transition from continuous features to discrete tokens, potentially offering greater interpretability and modularity.

Meanwhile, multimodal approaches like Behavior Prompted Learning[1] and Contrastive Language-Image[13] highlight an emerging trend of injecting language priors to guide recognition, raising questions about how best to balance data-driven token learning with external semantic constraints.

Claimed Contributions

D-FACE: Discrete Facial Action Encoding framework for micro-expression recognition

The authors propose D-FACE, a framework that shifts micro-expression recognition from pixel-level motion descriptors (like optical flow) to semantic-level facial action tokens. These tokens are pretrained on large-scale facial video data and are designed to be identity- and domain-invariant, providing a generalizable representation of facial dynamics.

10 retrieved papers
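The tokenizer itself is not specified in this report. As a rough illustration of what discrete facial action encoding could look like, the sketch below assumes a VQ-style nearest-neighbor assignment of per-frame motion features to a learned codebook of action tokens; the function name, the (K, D) codebook, and all shapes are hypothetical, not taken from the paper.

```python
import numpy as np

def quantize(features, codebook):
    """Map continuous per-frame features to discrete action-token ids.

    features: (T, D) per-frame motion features for one clip
    codebook: (K, D) learned codebook of facial action prototypes
    Returns the (T,) token ids and their (T, D) quantized vectors.
    """
    # Squared L2 distance from every frame to every codebook entry -> (T, K)
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d.argmin(axis=1)          # nearest prototype per frame
    return ids, codebook[ids]       # discrete ids + quantized embeddings
```

Because each frame collapses onto a shared prototype, identity-specific pixel detail is discarded by construction, which is one plausible route to the identity invariance the contribution claims.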
Sequential modeling of order-dependent facial action tokens using Transformer with sparse attention pooling

The authors discover through empirical analysis that facial action tokens have order-dependent semantics and can be organized as one-dimensional sequences. They leverage this insight by using a Transformer with sparse attention pooling to selectively capture discriminative action cues for micro-expression recognition.

10 retrieved papers
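The report does not define sparse attention pooling. One plausible reading, sketched below, is a learned pooling query whose attention is restricted to the top-k highest-scoring tokens, so that only a few discriminative action tokens contribute to the clip-level representation; the function name, the top-k mechanism, and all shapes are assumptions.

```python
import numpy as np

def sparse_attention_pool(tokens, query, k=2):
    """Pool a token sequence into one vector using top-k sparse attention.

    tokens: (T, D) Transformer outputs for the action-token sequence
    query:  (D,) learned pooling query
    Only the k most-attended tokens receive nonzero weight.
    """
    scores = tokens @ query / np.sqrt(tokens.shape[1])  # scaled dot-product scores
    topk = np.argsort(scores)[-k:]                      # indices of the k best tokens
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                                        # softmax over retained tokens only
    return (w[:, None] * tokens[topk]).sum(axis=0)      # weighted sum -> clip vector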
Emotion-description-guided CLIP alignment (EDCLIP) for bridging action tokens with emotions

The authors introduce EDCLIP, an alignment mechanism that uses textual prompts combining emotion names with facial action descriptions as semantic anchors. This approach explicitly connects learned action tokens to human-understandable emotions, while handling the ambiguous others category by pushing it away from emotion-specific textual embeddings rather than aligning it to an ill-defined description.

10 retrieved papers
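As a hedged illustration of the described mechanism, the sketch below combines a standard InfoNCE pull toward each sample's emotion text anchor with a margin-based push that keeps "others" samples away from every anchor. The loss form, margin, and temperature are guesses; the paper's actual objective may differ.

```python
import numpy as np

def edclip_loss(z, labels, anchors, tau=0.07, margin=0.2, others_id=-1):
    """Align visual embeddings with emotion-description text anchors.

    z:       (N, D) L2-normalized embeddings of action-token sequences
    anchors: (C, D) L2-normalized text embeddings of emotion descriptions
    labels:  (N,) emotion index in [0, C), or others_id for the "others" class
    """
    sims = z @ anchors.T                    # (N, C) cosine similarities
    total = 0.0
    for i, y in enumerate(labels):
        if y == others_id:
            # "others" has no prompt: penalize similarity above `margin` to any anchor
            total += np.maximum(sims[i] - margin, 0.0).sum()
        else:
            # labelled samples: InfoNCE pull toward the matching emotion anchor
            logits = sims[i] / tau
            total += -logits[y] + np.log(np.exp(logits).sum())
    return total / len(labels)
```

Treating "others" with a repulsion term rather than its own anchor matches the report's description of pushing the ambiguous category away from all emotion-specific textual embeddings.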

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty, though one still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

D-FACE: Discrete Facial Action Encoding framework for micro-expression recognition

The authors propose D-FACE, a framework that shifts micro-expression recognition from pixel-level motion descriptors (like optical flow) to semantic-level facial action tokens. These tokens are pretrained on large-scale facial video data and are designed to be identity- and domain-invariant, providing a generalizable representation of facial dynamics.

Contribution

Sequential modeling of order-dependent facial action tokens using Transformer with sparse attention pooling

The authors discover through empirical analysis that facial action tokens have order-dependent semantics and can be organized as one-dimensional sequences. They leverage this insight by using a Transformer with sparse attention pooling to selectively capture discriminative action cues for micro-expression recognition.

Contribution

Emotion-description-guided CLIP alignment (EDCLIP) for bridging action tokens with emotions

The authors introduce EDCLIP, an alignment mechanism that uses textual prompts combining emotion names with facial action descriptions as semantic anchors. This approach explicitly connects learned action tokens to human-understandable emotions, while handling the ambiguous others category by pushing it away from emotion-specific textual embeddings rather than aligning it to an ill-defined description.