REALIGN: Regularized Procedure Alignment with Matching Video Embeddings via Partial Gromov-Wasserstein Optimal Transport

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Optimal Transport, Procedure learning, Egocentric vision, EgoProceL, Fused Partial GWOT
Abstract:

Learning from procedural videos remains a core challenge in self-supervised representation learning, as real-world instructional data often contains background segments, repeated actions, and steps presented out of order. Such variability violates the strong monotonicity assumptions underlying many alignment methods. Prior state-of-the-art approaches, such as OPEL and RGWOT, leverage Kantorovich Optimal Transport (KOT) and Gromov-Wasserstein Optimal Transport (GWOT) to build frame-to-frame correspondences but operate only on local feature similarity and pairwise relational structure, without explicit temporal priors, which limits their ability to capture the higher-order temporal structure of a task. In this paper, we introduce REALIGN, an unsupervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport (R-FPGWOT). In contrast to RGWOT, our formulation jointly models visual correspondences and temporal relations under a partial alignment scheme, enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos. To stabilize training, we integrate FPGWOT distances with inter-sequence contrastive learning, avoiding the need for multiple regularizers and preventing collapse to degenerate solutions. Across egocentric (EgoProceL) and third-person (ProceL, CrossTask) benchmarks, REALIGN achieves up to 18.9% (7.62pp) average F1-score improvements and over 30% (7.74pp) temporal IoU gains, while producing more interpretable transport maps that preserve key-step orderings and filter out noise.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces REALIGN, a framework based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport for unsupervised procedure learning from instructional videos. It resides in the Optimal Transport-Based Alignment leaf, which contains only two papers including this one. This leaf sits within the broader Temporal Alignment and Correspondence Learning branch, which encompasses three distinct methodological approaches. The sparse population of this specific leaf suggests that optimal transport formulations for procedural alignment remain relatively underexplored compared to embedding-based or cross-modal methods.

The taxonomy reveals that Temporal Alignment and Correspondence Learning is one of eight major research directions in the field. Neighboring branches include Step Discovery and Segmentation, which focuses on identifying action boundaries without alignment, and Procedure Representation and Task Modeling, which builds structured task graphs. The scope note for Optimal Transport-Based Alignment explicitly excludes contrastive or embedding methods, positioning REALIGN within a methodologically distinct subfield. The broader Temporal Alignment branch contains eleven papers across three leaves, indicating moderate activity in correspondence learning overall but concentration in embedding-based approaches rather than transport-based formulations.

Among ten candidates examined, the core REALIGN framework contribution shows one refutable candidate from six examined, while the unified alignment loss contribution also identifies one refutable candidate from four examined. The partial alignment scheme contribution was not tested against any candidates. The limited search scope—ten total candidates rather than an exhaustive survey—means these statistics reflect only top-K semantic matches and immediate citations. The presence of refutable candidates for two of three contributions suggests some overlap with prior work in the examined sample, though the scale of examination leaves substantial uncertainty about the broader literature landscape.

Given the sparse population of the Optimal Transport-Based Alignment leaf and the limited search scope, the analysis captures a narrow slice of potentially relevant work. The taxonomy structure indicates this is a methodologically specialized area within a diverse field, but the ten-candidate examination cannot definitively characterize novelty relative to the full literature. The refutable candidates identified represent overlaps within the examined sample, not comprehensive prior art assessment.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 10
Refutable Papers: 2

Research Landscape Overview

Core task: unsupervised procedure learning from instructional videos. The field aims to extract structured procedural knowledge—such as step sequences, temporal boundaries, and task dependencies—from large collections of unlabeled how-to videos. The taxonomy reflects a diverse landscape organized around several complementary challenges. Temporal Alignment and Correspondence Learning focuses on matching video segments across demonstrations, often using techniques like optimal transport to align steps without explicit labels. Step Discovery and Segmentation addresses the problem of identifying meaningful action boundaries and clustering them into coherent steps, as seen in works like Action Discovery[4] and Action Segmentation[5]. Procedure Representation and Task Modeling emphasizes building graph-based or hierarchical structures that capture dependencies and ordering constraints, with approaches such as Task Graphs[12] and Differentiable Task Graph[17]. Pretraining and Representation Learning explores self-supervised objectives that yield embeddings sensitive to procedural structure, while Procedure Planning and Goal-Directed Reasoning targets the synthesis of step sequences for novel goals. Weakly Supervised and Cross-Task Learning leverages partial annotations or transfers knowledge across related tasks, and Specialized Procedure Understanding Tasks tackle domain-specific challenges like error detection or localized instruction generation. Auxiliary Methods and Resources provide datasets and supporting techniques, and Automatic Procedure Learning from Web Videos scales these ideas to noisy, in-the-wild data.

Several active lines of work highlight key trade-offs and open questions. One thread pursues robust temporal alignment methods that can handle high variability across demonstrations, balancing computational efficiency with alignment quality. Another explores how to discover and segment steps in a fully unsupervised manner, often debating whether to rely on clustering in learned feature spaces or to impose stronger structural priors. Graph-based representations have gained traction for capturing task dependencies, yet questions remain about how to learn these graphs from raw video without ground-truth annotations.

REALIGN[0] sits within the Temporal Alignment and Correspondence Learning branch, specifically under Optimal Transport-Based Alignment, and shares methodological kinship with Dynamic Summarization[7], which also addresses correspondence across video segments. Compared to earlier alignment work like Narrated Instruction Learning[2] or Automatic Procedure Learning[3], REALIGN[0] emphasizes principled transport-based matching to handle diverse procedural variations, positioning it as a recent refinement in the ongoing effort to align instructional content at scale.

Claimed Contributions

REALIGN framework based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport

The authors propose REALIGN, a novel unsupervised procedure learning framework that extends Fused Gromov-Wasserstein Optimal Transport with partial alignment constraints. This formulation jointly models visual correspondences and temporal relations while enabling robust handling of irrelevant frames, repeated actions, and non-monotonic step orders common in instructional videos.
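
For orientation, a generic fused partial Gromov-Wasserstein problem from the optimal transport literature can be written as below. This is a sketch of the general problem class, not necessarily the authors' exact R-FPGWOT objective. The symbols are notational assumptions introduced here: $C$ is the cross-video feature cost, $D^{X}$ and $D^{Y}$ are intra-video (for example temporal) distance matrices, $\alpha \in [0,1]$ trades off the two terms, $p$ and $q$ are frame weights, and $m$ is the amount of mass that must be transported.

```latex
\min_{T \in \Pi_m(p, q)} \;
  (1-\alpha) \sum_{i,j} C_{ij}\, T_{ij}
  \;+\; \alpha \sum_{i,j,k,l} \bigl| D^{X}_{ik} - D^{Y}_{jl} \bigr|^{2}\, T_{ij}\, T_{kl},
\qquad
\Pi_m(p, q) = \bigl\{\, T \ge 0 \;:\; T\mathbf{1} \le p,\;\;
  T^{\top}\mathbf{1} \le q,\;\; \mathbf{1}^{\top} T\, \mathbf{1} = m \,\bigr\},
\qquad 0 \le m \le \min\bigl(\lVert p \rVert_1, \lVert q \rVert_1\bigr)
```

The first term rewards matching visually similar frames, and the second rewards plans that preserve pairwise structure within each video. Choosing $m = \lVert p \rVert_1 = \lVert q \rVert_1$ recovers the balanced fused problem, while a smaller $m$ leaves some mass unmatched, which is what allows irrelevant frames to stay unaligned.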

6 retrieved papers · Can Refute

Partial alignment scheme with virtual sink node for handling background and redundant frames

The method introduces a partial transport formulation that relaxes balanced marginal constraints by incorporating a virtual sink node. This allows irrelevant or background frames to be mapped to a null mass instead of being forced into spurious correspondences, addressing a key limitation of prior fully balanced optimal transport methods.
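
The sink-node idea described above can be illustrated with a small NumPy sketch. It uses the standard trick of augmenting a balanced problem with one dummy row and one dummy column and solves it with plain Sinkhorn iterations. This is an assumption-laden illustration, not the paper's solver: the function name `partial_ot_with_sink`, the entropic regularisation `eps`, the iteration count, and the toy one-dimensional "frame features" are all hypothetical.

```python
import numpy as np

def partial_ot_with_sink(cost, p, q, mass, eps=0.05, n_iter=5000):
    """Entropic partial OT via a virtual sink row/column (illustrative sketch)."""
    n, m = cost.shape
    K = np.zeros((n + 1, m + 1))
    # Kernel for real frame pairs; costs are rescaled to [0, 1] for stability.
    K[:n, :m] = np.exp(-cost / cost.max() / eps)
    # Sending mass to the sink is free (cost 0, so kernel 1) ...
    K[:n, m] = 1.0
    K[n, :m] = 1.0
    # ... but sink-to-sink transport is forbidden (infinite cost, so kernel 0).
    K[n, m] = 0.0
    # The sink marginals absorb whatever mass is left unmatched.
    p_aug = np.append(p, q.sum() - mass)
    q_aug = np.append(q, p.sum() - mass)
    u = np.ones(n + 1)
    v = np.ones(m + 1)
    for _ in range(n_iter):
        u = p_aug / (K @ v)      # rescale rows to match p_aug
        v = q_aug / (K.T @ u)    # rescale columns to match q_aug
    T_aug = u[:, None] * K * v[None, :]
    # Return the plan between real frames and the mass each row sends to the sink.
    return T_aug[:n, :m], T_aug[:n, m]

# Toy example: the last frame of the first video is far from everything else
# and plays the role of a background frame.
a = np.array([0.0, 1.0, 2.0, 9.0])
b = np.array([0.1, 1.1, 2.1])
cost = (a[:, None] - b[None, :]) ** 2
p = np.full(4, 0.25)
q = np.full(3, 1 / 3)
plan, to_sink = partial_ot_with_sink(cost, p, q, mass=0.75)
print(plan.sum(axis=1))  # matched mass per frame of the first video
print(to_sink)           # mass each frame sends to the sink
```

In this toy setup the outlier frame sends almost all of its mass to the sink rather than to a real frame, which is the behaviour the description attributes to the partial formulation. An exact linear-programming solver would give a sharper plan than the entropic approximation used here.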

0 retrieved papers

Unified alignment loss integrating temporal priors and contrastive regularization

The authors develop a unified loss function that combines FPGWOT distances with Laplace-shaped temporal priors, structural regularization, and inter-sequence contrastive learning. This integration stabilizes training by avoiding degenerate solutions and preventing collapse to trivial mappings without requiring multiple separate regularizers.
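
As a rough illustration of what a Laplace-shaped temporal prior could look like, the sketch below builds a prior over frame pairs that decays exponentially with the gap between their normalised timestamps. The report does not give the paper's exact parameterisation, so the function name `laplace_temporal_prior`, the `scale` parameter, and the row normalisation are assumptions made for this example only.

```python
import numpy as np

def laplace_temporal_prior(n, m, scale=0.1):
    """Row-normalised Laplace-shaped prior over frame pairs (i, j).

    Hypothetical form: weight proportional to exp(-|t_i - t_j| / scale),
    where t_i and t_j are frame times normalised to the interval (0, 1).
    """
    ti = (np.arange(n) + 0.5) / n   # normalised time of frame i in the first video
    tj = (np.arange(m) + 0.5) / m   # normalised time of frame j in the second video
    gap = np.abs(ti[:, None] - tj[None, :])
    prior = np.exp(-gap / scale)
    # Normalise each row so it is a distribution over frames of the second video.
    return prior / prior.sum(axis=1, keepdims=True)

prior = laplace_temporal_prior(5, 5)
print(prior.round(3))
```

A prior of this kind concentrates weight near the temporal diagonal while its heavier-than-Gaussian tails still leave room for off-diagonal matches, which is one plausible way to encourage, rather than enforce, temporally consistent alignments. How such a prior is combined with the transport plan in the actual loss is not specified in this report.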

4 retrieved papers · Can Refute

