Concepts in Motion: Temporal Bottlenecks for Interpretable Video Classification

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Concept Bottleneck Models, Computer Vision, Interpretability, Video Classification
Abstract:

Concept-based models such as Concept Bottleneck Models (CBMs) have driven substantial progress in improving interpretability for image classification by leveraging human-interpretable concepts. However, extending these models from static images to sequences of images, such as video data, introduces a significant challenge due to the temporal dependencies inherent in videos, which are essential for capturing actions and events. In this work, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-inspired architectural design that adapts the concept bottleneck framework for video classification and handles sequences of arbitrary length. Within the video domain, concepts refer to semantic entities such as objects, attributes, or higher-level components (e.g., “bow,” “mount,” “shoot”) that recur across time—forming motifs that collectively describe and explain actions. Our design explicitly enables three complementary perspectives: global concept importance across the entire video, local concept relevance within specific windows, and temporal dependencies of a concept over time. Our results demonstrate that the concept-based modeling paradigm can be effectively transferred to video data, enabling a better understanding of concept contributions in temporal contexts while maintaining competitive performance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MoTIF, a transformer-inspired architecture that extends concept bottleneck models from static images to video sequences. It resides in the 'Transformer-Based Temporal Concept Frameworks' leaf, of which it is currently the sole member. This placement reflects a relatively sparse research direction within the broader taxonomy of interpretable video classification, suggesting the paper addresses a gap where transformer mechanisms are explicitly combined with concept bottlenecks for temporal reasoning. The framework handles arbitrary-length sequences and models concepts as semantic entities that recur across time, forming 'motifs' that collectively explain actions.

The taxonomy reveals neighboring directions that tackle related but distinct challenges. The sibling leaves 'Disentangled Motion and Context Concepts' and 'Pose-Based Concept Bottlenecks for Action Recognition' focus on specialized concept types (motion-context separation, human pose) rather than general transformer-based temporal aggregation. Nearby branches include 'Automatic Concept Discovery' methods that extract concepts without manual annotation, and 'Post-Hoc' approaches that retrofit interpretability onto pre-trained models. MoTIF's transformer-based temporal modeling distinguishes it from these alternatives by integrating concept reasoning directly into the sequential architecture rather than treating video as static frames or relying on post-hoc extraction.

Among the twenty candidates examined, none clearly refute the three core contributions. The MoTIF framework itself was assessed against six candidates with zero refutable overlaps, the per-channel temporal self-attention mechanism against four candidates with no prior work identified, and the three complementary explanation modes against ten candidates with no refutations found. This limited search scope—covering top semantic matches and citation expansion—suggests that within the examined literature, the specific combination of transformer-based temporal concept bottlenecks with multi-perspective explanations appears novel. However, the analysis does not claim exhaustive coverage of all possible prior work in video understanding or concept-based models.

Based on the top-twenty semantic matches examined, the work appears to occupy a distinct position combining transformer temporal modeling with concept bottlenecks for video. The sparse population of its taxonomy leaf and absence of refutable prior work among examined candidates suggest novelty, though the limited search scope means undiscovered related efforts may exist. The framework's integration of global, local, and temporal concept perspectives differentiates it from post-hoc or static image-based approaches within the surveyed literature.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: interpretable video classification using concept bottleneck models. The field organizes around several complementary directions that address how to build, discover, and apply human-understandable intermediate representations for video understanding. Temporal Concept Bottleneck Architectures focus on capturing dynamic patterns across frames, often leveraging transformer mechanisms to model temporal dependencies among concepts. Automatic Concept Discovery and Extraction methods aim to identify meaningful concepts without exhaustive manual annotation, while Post-Hoc and Lightweight approaches retrofit interpretability onto pre-trained models with minimal overhead. Interactive and Intervention-Enabled frameworks allow users to correct or guide concept predictions at test time, and Concept Dependency modeling explores how concepts relate to one another. Zero-Shot and Transfer Learning branches investigate generalization to new classes or domains via concept-based reasoning, Domain-Specific Applications tailor concept bottlenecks to specialized settings such as medical or affective video analysis, and Model Compression techniques distill concept-based knowledge into efficient architectures. Attribution and Explainability methods provide fine-grained spatial or temporal explanations beyond concept labels alone.

Recent work reveals a tension between fully supervised concept annotation and scalable automatic discovery. Some studies pursue post-hoc extraction from black-box models (Post-hoc Concepts[16], Post-hoc Stochastic Concepts[7]) to avoid retraining, while others integrate concept learning end-to-end with temporal architectures (Video Concept Extraction[17], Disentangled Action Concepts[20]). Interactive frameworks (Interactive Concept Bottlenecks[4]) enable human-in-the-loop refinement, contrasting with zero-shot approaches that rely on vision-language alignment (Zero-shot Concept Bottlenecks[1], Open Vocabulary Concepts[11]).
Temporal Bottlenecks[0] sits within the transformer-based temporal branch, emphasizing how to propagate and aggregate concept representations over time. Compared to post-hoc methods like DeCoDe[3] or lightweight retrofitting schemes, Temporal Bottlenecks[0] likely adopts a more integrated temporal modeling strategy, aligning closely with works that treat video sequences as first-class citizens rather than static image collections. This positions it among efforts that balance interpretability with the unique challenges of dynamic visual content.

Claimed Contributions

MoTIF framework for video classification with concept bottlenecks

The authors propose MoTIF, a concept bottleneck architecture designed specifically for video data, capable of processing variable-length sequences. Unlike prior CBMs limited to static images, MoTIF extends the bottleneck principle to temporal sequences using transformer-inspired blocks.

6 retrieved papers
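The bottleneck principle described above can be illustrated with a minimal sketch: per-frame features are projected into concept scores, aggregated over time, and only those concept scores feed the classifier. All dimensions and weight names here are illustrative assumptions, not the paper's actual implementation; a simple mean pool stands in for MoTIF's transformer-inspired temporal blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): D frame features,
# C interpretable concepts, K classes.
D, C, K = 8, 5, 3

# The "bottleneck": class logits depend on concept scores only.
W_concept = rng.normal(size=(D, C))  # frame features -> concepts
W_head = rng.normal(size=(C, K))     # concepts -> classes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_video(frames):
    """frames: (T, D) array with arbitrary T (variable-length video)."""
    concept_scores = sigmoid(frames @ W_concept)  # (T, C) per-frame concepts
    pooled = concept_scores.mean(axis=0)          # (C,) temporal aggregation
    logits = pooled @ W_head                      # (K,) classes from concepts only
    return concept_scores, logits

# The same model handles any sequence length.
short = rng.normal(size=(4, D))
long_ = rng.normal(size=(19, D))
_, logits_a = classify_video(short)
_, logits_b = classify_video(long_)
print(logits_a.shape, logits_b.shape)  # (3,) (3,)
```

Because predictions are a function of the pooled concept vector alone, inspecting or intervening on `concept_scores` directly changes the classifier's input, which is what makes the bottleneck interpretable.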
Per-channel temporal self-attention mechanism

The authors introduce a diagonal attention mechanism using depthwise 1×1 convolutions that processes each concept channel independently across time. This design prevents cross-concept mixing while enabling temporal reasoning, maintaining interpretability throughout the model.

4 retrieved papers
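The key property of this diagonal attention is that each concept channel attends over time in isolation. A depthwise 1×1 convolution reduces to one scalar weight per channel, so the query/key/value projections cannot mix concepts. The sketch below is an assumed minimal NumPy rendering of that idea, not the authors' code; sizes and weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, C = 6, 4  # time steps, concept channels (illustrative sizes)

# Depthwise 1x1 "convolutions": one scalar weight per concept channel,
# so queries/keys/values never mix information across concepts.
wq, wk, wv = (rng.normal(size=C) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def diagonal_attention(x):
    """x: (T, C) concept activations over time."""
    q, k, v = x * wq, x * wk, x * wv  # per-channel scaling only
    out = np.empty_like(x)
    attn = np.empty((C, T, T))
    for c in range(C):  # each concept attends over time independently
        scores = np.outer(q[:, c], k[:, c])  # (T, T) temporal affinities
        attn[c] = softmax(scores, axis=-1)
        out[:, c] = attn[c] @ v[:, c]
    return out, attn

x = rng.normal(size=(T, C))
out, attn = diagonal_attention(x)
```

Perturbing one concept channel leaves every other channel's output unchanged, which is exactly the no-cross-concept-mixing guarantee the contribution claims; the `attn` maps also double as the temporal-dependency explanations discussed later.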
Three complementary explanation modes

The framework provides three distinct interpretability views: global concept importance aggregated over entire videos, local concept activations in specific temporal windows, and attention-based temporal dependency maps showing how concepts relate across time.

10 retrieved papers
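Given per-frame concept activations, a linear head, and per-concept attention maps, all three views can be read off directly. The sketch below shows one plausible way to compute them; the aggregation choices (mean pooling, activation-times-weight importance) are assumptions for illustration rather than the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, C, K = 8, 4, 3  # illustrative: time steps, concepts, classes

acts = rng.random(size=(T, C))    # concept activations over time
W_head = rng.normal(size=(C, K))  # linear head: classes from concepts
attn = rng.random(size=(C, T, T)) # per-concept temporal attention maps
attn /= attn.sum(axis=-1, keepdims=True)

predicted = 1  # hypothetical predicted class index

# 1) Global: each concept's contribution to the predicted class,
#    aggregated over the entire video.
global_importance = acts.mean(axis=0) * W_head[:, predicted]  # (C,)

# 2) Local: concept relevance within a specific temporal window.
win_start, win_len = 2, 3
local_relevance = acts[win_start:win_start + win_len].mean(axis=0)  # (C,)

# 3) Temporal: which time steps a given concept attends to.
temporal_map = attn[0]  # (T, T) dependency map for concept 0

print(global_importance.shape, local_relevance.shape, temporal_map.shape)
```

The three views are complementary: the global vector ranks concepts for the whole clip, the local vector localizes them to windows, and each row of a temporal map shows where in the clip that concept draws evidence from.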

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MoTIF framework for video classification with concept bottlenecks

Contribution

Per-channel temporal self-attention mechanism

Contribution

Three complementary explanation modes