SlotFM: A Motion Foundation Model with Slot Attention for Diverse Downstream Tasks

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: accelerometer, IMU, foundation models, self-supervised learning, time-series, slot attention
Abstract:

Wearable accelerometers are used for a wide range of applications, such as gesture recognition, gait analysis, and sports monitoring. Yet most existing foundation models focus primarily on classifying common daily activities such as locomotion and exercise, limiting their applicability to the broader range of tasks that rely on other signal characteristics. We present SlotFM, an accelerometer foundation model that generalizes across diverse downstream tasks. SlotFM uses Time-Frequency Slot Attention, an extension of Slot Attention that processes both time and frequency representations of the raw signals. It generates multiple small embeddings (slots), each capturing different signal components, enabling task-specific heads to focus on the most relevant parts of the data. We also introduce two loss regularizers that capture local structure and frequency patterns, which improve reconstruction of fine-grained details and help the embeddings preserve task-relevant information. We evaluate SlotFM on 16 classification and regression downstream tasks that extend beyond standard human activity recognition. It outperforms existing self-supervised approaches on 13 of these tasks and achieves results comparable to the best-performing approaches on the remaining tasks. On average, our method yields a 4.5% performance gain, demonstrating strong generalization for sensing foundation models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SlotFM introduces a foundation model for accelerometer data using Time-Frequency Slot Attention to generate multiple small embeddings capturing different signal components. The paper sits in the General Self-Supervised Pretraining leaf, which contains only three papers total including SlotFM itself. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that self-supervised foundation models for accelerometer data remain an emerging area compared to more established branches like supervised classification or transfer learning.

The taxonomy reveals that SlotFM's closest neighbors are other self-supervised approaches within the same parent branch, while adjacent leaves focus on physiological signal coupling or unsupervised clustering. The broader Transfer Learning and Domain Adaptation branch (containing 10 papers across four leaves) addresses generalization through explicit domain alignment strategies, whereas SlotFM pursues generalization through task-agnostic pretraining. The Supervised Feature Learning branch (11 papers) represents the traditional paradigm of end-to-end architectures for labeled data, from which SlotFM diverges by learning representations without activity labels.

Among 20 candidates examined across three contributions, no clearly refuting prior work was identified. The Time-Frequency Slot Attention mechanism was assessed against 8 candidates, the two loss regularizers against 2, and the foundation model benchmark against 10, with no refutations found in any case. Within this limited scope, none of the top-20 semantically similar papers appears to overlap substantially with SlotFM's specific technical approach; however, the small candidate pool means the analysis cannot rule out relevant prior work beyond these top matches.

Based on the limited literature search of 20 candidates, SlotFM appears to occupy a relatively novel position combining slot-based attention with time-frequency processing for accelerometer foundation models. The sparse population of its taxonomy leaf and absence of refuting candidates within the examined scope suggest distinctiveness, though the analysis does not cover the full breadth of self-supervised learning or attention mechanism literature beyond accelerometer-specific applications.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Learning generalizable accelerometer representations across diverse motion tasks. The field organizes around six main branches that reflect different methodological emphases and application contexts. Self-Supervised and Unsupervised Representation Learning explores pretraining strategies that do not require labeled data, enabling models to capture general motion patterns before fine-tuning on specific tasks. Transfer Learning and Domain Adaptation focuses on adapting models trained in one setting to new domains, devices, or user populations, addressing the challenge of distribution shift across datasets. Supervised Feature Learning and Classification develops end-to-end architectures and handcrafted features for labeled activity data, while Application-Specific Activity Recognition targets specialized domains such as clinical monitoring, animal behavior, or industrial settings. Signal Processing and Sensor Methodology examines foundational issues like sensor placement, orientation estimation, and signal preprocessing. Finally, Datasets, Benchmarks, and Methodological Reviews provide the empirical infrastructure and comparative analyses that guide the field's progress.

Recent work highlights a tension between domain-specific tuning and broadly generalizable representations. Many studies in transfer learning, such as CrossHAR[3] and Cross-domain HAR[2], tackle cross-dataset or cross-device generalization by aligning feature distributions or leveraging domain adaptation techniques. Meanwhile, self-supervised pretraining approaches like Self-supervised IMU[47] and Accelerometer Foundation Models[11] aim to learn universal motion embeddings that transfer widely without extensive labeled data. SlotFM[0] sits within the General Self-Supervised Pretraining cluster, emphasizing the discovery of reusable motion primitives through unsupervised methods.
Compared to CrossHAR[3], which explicitly addresses domain shift via adversarial or alignment strategies, SlotFM[0] focuses on learning compositional representations that generalize by capturing fundamental motion structures. This contrast underscores an open question: whether generalization is best achieved through explicit domain adaptation or through richer, task-agnostic pretraining that naturally transfers across contexts.

Claimed Contributions

Time-Frequency Slot Attention for accelerometer signals

The authors introduce an extension of Slot Attention that processes accelerometer data in both time and frequency domains. It generates multiple slot vectors that each capture different signal components, enabling task-specific heads to focus on relevant features across diverse downstream tasks.

8 retrieved papers
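The paper's exact architecture is not reproduced in this report. As a rough illustration of the idea being claimed, a minimal Slot Attention pass over concatenated time-domain and frequency-domain tokens might look like the sketch below; the tokenization scheme, shapes, slot count, and random stand-in projections are all assumptions, not SlotFM's actual design.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, num_slots=4, iters=3, seed=0):
    """Minimal Slot Attention pass: slots compete for input tokens
    (softmax over the slot axis), then update as attention-weighted means.
    Random matrices stand in for the learned q/k/v projections."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    k, v = tokens @ Wk, tokens @ Wv
    for _ in range(iters):
        q = slots @ Wq
        attn = softmax(q @ k.T / np.sqrt(d), axis=0)   # compete over slots
        attn = attn / attn.sum(axis=1, keepdims=True)  # normalize per slot
        slots = attn @ v                               # weighted-mean update
    return slots

# Hypothetical tokenization of one accelerometer axis in both domains.
t = np.linspace(0, 2, 256, endpoint=False)
sig = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.default_rng(1).normal(size=256)
time_tokens = sig.reshape(16, 16)                            # 16 time patches
freq_tokens = np.abs(np.fft.rfft(sig))[:128].reshape(8, 16)  # 8 spectrum patches
slots = slot_attention(np.concatenate([time_tokens, freq_tokens]), num_slots=4)
print(slots.shape)  # (4, 16): one small embedding per slot
```

Because the softmax normalizes over slots rather than over tokens, the slots divide the input among themselves, which is the property that lets each slot specialize in a different signal component.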
Two loss regularizers for improved signal reconstruction

The authors introduce SSIM (Structural Similarity Index Measure) and MS-STFT (Multi-Scale Short-Term Fourier Transform) as loss regularizers. These losses encourage the model to preserve structural patterns and high-frequency details in the signal reconstruction, improving downstream task performance.

2 retrieved papers
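The exact loss formulations are likewise not included in this report. As an illustrative sketch only, a multi-scale STFT magnitude loss and a global 1-D SSIM term could be written as follows; the window, hop sizes, FFT scales, and stability constants are assumptions chosen for the example, not the paper's settings.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitudes of a Hann-windowed short-time Fourier transform."""
    frames = np.stack([x[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def ms_stft_loss(x, y, fft_sizes=(64, 128, 256)):
    """L1 distance between STFT magnitudes at several resolutions,
    penalizing reconstructions y that lose high-frequency detail."""
    return sum(np.mean(np.abs(stft_mag(x, n, n // 4) - stft_mag(y, n, n // 4)))
               for n in fft_sizes)

def ssim_1d(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM between two 1-D signals (single-window variant),
    rewarding preserved local structure in the reconstruction."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (x.var() + y.var() + c2))

x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 512, endpoint=False))
assert ms_stft_loss(x, x) == 0.0         # perfect reconstruction -> zero loss
assert abs(ssim_1d(x, x) - 1.0) < 1e-9   # identical signals -> SSIM of 1
```

In a reconstruction objective, terms like these would be added as regularizers to a primary loss (e.g. mean squared error), trading off global fit against structural and spectral fidelity.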
SlotFM foundation model and diverse downstream benchmark

The authors train and release SlotFM, an accelerometer foundation model, and evaluate it on 16 classification and regression tasks spanning gestures, sports, cooking, and transportation. They also release code for model training and benchmark setup to support reproducibility.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Time-Frequency Slot Attention for accelerometer signals

The authors introduce an extension of Slot Attention that processes accelerometer data in both time and frequency domains. It generates multiple slot vectors that each capture different signal components, enabling task-specific heads to focus on relevant features across diverse downstream tasks.

Contribution

Two loss regularizers for improved signal reconstruction

The authors introduce SSIM (Structural Similarity Index Measure) and MS-STFT (Multi-Scale Short-Term Fourier Transform) as loss regularizers. These losses encourage the model to preserve structural patterns and high-frequency details in the signal reconstruction, improving downstream task performance.

Contribution

SlotFM foundation model and diverse downstream benchmark

The authors train and release SlotFM, an accelerometer foundation model, and evaluate it on 16 classification and regression tasks spanning gestures, sports, cooking, and transportation. They also release code for model training and benchmark setup to support reproducibility.