LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Machine Learning, Android Malware, Concept Drift Analysis, Explainability, Dataset Benchmark
Abstract:

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection.

To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LAMDA, a large-scale longitudinal Android malware benchmark spanning 12 years and over 1 million samples, designed to facilitate systematic concept drift analysis. Within the taxonomy, it resides in the 'Temporal Evaluation and Benchmarking' leaf under 'Drift Detection and Characterization'. This leaf contains five papers total, indicating a moderately populated research direction. The sibling papers—Empirical Drift Evaluation, Temporal Inconsistency Revisited, Aurora, and one other—similarly focus on measuring model degradation over time, suggesting that temporal benchmarking is an established but not overcrowded subfield within concept drift research.

The taxonomy reveals that LAMDA's leaf sits within a broader branch dedicated to drift detection and characterization, which also includes 'Drift Detection Mechanisms' and 'Drift Cause Analysis'. Neighboring branches address adaptation strategies (active learning, incremental learning, retraining) and robust representation learning (invariant features, domain adaptation). The scope note for LAMDA's leaf explicitly excludes adaptation methods, clarifying that its contribution lies in providing evaluation infrastructure rather than proposing new model update techniques. This positioning suggests the work complements rather than competes with adaptation-focused research, offering a shared resource for testing drift mitigation approaches.

Among the three contributions analyzed, the dataset itself (Contribution A) was compared against 10 candidates and no refutable prior work was found, suggesting strong novelty in scale and temporal scope. However, for the empirical demonstration of concept drift (Contribution B), 6 of the 10 candidates examined were refutable matches, indicating that performance degradation under temporal shift is well documented in prior studies. The multi-faceted analysis framework (Contribution C) was compared against only 1 candidate with no refutation, though the limited search scope makes it difficult to assess its novelty conclusively. Overall, the dataset contribution appears more distinctive than the empirical findings, which align with established observations in the field.

Based on the limited search of 21 candidates, LAMDA's primary novelty lies in its dataset scale and temporal coverage rather than in demonstrating drift effects, which prior work has extensively characterized. The analysis does not cover exhaustive literature beyond top-K semantic matches, so additional related benchmarks or longitudinal studies may exist outside this scope. The contribution's value likely centers on enabling more rigorous comparative evaluations rather than introducing fundamentally new insights about concept drift mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 6

Research Landscape Overview

Core task: concept drift in Android malware detection over time. The field addresses how malware evolves continuously, causing trained detectors to degrade as new attack patterns emerge and old features become obsolete.

The taxonomy organizes research into seven main branches. Drift Detection and Characterization focuses on identifying when and how distribution shifts occur, often through temporal evaluation frameworks and empirical benchmarking studies. Drift Adaptation Strategies encompasses incremental learning, active learning, and domain adaptation methods that update models without full retraining. Robust Representation Learning seeks feature encodings that remain stable across time, while Feature Engineering and Selection refines input signals to minimize sensitivity to evolving malware tactics. Specialized Detection Contexts examines drift in particular settings such as privacy-preserving or federated environments. Malware Evolution and Ecosystem Analysis investigates the underlying causes of drift by studying how malware families, permissions, and app ecosystems change. Emerging Techniques and Future Directions explores novel paradigms including large language models and graph-based embeddings that may offer new resilience against temporal shifts.

Several active lines of work reveal key trade-offs between detection accuracy, adaptation speed, and computational overhead. Continuous learning approaches like Continuous Learning Android[4] and Temporal Incremental Learning[7] enable models to absorb new samples incrementally, yet they must balance plasticity against catastrophic forgetting. In contrast, works emphasizing temporal invariance such as Temporal Invariance Android[5] and TESSERACT[25] aim to learn representations that generalize across time windows, reducing the need for frequent updates but potentially sacrificing responsiveness to abrupt shifts.
LAMDA[0] sits within the Temporal Evaluation and Benchmarking cluster, alongside Empirical Drift Evaluation[23] and Temporal Inconsistency Revisited[24], providing rigorous experimental protocols to measure drift effects. Compared to Aurora[27], which also emphasizes systematic temporal assessment, LAMDA[0] offers a complementary perspective on how to structure longitudinal experiments and interpret performance degradation patterns, helping researchers understand whether observed drift stems from feature obsolescence, label noise, or adversarial evolution.
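To make the temporal evaluation protocol concrete, the following is a minimal sketch of a time-aware split in the spirit of the benchmarking papers discussed above: train only on samples dated before the test window, then score each later year separately to expose degradation. All data here is synthetic, and the drift mechanism (a class-separating feature that rotates each year) is an illustrative assumption, not LAMDA's actual feature schema.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def make_year(year, n=600, n_features=20):
    """Synthetic yearly batch whose class-separating direction drifts over time."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    shift = np.zeros(n_features)
    shift[(year - 2013) % n_features] = 2.0  # the informative feature moves each year
    X[y == 1] += shift
    return X, y

years = list(range(2013, 2020))
data = {yr: make_year(yr) for yr in years}

# Temporally consistent split: fit once on the earliest years only.
X_train = np.vstack([data[yr][0] for yr in years[:2]])
y_train = np.concatenate([data[yr][1] for yr in years[:2]])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score each later year separately; the F1 curve over years is the drift signal.
decay = {yr: f1_score(data[yr][1], clf.predict(data[yr][0])) for yr in years[2:]}
for yr, f1 in decay.items():
    print(yr, round(f1, 3))
```

The key design point is that no future sample ever reaches the training set, so the per-year scores reflect genuine temporal generalization rather than leakage.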

Claimed Contributions

LAMDA: A large-scale longitudinal Android malware benchmark dataset

The authors introduce LAMDA, a comprehensive Android malware dataset spanning 12 years (2013–2025, excluding 2015) with over 1 million samples covering 1,380 malware families and 150,000 singleton samples. The dataset is specifically structured to enable systematic evaluation of concept drift in malware detection systems.

Retrieved papers: 10

Empirical demonstration of concept drift and performance degradation

The authors conduct comprehensive empirical evaluations showing how machine learning-based malware detectors degrade over time due to concept drift. They analyze performance degradation patterns, feature stability, and temporal shifts using multiple evaluation methodologies including supervised learning experiments and distributional analysis.

Retrieved papers: 10
Verdict: Can Refute

Multi-faceted concept drift analysis framework

The authors develop and apply a comprehensive analytical framework for studying concept drift that includes multiple complementary techniques: per-feature distribution analysis, family-wise feature stability assessment, temporal label drift tracking, and SHAP-based explanation drift analysis to reveal how feature importance changes over time.

Retrieved papers: 1
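One of the techniques named in Contribution C, per-feature distribution analysis, can be sketched briefly: compare each feature's yearly histogram against a base year using Jensen-Shannon divergence, so that drifting features stand out from stable ones. The data below is a synthetic placeholder (one deliberately drifting feature, one stationary), not LAMDA's real features.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)

def feature_histogram(values, bins):
    """Normalized histogram of one feature, used as a discrete distribution."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

bins = np.linspace(-4, 8, 30)

# Feature 0 drifts upward over the years; feature 1 stays stationary.
years = {yr: np.column_stack([
    rng.normal(loc=0.5 * (yr - 2013), size=1000),  # drifting feature
    rng.normal(loc=0.0, size=1000),                # stable feature
]) for yr in range(2013, 2018)}

base = years[2013]
for yr in range(2014, 2018):
    divs = [jensenshannon(feature_histogram(base[:, j], bins),
                          feature_histogram(years[yr][:, j], bins))
            for j in range(2)]
    print(yr, [round(d, 3) for d in divs])
```

Features whose divergence grows year over year are candidates for obsolescence; the same per-year comparison pattern extends naturally to label proportions or SHAP attribution vectors, the other facets the contribution describes.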

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
