LAMDA: A Longitudinal Android Malware Benchmark for Concept Drift Analysis

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Machine Learning, Android Malware, Concept Drift Analysis, Explainability, Dataset Benchmark
Abstract:

Machine learning (ML)-based malware detection systems often fail to account for the dynamic nature of real-world training and test data distributions. In practice, these distributions evolve due to frequent changes in the Android ecosystem, adversarial development of new malware families, and the continuous emergence of both benign and malicious applications. Prior studies have shown that such concept drift (distributional shifts in benign and malicious samples) leads to significant degradation in detection performance over time. Despite the practical importance of this issue, existing datasets are often outdated and limited in temporal scope, diversity of malware families, and sample scale, making them insufficient for the systematic evaluation of concept drift in malware detection.

To address this gap, we present LAMDA, the largest and most temporally diverse Android malware benchmark to date, designed specifically for concept drift analysis. LAMDA spans 12 years (2013–2025, excluding 2015), includes over 1 million samples (approximately 37% labeled as malware), and covers 1,380 malware families and 150,000 singleton samples, reflecting the natural distribution and evolution of real-world Android applications. We empirically demonstrate LAMDA's utility by quantifying the performance degradation of standard ML models over time and analyzing feature stability across years. As the most comprehensive Android malware dataset to date, LAMDA enables in-depth research into temporal drift, generalization, explainability, and evolving detection challenges.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces LAMDA, a large-scale longitudinal Android malware benchmark spanning 12 years and over 1 million samples, designed to facilitate systematic concept drift analysis. Within the taxonomy, it resides in the 'Temporal Evaluation and Benchmarking' leaf under 'Drift Detection and Characterization'. This leaf contains five papers total, indicating a moderately populated research direction. The sibling papers—Empirical Drift Evaluation, Temporal Inconsistency Revisited, Aurora, and one other—similarly focus on measuring model degradation over time, suggesting that temporal benchmarking is an established but not overcrowded subfield within concept drift research.

The taxonomy reveals that LAMDA's leaf sits within a broader branch dedicated to drift detection and characterization, which also includes 'Drift Detection Mechanisms' and 'Drift Cause Analysis'. Neighboring branches address adaptation strategies (active learning, incremental learning, retraining) and robust representation learning (invariant features, domain adaptation). The scope note for LAMDA's leaf explicitly excludes adaptation methods, clarifying that its contribution lies in providing evaluation infrastructure rather than proposing new model update techniques. This positioning suggests the work complements rather than competes with adaptation-focused research, offering a shared resource for testing drift mitigation approaches.

Among the three contributions analyzed, the dataset itself (Contribution A) was compared against 10 candidates and no refutable prior work was found, suggesting strong novelty in scale and temporal scope. However, for the empirical demonstration of concept drift (Contribution B), 6 of the 10 candidates examined were refutable matches, indicating that performance degradation under temporal shift is well documented in prior studies. The multi-faceted analysis framework (Contribution C) was compared against only 1 candidate with no refutation, though the limited search scope makes it difficult to assess its novelty conclusively. Overall, the dataset contribution appears more distinctive than the empirical findings, which align with established observations in the field.

Based on the limited search of 21 candidates, LAMDA's primary novelty lies in its dataset scale and temporal coverage rather than in demonstrating drift effects, which prior work has extensively characterized. The analysis does not cover exhaustive literature beyond top-K semantic matches, so additional related benchmarks or longitudinal studies may exist outside this scope. The contribution's value likely centers on enabling more rigorous comparative evaluations rather than introducing fundamentally new insights about concept drift mechanisms.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 6

Research Landscape Overview

Core task: concept drift in Android malware detection over time. The field addresses how malware evolves continuously, causing trained detectors to degrade as new attack patterns emerge and old features become obsolete.

The taxonomy organizes research into seven main branches. Drift Detection and Characterization focuses on identifying when and how distribution shifts occur, often through temporal evaluation frameworks and empirical benchmarking studies. Drift Adaptation Strategies encompasses incremental learning, active learning, and domain adaptation methods that update models without full retraining. Robust Representation Learning seeks feature encodings that remain stable across time, while Feature Engineering and Selection refines input signals to minimize sensitivity to evolving malware tactics. Specialized Detection Contexts examines drift in particular settings such as privacy-preserving or federated environments. Malware Evolution and Ecosystem Analysis investigates the underlying causes of drift by studying how malware families, permissions, and app ecosystems change. Emerging Techniques and Future Directions explores novel paradigms including large language models and graph-based embeddings that may offer new resilience against temporal shifts.

Several active lines of work reveal key trade-offs between detection accuracy, adaptation speed, and computational overhead. Continuous learning approaches like Continuous Learning Android[4] and Temporal Incremental Learning[7] enable models to absorb new samples incrementally, yet they must balance plasticity against catastrophic forgetting. In contrast, works emphasizing temporal invariance such as Temporal Invariance Android[5] and TESSERACT[25] aim to learn representations that generalize across time windows, reducing the need for frequent updates but potentially sacrificing responsiveness to abrupt shifts.
LAMDA[0] sits within the Temporal Evaluation and Benchmarking cluster, alongside Empirical Drift Evaluation[23] and Temporal Inconsistency Revisited[24], providing rigorous experimental protocols to measure drift effects. Compared to Aurora[27], which also emphasizes systematic temporal assessment, LAMDA[0] offers a complementary perspective on how to structure longitudinal experiments and interpret performance degradation patterns, helping researchers understand whether observed drift stems from feature obsolescence, label noise, or adversarial evolution.
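To make the temporal evaluation protocol concrete, the following is a minimal sketch of a time-aware split in the spirit of the benchmarking papers discussed above: train only on samples dated before the test window, then score each later year separately to expose degradation. All data here is synthetic, and the drift mechanism (a class-separating feature that rotates each year) is an illustrative assumption, not LAMDA's actual feature schema.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def make_year(year, n=600, n_features=20):
    """Synthetic yearly batch whose class-separating direction drifts over time."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    shift = np.zeros(n_features)
    shift[(year - 2013) % n_features] = 2.0  # the informative feature moves each year
    X[y == 1] += shift
    return X, y

years = list(range(2013, 2020))
data = {yr: make_year(yr) for yr in years}

# Temporally consistent split: fit once on the earliest years only.
X_train = np.vstack([data[yr][0] for yr in years[:2]])
y_train = np.concatenate([data[yr][1] for yr in years[:2]])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score each later year separately; the F1 curve over years is the drift signal.
decay = {yr: f1_score(data[yr][1], clf.predict(data[yr][0])) for yr in years[2:]}
for yr, f1 in decay.items():
    print(yr, round(f1, 3))
```

The key design point is that no future sample ever reaches the training set, so the per-year scores reflect genuine temporal generalization rather than leakage.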

Claimed Contributions

LAMDA: A large-scale longitudinal Android malware benchmark dataset

The authors introduce LAMDA, a comprehensive Android malware dataset spanning 12 years (2013–2025, excluding 2015) with over 1 million samples covering 1,380 malware families and 150,000 singleton samples. The dataset is specifically structured to enable systematic evaluation of concept drift in malware detection systems.

Retrieved papers: 10

Empirical demonstration of concept drift and performance degradation

The authors conduct comprehensive empirical evaluations showing how machine learning-based malware detectors degrade over time due to concept drift. They analyze performance degradation patterns, feature stability, and temporal shifts using multiple evaluation methodologies including supervised learning experiments and distributional analysis.

Retrieved papers: 10
Verdict: Can Refute

Multi-faceted concept drift analysis framework

The authors develop and apply a comprehensive analytical framework for studying concept drift that includes multiple complementary techniques: per-feature distribution analysis, family-wise feature stability assessment, temporal label drift tracking, and SHAP-based explanation drift analysis to reveal how feature importance changes over time.

Retrieved papers: 1
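One of the techniques named in Contribution C, per-feature distribution analysis, can be sketched briefly: compare each feature's yearly histogram against a base year using Jensen-Shannon divergence, so that drifting features stand out from stable ones. The data below is a synthetic placeholder (one deliberately drifting feature, one stationary), not LAMDA's real features.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)

def feature_histogram(values, bins):
    """Normalized histogram of one feature, used as a discrete distribution."""
    counts, _ = np.histogram(values, bins=bins)
    return counts / counts.sum()

bins = np.linspace(-4, 8, 30)

# Feature 0 drifts upward over the years; feature 1 stays stationary.
years = {yr: np.column_stack([
    rng.normal(loc=0.5 * (yr - 2013), size=1000),  # drifting feature
    rng.normal(loc=0.0, size=1000),                # stable feature
]) for yr in range(2013, 2018)}

base = years[2013]
for yr in range(2014, 2018):
    divs = [jensenshannon(feature_histogram(base[:, j], bins),
                          feature_histogram(years[yr][:, j], bins))
            for j in range(2)]
    print(yr, [round(d, 3) for d in divs])
```

Features whose divergence grows year over year are candidates for obsolescence; the same per-year comparison pattern extends naturally to label proportions or SHAP attribution vectors, the other facets the contribution describes.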

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
