The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Class Incremental Learning, Continual Learning, Evaluation Protocol, Extreme Class Sequences
Abstract:

Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case–based Distribution & Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper critiques mainstream evaluation protocols in class incremental learning by arguing that random sampling of class sequences produces biased mean estimates and underestimates true performance variance. It resides in the 'Evaluation Protocol Design and Metrics' leaf alongside four sibling papers that similarly examine evaluation scenarios and metrics. This leaf is part of the broader 'Evaluation Methodology and Benchmarking Frameworks' branch, which contains ten papers total across two leaves. The concentration of work in this area suggests active interest in refining how CIL systems are assessed, though the specific focus on sequence-order effects and extreme-case analysis appears less crowded than general protocol design.

The taxonomy reveals neighboring work in 'Benchmark Datasets and Empirical Evaluations' that introduces standardized testbeds like CORe50 and vclimb Benchmark, while the 'Learning Approaches' branch houses algorithmic solutions addressing stability-plasticity trade-offs. The 'Theoretical Foundations' branch includes studies on catastrophic forgetting mechanisms and stability-plasticity analysis. The paper's emphasis on characterizing performance distributions across diverse class orderings connects to theoretical concerns about forgetting but diverges from purely algorithmic innovations. Its methodological critique bridges evaluation design and theoretical understanding of how task sequences influence learning dynamics.

Among the twenty candidates examined, none clearly refutes the three main contributions: seven candidates were compared against the analysis of random-sampling limitations, three against the extreme-sequences concept, and ten against the EDGE protocol, with no refutable match in any group. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of theoretical critique, extreme-sequence formalization, and guided search protocol appears distinctive. However, the modest search scale means potentially relevant prior work on class-ordering effects or variance estimation may exist beyond the examined candidates.

Based on the limited literature search, the work appears to occupy a relatively sparse niche within evaluation methodology, addressing sequence-order bias through a combination of theoretical analysis and protocol design. The absence of refutable candidates across twenty examined papers suggests novelty in the specific framing, though the search scope does not guarantee exhaustive coverage of related evaluation critiques or ordering-effect studies in continual learning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluation protocols for class incremental learning. The field organizes around four main branches that together capture how researchers design, implement, and analyze systems that learn new classes over time without forgetting old ones. Evaluation Methodology and Benchmarking Frameworks focuses on protocol design, metrics, and standardized testbeds such as CORe50[44] and the vCLIMB Benchmark[15], establishing how performance should be measured across incremental tasks. Learning Approaches and Algorithmic Solutions encompasses the diverse algorithmic strategies, ranging from memory replay methods like GDumb[38] and RMM Memory Management[7] to representation-based techniques such as DER Expandable Representation[5], that tackle the stability-plasticity trade-off. Specialized Incremental Learning Settings addresses domain-specific challenges in federated scenarios (Federated Class Incremental[1]), few-shot regimes (MetaFSCIL[11]), and resource-constrained environments (TinyML Architectures[14]). Theoretical Foundations and Analysis provides the conceptual underpinnings, examining phenomena like catastrophic forgetting and the principles that guide effective continual learning.

Recent work has intensified debate over whether standard evaluation practices accurately reflect real-world performance, with some studies questioning class-ordering effects (Class Orderings[30]) and others exploring multi-phase task structures (Multi-phase Tasks[46]). The original paper, Lie of Average[0], sits squarely within the Evaluation Methodology branch alongside works like Three Types Incremental[2] and Three Scenarios Continual[4], which dissect different incremental learning scenarios. While Learning from Yourself[43] and Representation Learning Perspective[47] emphasize algorithmic innovations in representation quality, Lie of Average[0] critiques how averaging-based metrics may obscure important performance dynamics across tasks. This methodological focus contrasts with purely algorithmic contributions, positioning the work as a call for more nuanced evaluation standards that better capture the complexities of incremental learning trajectories.

Claimed Contributions

Theoretical and empirical analysis of Random Sampling protocol limitations

The authors provide theoretical proofs (Lemma 1, Theorem 1) and empirical evidence showing that the widely-used Random Sampling (RS) evaluation protocol in CIL produces biased estimates of mean performance and severely underestimates variance, failing to capture the true performance distribution across different class sequences.

7 retrieved papers
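The core claim here, that a handful of randomly sampled sequences cannot capture the full performance spread, can be illustrated with a toy simulation. The numbers below are synthetic (a skewed Beta distribution stands in for the unknown per-sequence accuracy distribution; they are not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the true per-sequence accuracy distribution over
# all class orderings (illustrative assumption, not the paper's data).
population = rng.beta(a=8.0, b=2.0, size=100_000) * 100.0

true_range = population.max() - population.min()

# Mainstream protocols report mean/variance over only a few random orderings.
n_runs, trials = 5, 10_000
sample_ranges = [
    np.ptp(rng.choice(population, size=n_runs, replace=False))  # max - min
    for _ in range(trials)
]

print(f"true performance range          : {true_range:.1f} points")
print(f"typical range seen with {n_runs} runs : {np.mean(sample_ranges):.1f} points")
```

In this sketch the spread observed across five runs typically covers well under half of the true range, which is the sense in which small random samples understate how differently a model can behave across class orderings.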
Concept of extreme sequences with theoretical justification

The authors introduce extreme sequences (hardest and easiest class orderings) as a key concept for CIL evaluation and provide theoretical analysis (Theorem 2) demonstrating that incorporating extreme sequences significantly reduces the sample size needed for accurate performance distribution estimation compared to uniform random sampling.

3 retrieved papers
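A rough numerical intuition for this claim: anchoring a small sample with the easiest and hardest sequences pins down the support of the performance distribution, which uniform random sampling of the same size almost never does. The sketch below uses a synthetic distribution and assumes, for illustration, that the extremes are found exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-sequence accuracy distribution (illustrative assumption).
population = rng.beta(a=8.0, b=2.0, size=100_000) * 100.0
lo, hi = population.min(), population.max()

def support_coverage(sample):
    """Fraction of the true performance range covered by a sample."""
    return (sample.max() - sample.min()) / (hi - lo)

n_runs = 5
random_only = rng.choice(population, size=n_runs, replace=False)

# Extreme-anchored budget of the same size: fewer random sequences, plus
# the (here, exactly known) easiest and hardest sequences.
with_extremes = np.concatenate([
    rng.choice(population, size=n_runs - 2, replace=False),
    [lo, hi],  # extreme sequences anchor the distribution's boundaries
])

print(f"coverage, random only  : {support_coverage(random_only):.2f}")
print(f"coverage, with extremes: {support_coverage(with_extremes):.2f}")
```

The same evaluation budget covers the full support once the extremes are included, matching the claimed sample-size reduction for estimating distributional boundaries.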
EDGE evaluation protocol

The authors propose EDGE, a novel evaluation framework that leverages inter-task similarity computed from CLIP-encoded class labels to adaptively generate extreme (easy and hard) class sequences, providing more accurate estimates of the true performance distribution than existing random sampling approaches.

10 retrieved papers
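A minimal sketch of the similarity-guided ordering idea behind EDGE. Random unit vectors stand in for CLIP text embeddings of class labels (the paper encodes the actual labels with CLIP, and its adaptive search is more involved than this single greedy pass); the label list and the greedy chaining rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for CLIP text embeddings of class labels.
labels = ["cat", "dog", "airplane", "truck", "oak tree", "rose"]
emb = rng.normal(size=(len(labels), 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

sim = emb @ emb.T  # pairwise cosine similarity between class labels

def order_greedy(sim, start=0, easy=True):
    """Greedily order classes so each next class is the most (easy) or
    least (hard) similar to the previous one: a toy proxy for building
    extreme class sequences from inter-task similarity."""
    n = sim.shape[0]
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        last = order[-1]
        pick = (max if easy else min)(remaining, key=lambda j: sim[last, j])
        order.append(pick)
        remaining.remove(pick)
    return order

easy_seq = [labels[i] for i in order_greedy(sim, easy=True)]
hard_seq = [labels[i] for i in order_greedy(sim, easy=False)]
print("easy-leaning sequence:", easy_seq)
print("hard-leaning sequence:", hard_seq)
```

Greedy similarity chaining is just one cheap way to exploit the reported positive correlation between inter-task similarity and performance; both orderings are permutations of the full class set, so they remain valid CIL sequences.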
