The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Class Incremental Learning, Continual Learning, Evaluation Protocol, Extreme Class Sequences
Abstract:

Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case–based Distribution & Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper critiques mainstream evaluation protocols in class incremental learning by arguing that random sampling of class sequences produces biased mean estimates and underestimates true performance variance. It resides in the 'Evaluation Protocol Design and Metrics' leaf alongside four sibling papers that similarly examine evaluation scenarios and metrics. This leaf is part of the broader 'Evaluation Methodology and Benchmarking Frameworks' branch, which contains ten papers total across two leaves. The concentration of work in this area suggests active interest in refining how CIL systems are assessed, though the specific focus on sequence-order effects and extreme-case analysis appears less crowded than general protocol design.

The taxonomy reveals neighboring work in 'Benchmark Datasets and Empirical Evaluations' that introduces standardized testbeds like CORe50 and vclimb Benchmark, while the 'Learning Approaches' branch houses algorithmic solutions addressing stability-plasticity trade-offs. The 'Theoretical Foundations' branch includes studies on catastrophic forgetting mechanisms and stability-plasticity analysis. The paper's emphasis on characterizing performance distributions across diverse class orderings connects to theoretical concerns about forgetting but diverges from purely algorithmic innovations. Its methodological critique bridges evaluation design and theoretical understanding of how task sequences influence learning dynamics.

Among the twenty candidates examined, none clearly refutes the three main contributions: seven candidates were compared against the analysis of random-sampling limitations, three against the extreme-sequences concept, and ten against the EDGE protocol, with no refutable match in any group. This suggests that, within the limited search scope (top-K semantic matches plus citation expansion), the specific combination of theoretical critique, extreme-sequence formalization, and guided search protocol appears distinctive. However, the modest search scale means potentially relevant prior work on class-ordering effects or variance estimation may exist beyond the examined candidates.

Based on the limited literature search, the work appears to occupy a relatively sparse niche within evaluation methodology, addressing sequence-order bias through a combination of theoretical analysis and protocol design. The absence of refutable candidates across twenty examined papers suggests novelty in the specific framing, though the search scope does not guarantee exhaustive coverage of related evaluation critiques or ordering-effect studies in continual learning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluation protocols for class incremental learning. The field organizes around four main branches that together capture how researchers design, implement, and analyze systems that learn new classes over time without forgetting old ones. Evaluation Methodology and Benchmarking Frameworks focuses on protocol design, metrics, and standardized testbeds such as CORe50[44] and the vCLIMB Benchmark[15], establishing how performance should be measured across incremental tasks. Learning Approaches and Algorithmic Solutions encompasses the diverse algorithmic strategies, ranging from memory replay methods like GDumb[38] and RMM Memory Management[7] to representation-based techniques such as DER Expandable Representation[5], that tackle the stability-plasticity trade-off. Specialized Incremental Learning Settings addresses domain-specific challenges in federated scenarios (Federated Class Incremental[1]), few-shot regimes (MetaFSCIL[11]), and resource-constrained environments (TinyML Architectures[14]). Theoretical Foundations and Analysis provides the conceptual underpinnings, examining phenomena like catastrophic forgetting and the principles that guide effective continual learning.

Recent work has intensified debate over whether standard evaluation practices accurately reflect real-world performance, with some studies questioning class-ordering effects (Class Orderings[30]) and others exploring multi-phase task structures (Multi-phase Tasks[46]). The original paper, Lie of Average[0], sits squarely within the Evaluation Methodology branch alongside works like Three Types Incremental[2] and Three Scenarios Continual[4], which dissect different incremental learning scenarios. While Learning from Yourself[43] and Representation Learning Perspective[47] emphasize algorithmic innovations in representation quality, Lie of Average[0] critiques how averaging-based metrics may obscure important performance dynamics across tasks. This methodological focus contrasts with purely algorithmic contributions, positioning the work as a call for more nuanced evaluation standards that better capture the complexities of incremental learning trajectories.

Claimed Contributions

Theoretical and empirical analysis of Random Sampling protocol limitations

The authors provide theoretical proofs (Lemma 1, Theorem 1) and empirical evidence showing that the widely-used Random Sampling (RS) evaluation protocol in CIL produces biased estimates of mean performance and severely underestimates variance, failing to capture the true performance distribution across different class sequences.

7 retrieved papers
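The core claim here, that a handful of randomly sampled sequences cannot capture the full performance spread, can be illustrated with a toy simulation. The numbers below are synthetic (a skewed Beta distribution stands in for the unknown per-sequence accuracy distribution; they are not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the true per-sequence accuracy distribution over
# all class orderings (illustrative assumption, not the paper's data).
population = rng.beta(a=8.0, b=2.0, size=100_000) * 100.0

true_range = population.max() - population.min()

# Mainstream protocols report mean/variance over only a few random orderings.
n_runs, trials = 5, 10_000
sample_ranges = [
    np.ptp(rng.choice(population, size=n_runs, replace=False))  # max - min
    for _ in range(trials)
]

print(f"true performance range          : {true_range:.1f} points")
print(f"typical range seen with {n_runs} runs : {np.mean(sample_ranges):.1f} points")
```

In this sketch the spread observed across five runs typically covers well under half of the true range, which is the sense in which small random samples understate how differently a model can behave across class orderings.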
Concept of extreme sequences with theoretical justification

The authors introduce extreme sequences (hardest and easiest class orderings) as a key concept for CIL evaluation and provide theoretical analysis (Theorem 2) demonstrating that incorporating extreme sequences significantly reduces the sample size needed for accurate performance distribution estimation compared to uniform random sampling.

3 retrieved papers
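A rough numerical intuition for this claim: anchoring a small sample with the easiest and hardest sequences pins down the support of the performance distribution, which uniform random sampling of the same size almost never does. The sketch below uses a synthetic distribution and assumes, for illustration, that the extremes are found exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-sequence accuracy distribution (illustrative assumption).
population = rng.beta(a=8.0, b=2.0, size=100_000) * 100.0
lo, hi = population.min(), population.max()

def support_coverage(sample):
    """Fraction of the true performance range covered by a sample."""
    return (sample.max() - sample.min()) / (hi - lo)

n_runs = 5
random_only = rng.choice(population, size=n_runs, replace=False)

# Extreme-anchored budget of the same size: fewer random sequences, plus
# the (here, exactly known) easiest and hardest sequences.
with_extremes = np.concatenate([
    rng.choice(population, size=n_runs - 2, replace=False),
    [lo, hi],  # extreme sequences anchor the distribution's boundaries
])

print(f"coverage, random only  : {support_coverage(random_only):.2f}")
print(f"coverage, with extremes: {support_coverage(with_extremes):.2f}")
```

The same evaluation budget covers the full support once the extremes are included, matching the claimed sample-size reduction for estimating distributional boundaries.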
EDGE evaluation protocol

The authors propose EDGE, a novel evaluation framework that leverages inter-task similarity computed from CLIP-encoded class labels to adaptively generate extreme (easy and hard) class sequences, providing more accurate estimates of the true performance distribution than existing random sampling approaches.

10 retrieved papers
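A minimal sketch of the similarity-guided ordering idea behind EDGE. Random unit vectors stand in for CLIP text embeddings of class labels (the paper encodes the actual labels with CLIP, and its adaptive search is more involved than this single greedy pass); the label list and the greedy chaining rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for CLIP text embeddings of class labels.
labels = ["cat", "dog", "airplane", "truck", "oak tree", "rose"]
emb = rng.normal(size=(len(labels), 512))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize

sim = emb @ emb.T  # pairwise cosine similarity between class labels

def order_greedy(sim, start=0, easy=True):
    """Greedily order classes so each next class is the most (easy) or
    least (hard) similar to the previous one: a toy proxy for building
    extreme class sequences from inter-task similarity."""
    n = sim.shape[0]
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        last = order[-1]
        pick = (max if easy else min)(remaining, key=lambda j: sim[last, j])
        order.append(pick)
        remaining.remove(pick)
    return order

easy_seq = [labels[i] for i in order_greedy(sim, easy=True)]
hard_seq = [labels[i] for i in order_greedy(sim, easy=False)]
print("easy-leaning sequence:", easy_seq)
print("hard-leaning sequence:", hard_seq)
```

Greedy similarity chaining is just one cheap way to exploit the reported positive correlation between inter-task similarity and performance; both orderings are permutations of the full class set, so they remain valid CIL sequences.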
