Unlocking Full Efficiency of Token Filtering in Large Language Model Training

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Efficient LLM Training; Token Filtering
Abstract:

Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing work has inadequate sparsity for speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unleash the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency using standard ML libraries. Evaluations on models with various scales—from 1.1B to 40B—demonstrate that Centrifuge reduces backpropagation time by up to 49.9% and end-to-end training time by up to 34.7% when filtering 50% of tokens. Utility assessments indicate that Centrifuge preserves the utility benefits of token filtering and significantly enhances model performance by up to 26.6% compared to standard training. Centrifuge is designed for seamless integration into existing LLM training frameworks, enabling systems already utilizing token filtering to accelerate training with just one line of code.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Centrifuge, a system that combines algorithmic sparsity amplification in attention backward kernels with system-level transformations to achieve real-world training speedup through token filtering. It resides in the Token Pruning and Sparsification for Training leaf, which contains only three papers in total: the present paper and its two siblings, Collider Token Filtering and Dynamic Token Pruning. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that practical systems for training-time token pruning remain an emerging area compared to more crowded domains like data quality filtering or inference-time reduction.

The taxonomy reveals that token filtering for training efficiency sits alongside two sibling categories: Selective Token Training and Loss Weighting (four papers focusing on differential weighting rather than elimination) and Adaptive Token Selection Mechanisms (four papers emphasizing dynamic, context-driven selection). Neighboring branches address orthogonal concerns—Data Filtering Strategies operates at the document level before training begins, while Inference-Time Token Reduction targets deployed models. The scope note for this leaf explicitly excludes inference-only pruning and loss-based weighting, positioning Centrifuge's backward-kernel sparsification as distinct from gradient reweighting approaches and runtime compression techniques.

Among fourteen candidates examined, the algorithm-level contribution (sparsity amplification in the attention backward pass) was not refuted by the single candidate reviewed. The system-level transformation (sparse-to-dense GEMM conversion) encountered one refutable candidate among three examined, indicating some prior exploration of similar optimization strategies. The integrated Centrifuge system faced one refutable candidate among ten reviewed, suggesting that while individual components may overlap with existing work, the co-design approach combining both levels has limited direct precedent within the search scope. These statistics reflect a focused but not exhaustive literature review.

Given the limited search scale and the sparse population of the taxonomy leaf, the work appears to address a recognized gap—achieving measurable training speedup from token filtering—where prior methods have struggled with inadequate sparsity or library incompatibility. The analysis does not cover the full landscape of training optimization or all possible sparsification techniques, but within the examined scope, the integration of backward-kernel filtering with automatic GEMM transformation represents a relatively underexplored direction in training-time token pruning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 2

Research Landscape Overview

Core task: Efficient token filtering in large language model training. The field addresses how to reduce computational costs and improve data quality during LLM development by selectively processing or removing tokens.

The taxonomy reveals several major branches: Token Filtering and Selection for Training Efficiency focuses on methods that prune or weight tokens during the training phase itself, often through gradient-based or attention-driven selection (e.g., Collider Token Filtering[27], Dynamic Token Pruning[49]). Data Filtering Strategies for Pretraining Quality emphasizes curating high-quality pretraining corpora by filtering documents or sequences before training begins (e.g., Superfiltering[6], Ultra-fineweb[10]). Inference-Time Token Reduction for Multimodal and Long-Context Models targets runtime efficiency by compressing visual or textual tokens in deployed models (e.g., Prunevid[3], LLaMA-VID[2]). Additional branches cover Token-Level Mechanisms and Architectural Innovations (e.g., Thinking Tokens[28], Pause Tokens[25]), Tokenization and Token-Level Analysis (e.g., Tokenizer Choice[44]), and Watermarking and Security (e.g., Watermark LLMs[1]).

Within the training efficiency branch, a central tension emerges between static data filtering approaches that preselect valuable examples (e.g., Data Pruning Pretraining[16], Fishing for Magikarp[20]) and dynamic token-level pruning that adapts during training (e.g., Tokenselect[39], Token Weighting[48]). Token Filtering Efficiency[0] sits squarely in the Token Pruning and Sparsification for Training cluster, closely aligned with works like Collider Token Filtering[27] and Dynamic Token Pruning[49] that reduce per-step computation by identifying and discarding less informative tokens on-the-fly. Compared to Collider Token Filtering[27], which emphasizes gradient-based selection, Token Filtering Efficiency[0] may explore alternative scoring mechanisms or integration strategies. Meanwhile, Dynamic Token Pruning[49] shares the goal of adaptive sparsification but may differ in architectural assumptions or pruning schedules. These methods collectively aim to bridge the gap between expensive full-token training and the need for scalable, high-quality LLM development.

Claimed Contributions

Algorithm for amplifying sparsity in attention backward kernel

The authors propose a novel attention backward kernel that separately processes gradient outputs to filter activations of filtered tokens while maintaining compatibility with memory-efficient attention implementations like FlashAttention. This design amplifies sparsity throughout the backward pass without causing gradient interference.

1 retrieved paper
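The claimed kernel-level mechanism relies on a general property of attention backpropagation: if a token's upstream gradient row is zeroed, its query-gradient row is exactly zero, so sparsity introduced at the output propagates backward. Below is a minimal NumPy sketch of a toy single-head attention backward pass; the shapes, the `filtered` indices, and the omission of scaling and FlashAttention-style tiling are all illustrative simplifications, not the paper's actual kernel.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_backward(Q, K, V, dO):
    # Toy single-head attention backward (no 1/sqrt(d) scaling, no tiling).
    S = Q @ K.T                 # (T, T) attention logits
    P = softmax(S)              # (T, T) attention weights
    dV = P.T @ dO               # gradient w.r.t. values
    dP = dO @ V.T               # gradient w.r.t. attention weights
    # Row-wise softmax backward: dS = P * (dP - sum(dP * P) per row)
    dS = P * (dP - (dP * P).sum(axis=-1, keepdims=True))
    dQ = dS @ K                 # gradient w.r.t. queries
    dK = dS.T @ Q               # gradient w.r.t. keys
    return dQ, dK, dV

rng = np.random.default_rng(0)
T, d = 8, 4
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
dO = rng.standard_normal((T, d))

filtered = np.array([1, 3, 5])  # hypothetical indices of filtered tokens
dO_f = dO.copy()
dO_f[filtered] = 0.0            # zero the upstream gradients of filtered tokens

dQ, dK, dV = attention_backward(Q, K, V, dO_f)
# Sparsity propagates: query gradients of filtered tokens are exactly zero.
assert np.allclose(dQ[filtered], 0.0)
```

Note that `dK` and `dV` rows of filtered tokens are generally still nonzero, because unfiltered queries attend to filtered keys; deciding which of these cross-terms to drop without gradient interference is presumably where the proposed kernel design does its work.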
System-level transformation of sparse GEMM to dimension-reduced dense GEMM

The authors design an automatic workflow that leverages runtime stability to dynamically identify and update computation graph dimensions and variables, transforming sparse matrix operations into dimension-reduced dense operations that can be efficiently executed by existing machine learning libraries.

3 retrieved papers (1 can refute)
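The transformation this contribution claims can be illustrated in isolation: a GEMM whose left operand is row-sparse (zero rows for filtered tokens) is numerically equivalent to gathering the surviving rows, running a smaller dense GEMM, and scattering the result back. The NumPy sketch below shows that equivalence; the shapes and `keep` indices are invented for illustration and say nothing about how Centrifuge's automatic workflow identifies the dimensions at runtime.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_in, d_out = 16, 8, 4
X = rng.standard_normal((T, d_in))      # token activations
W = rng.standard_normal((d_in, d_out))  # dense weight matrix

keep = np.array([0, 2, 5, 7, 9, 12])    # hypothetical surviving token indices
X_sparse = np.zeros_like(X)
X_sparse[keep] = X[keep]                # row-sparse activations after filtering

# Naive path: full-size GEMM on the row-sparse matrix (wasted FLOPs on zeros).
Y_sparse = X_sparse @ W

# Dimension-reduced path: gather kept rows, run a smaller dense GEMM, scatter.
Y_dense = np.zeros((T, d_out))
Y_dense[keep] = X[keep] @ W             # (len(keep), d_in) @ (d_in, d_out)

assert np.allclose(Y_sparse, Y_dense)
```

With 50% of tokens filtered, the reduced GEMM performs roughly half the FLOPs of the full one while still running on ordinary dense kernels, which is the library-compatibility point the contribution emphasizes.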
CENTRIFUGE system integrating algorithm and system co-design

The authors present CENTRIFUGE as an integrated system combining algorithmic innovations in sparsity amplification with system-level optimizations for efficient computation. The system is designed for seamless integration into existing LLM training frameworks, requiring only one line of code for systems already using token filtering.

10 retrieved papers (1 can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Algorithm for amplifying sparsity in attention backward kernel

The authors propose a novel attention backward kernel that separately processes gradient outputs to filter activations of filtered tokens while maintaining compatibility with memory-efficient attention implementations like FlashAttention. This design amplifies sparsity throughout the backward pass without causing gradient interference.

Contribution

System-level transformation of sparse GEMM to dimension-reduced dense GEMM

The authors design an automatic workflow that leverages runtime stability to dynamically identify and update computation graph dimensions and variables, transforming sparse matrix operations into dimension-reduced dense operations that can be efficiently executed by existing machine learning libraries.

Contribution

CENTRIFUGE system integrating algorithm and system co-design

The authors present CENTRIFUGE as an integrated system combining algorithmic innovations in sparsity amplification with system-level optimizations for efficient computation. The system is designed for seamless integration into existing LLM training frameworks, requiring only one line of code for systems already using token filtering.