ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Supervised Fine-tuning, Data Selection
Abstract:

Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration—ssToken—achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency. Source code is available at https://anonymous.4open.science/r/Submission2116-B7C5.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ssToken, a token selection method that combines self-modulated loss differences with attention-based semantic importance. It resides in the Loss-based Token Selection leaf, which contains four papers including the original work. This leaf sits within the broader Token-level Selection Methods branch, indicating a moderately populated research direction focused on fine-grained data curation. The taxonomy shows that loss-based approaches represent one of several competing paradigms for token selection, alongside semantic methods and critical token identification, suggesting this is an active but not overcrowded subfield.

The taxonomy reveals neighboring leaves for Semantic and Attention-based Selection (three papers) and Token Weighting schemes (two papers), indicating that ssToken bridges multiple methodological traditions. The scope note for Loss-based Token Selection explicitly excludes purely semantic approaches, yet ssToken integrates both loss and attention signals, positioning it at the boundary between these categories. Related branches address alignment-based fine-tuning and efficiency optimization, showing that token selection research spans quality improvement, computational savings, and preference alignment, with ssToken focusing primarily on the quality dimension through adaptive selection.

Among the twenty candidates examined across the three contributions, the core ssToken approach drew one refutable candidate out of ten examined, while the self-modulated mechanism (two candidates) and the semantic metric (eight candidates) drew no clear refutations. Because the search covers only top-K semantic matches, these statistics are not exhaustive. The self-modulated contribution appears more novel given zero refutations, though from only two candidates; the semantic metric's zero refutations from eight candidates suggests either genuine novelty or that the search did not surface closely related attention-based importance work. The single refutation for the overall approach indicates some prior overlap within the examined literature.

Based on the limited twenty-candidate search, ssToken appears to occupy a recognizable position within loss-based token selection while introducing distinguishing features through history model comparison and semantic weighting. The taxonomy context shows this is an established research direction with multiple competing methods, and the contribution-level statistics suggest incremental advancement rather than a paradigm shift. The analysis cannot assess whether exhaustive search would reveal additional overlapping work, particularly in the attention-based semantic estimation component.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: Token-level data selection for supervised fine-tuning of large language models. The field has evolved from traditional sample-level curation toward finer-grained strategies that identify which tokens within each training example contribute most to model improvement. The taxonomy reflects this progression through several main branches:

- Token-level Selection Methods: loss-based or gradient-based criteria to filter or reweight individual tokens (e.g., Rho-1[30], Not All Tokens[26]).
- Sample-level and Hybrid Selection: approaches that combine instance-level filtering with token-aware heuristics.
- Alignment and Preference-based Fine-tuning: human feedback and preference optimization at varying granularities (Preference-grounded Token[9], Selective Preference Optimization[24]).
- Domain-specific and Task-adaptive Fine-tuning: selection tailored to specialized corpora or downstream tasks.
- Efficiency and Resource Optimization: computational savings through pruning or adaptive mechanisms (Dynamic Token Pruning[34], Memory-Efficient Token Selection[42]).
- Robustness and Generalization: how selective training improves out-of-distribution performance.
- Specialized Fine-tuning Paradigms: federated and continual learning settings.
- Surveys and Frameworks: overarching perspectives (Data Selection Survey[17], LLM Data Survey[35]).

Within Token-level Selection Methods, a particularly active line of work leverages training loss to identify high-value tokens, balancing the desire to focus on informative examples against the risk of overfitting to noisy signals. ssToken[0] sits squarely in this Loss-based Token Selection cluster, emphasizing principled criteria for deciding which tokens merit gradient updates. Nearby efforts such as Token Cleaning[7] and Rho-1[30] similarly exploit loss or perplexity thresholds but differ in how aggressively they prune low-utility tokens versus reweighting them.
A key trade-off across these methods is whether to discard tokens entirely or adjust their contribution dynamically, with some works (e.g., Token Weighting[31]) favoring soft weighting schemes. Open questions remain around optimal threshold selection, the interplay between token-level and sample-level quality, and how these strategies scale to diverse domains beyond general instruction-following.
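The discard-versus-reweight trade-off can be made concrete with a toy sketch. The function below is illustrative only, not any cited paper's rule: the `keep_ratio` default and the softmax-based soft weighting are assumptions chosen for clarity.

```python
import torch

def select_tokens(losses, keep_ratio=0.6, soft=False):
    """Contrast hard token pruning with soft token weighting.

    losses: per-token training losses, shape (seq_len,).
    Returns a per-token weight vector of the same shape.
    """
    if soft:
        # Soft weighting: higher-loss tokens get larger weight;
        # rescale so the weights sum to the number of tokens.
        weights = torch.softmax(losses, dim=-1) * losses.numel()
    else:
        # Hard pruning: keep only the top-k highest-loss tokens.
        k = max(1, int(keep_ratio * losses.numel()))
        threshold = torch.topk(losses, k).values.min()
        weights = (losses >= threshold).float()
    return weights

losses = torch.tensor([0.2, 1.5, 0.9, 3.0, 0.1])
print(select_tokens(losses))             # hard 0/1 mask
print(select_tokens(losses, soft=True))  # smooth weights
```

Hard selection zeroes out low-utility tokens entirely, while soft weighting keeps every token's gradient but rescales its contribution, which is the distinction drawn between pruning-style methods and Token Weighting[31]-style schemes.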

Claimed Contributions

ssToken: Self-modulated and Semantic-aware Token Selection approach

The authors introduce ssToken, a novel token-level data selection method for LLM fine-tuning that combines self-modulated selection using history models and semantic-aware selection based on attention mechanisms, eliminating the need for additional reference models while preserving semantically important tokens.

10 retrieved papers (1 can refute)
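As a rough illustration of how a loss-based signal and an attention-based signal might be fused, the sketch below min-max normalizes each signal and mixes them with a weight `alpha`. Both the normalization and the linear mixing rule are assumptions for illustration, not the paper's actual fusion scheme.

```python
import torch

def combined_score(loss_diff, attn_imp, alpha=0.5):
    """Mix a loss-difference signal with an attention-importance signal.

    alpha: assumed mixing weight in [0, 1]; the paper may fuse
    the two signals differently.
    """
    def minmax(x):
        # Map each signal onto [0, 1] so the two are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * minmax(loss_diff) + (1 - alpha) * minmax(attn_imp)

loss_diff = torch.tensor([0.0, 2.0, 1.0])
attn_imp = torch.tensor([0.5, 0.1, 0.4])
print(combined_score(loss_diff, attn_imp))
```

A token scoring low on the loss signal can still survive selection through a high attention score, which is how a combination like this could preserve semantically important tokens that loss-only filtering would drop.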
Self-modulated token selection using history models

The method uses per-token loss differences between history models and the current model to create an adaptive selection signal along the optimization trajectory, replacing the need for offline-trained reference models used in prior work.

2 retrieved papers (none can refute)
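A minimal sketch of this idea, assuming tokens with the largest history-minus-current loss gap are kept via a top-k rule; the paper's exact sign convention, thresholding, and choice of history checkpoint may differ.

```python
import torch
import torch.nn.functional as F

def self_modulated_mask(history_logits, current_logits, labels,
                        keep_ratio=0.5):
    """Select tokens by the per-token loss gap between a history
    checkpoint and the current model (assumed selection rule).

    history_logits, current_logits: (batch, seq, vocab)
    labels: (batch, seq) target token ids
    Returns a (batch, seq) 0/1 selection mask.
    """
    # Per-token cross-entropy under each model (no reduction).
    hist_loss = F.cross_entropy(history_logits.transpose(1, 2), labels,
                                reduction="none")
    curr_loss = F.cross_entropy(current_logits.transpose(1, 2), labels,
                                reduction="none")
    diff = hist_loss - curr_loss  # signal along the optimization trajectory
    k = max(1, int(keep_ratio * diff.numel()))
    threshold = torch.topk(diff.flatten(), k).values.min()
    return (diff >= threshold).float()

torch.manual_seed(0)
hist = torch.randn(1, 6, 10)   # frozen earlier checkpoint's logits
curr = torch.randn(1, 6, 10)   # current model's logits
labels = torch.randint(0, 10, (1, 6))
print(self_modulated_mask(hist, curr, labels))
```

Because the "reference" here is just an earlier checkpoint of the same run, the signal adapts as training progresses, which is the claimed advantage over a fixed, offline-trained reference model.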
Semantic-aware attention-based token importance estimation metric

The authors develop an attention-based metric for estimating token importance that captures semantic information complementary to loss-based selection, enabling more effective filtering of tokens during fine-tuning.

8 retrieved papers (none can refute)
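One plausible instantiation of such a metric scores each token by the attention mass it receives, pooled over layers, heads, and query positions. This pooling choice is an assumption for illustration, not the authors' exact formulation.

```python
import torch

def attention_importance(attn):
    """Score each token by the attention it receives.

    attn: attention weights (layers, heads, seq, seq), where
    attn[l, h, i, j] is how strongly query position i attends to key j.
    Pools over layers, heads, and query positions, then normalizes
    so the scores sum to 1.
    """
    received = attn.mean(dim=(0, 1, 2))  # -> (seq,)
    return received / received.sum()

# Sanity check: under uniform attention, every token is equally important.
uniform = torch.full((2, 4, 5, 5), 0.2)
print(attention_importance(uniform))
```

Since this score is independent of the training loss, it can flag tokens that anchor a sequence's meaning even when their loss-based value is unremarkable, which is the complementarity the contribution claims.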

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ssToken: Self-modulated and Semantic-aware Token Selection approach

Contribution

Self-modulated token selection using history models

Contribution

Semantic-aware attention-based token importance estimation metric
