ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Supervised Fine-tuning, Data Selection
Abstract:

Data quality plays a critical role in enhancing supervised fine-tuning (SFT) for large language models (LLMs), and token-level data selection has emerged as a promising direction for its fine-grained nature. Despite their strong empirical performance, existing token-level selection methods share two key limitations: (1) requiring training or accessing an additional reference model, and (2) relying solely on loss information for token selection, which cannot well preserve semantically important tokens that are not favored by loss-based metrics. To address these challenges, we propose ssToken, a Self-modulated and Semantic-aware Token Selection approach. ssToken leverages readily accessible history models to compute the per-token loss difference with the current model, which serves as a self-modulated signal that enables the model to adaptively select tokens along its optimization trajectory, rather than relying on excess loss from an offline-trained reference model as in prior works. We further introduce a semantic-aware, attention-based token importance estimation metric, orthogonal to loss-based selection and providing complementary semantic information for more effective filtering. Extensive experiments across different model families and scales demonstrate that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning, while their integration—ssToken—achieves synergistic gains and further surpasses prior token-level selection methods, delivering performance improvements while maintaining training efficiency. Source code is available at https://anonymous.4open.science/r/Submission2116-B7C5.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ssToken, a token selection method that combines self-modulated loss differences with attention-based semantic importance. It resides in the Loss-based Token Selection leaf, which contains four papers including the original work. This leaf sits within the broader Token-level Selection Methods branch, indicating a moderately populated research direction focused on fine-grained data curation. The taxonomy shows that loss-based approaches represent one of several competing paradigms for token selection, alongside semantic methods and critical token identification, suggesting this is an active but not overcrowded subfield.

The taxonomy reveals neighboring leaves for Semantic and Attention-based Selection (three papers) and Token Weighting schemes (two papers), indicating that ssToken bridges multiple methodological traditions. The scope note for Loss-based Token Selection explicitly excludes purely semantic approaches, yet ssToken integrates both loss and attention signals, positioning it at the boundary between these categories. Related branches address alignment-based fine-tuning and efficiency optimization, showing that token selection research spans quality improvement, computational savings, and preference alignment, with ssToken focusing primarily on the quality dimension through adaptive selection.

Among the twenty candidates examined across the three contributions, the core ssToken approach drew one refutable candidate out of ten examined, while the self-modulated mechanism (two candidates) and the semantic metric (eight candidates) drew no clear refutations. Because the search covers only top-K semantic matches, these statistics are not exhaustive. The self-modulated contribution appears more novel given zero refutations, though from only two candidates; the semantic metric's zero refutations from eight candidates suggests either genuine novelty or that the search did not surface closely related attention-based importance work. The single refutation for the overall approach indicates some prior overlap within the examined literature.

Based on the limited twenty-candidate search, ssToken appears to occupy a recognizable position within loss-based token selection while introducing distinguishing features through history model comparison and semantic weighting. The taxonomy context shows this is an established research direction with multiple competing methods, and the contribution-level statistics suggest incremental advancement rather than a paradigm shift. The analysis cannot assess whether exhaustive search would reveal additional overlapping work, particularly in the attention-based semantic estimation component.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Paper: 1

Research Landscape Overview

Core task: Token-level data selection for supervised fine-tuning of large language models. The field has evolved from traditional sample-level curation toward finer-grained strategies that identify which tokens within each training example contribute most to model improvement. The taxonomy reflects this progression through several main branches:

- Token-level Selection Methods: loss-based or gradient-based criteria to filter or reweight individual tokens (e.g., Rho-1[30], Not All Tokens[26]).
- Sample-level and Hybrid Selection: approaches that combine instance-level filtering with token-aware heuristics.
- Alignment and Preference-based Fine-tuning: human feedback and preference optimization at varying granularities (Preference-grounded Token[9], Selective Preference Optimization[24]).
- Domain-specific and Task-adaptive Fine-tuning: selection tailored to specialized corpora or downstream tasks.
- Efficiency and Resource Optimization: computational savings through pruning or adaptive mechanisms (Dynamic Token Pruning[34], Memory-Efficient Token Selection[42]).
- Robustness and Generalization: how selective training improves out-of-distribution performance.
- Specialized Fine-tuning Paradigms: federated and continual learning settings.
- Surveys and Frameworks: overarching perspectives (Data Selection Survey[17], LLM Data Survey[35]).

Within Token-level Selection Methods, a particularly active line of work leverages training loss to identify high-value tokens, balancing the desire to focus on informative examples against the risk of overfitting to noisy signals. ssToken[0] sits squarely in this Loss-based Token Selection cluster, emphasizing principled criteria for deciding which tokens merit gradient updates. Nearby efforts such as Token Cleaning[7] and Rho-1[30] similarly exploit loss or perplexity thresholds but differ in how aggressively they prune low-utility tokens versus reweighting them.
A key trade-off across these methods is whether to discard tokens entirely or adjust their contribution dynamically, with some works (e.g., Token Weighting[31]) favoring soft weighting schemes. Open questions remain around optimal threshold selection, the interplay between token-level and sample-level quality, and how these strategies scale to diverse domains beyond general instruction-following.
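The discard-versus-reweight trade-off can be made concrete with a toy sketch. The function below is illustrative only, not any cited paper's rule: the `keep_ratio` default and the softmax-based soft weighting are assumptions chosen for clarity.

```python
import torch

def select_tokens(losses, keep_ratio=0.6, soft=False):
    """Contrast hard token pruning with soft token weighting.

    losses: per-token training losses, shape (seq_len,).
    Returns a per-token weight vector of the same shape.
    """
    if soft:
        # Soft weighting: higher-loss tokens get larger weight;
        # rescale so the weights sum to the number of tokens.
        weights = torch.softmax(losses, dim=-1) * losses.numel()
    else:
        # Hard pruning: keep only the top-k highest-loss tokens.
        k = max(1, int(keep_ratio * losses.numel()))
        threshold = torch.topk(losses, k).values.min()
        weights = (losses >= threshold).float()
    return weights

losses = torch.tensor([0.2, 1.5, 0.9, 3.0, 0.1])
print(select_tokens(losses))             # hard 0/1 mask
print(select_tokens(losses, soft=True))  # smooth weights
```

Hard selection zeroes out low-utility tokens entirely, while soft weighting keeps every token's gradient but rescales its contribution, which is the distinction drawn between pruning-style methods and Token Weighting[31]-style schemes.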

Claimed Contributions

ssToken: Self-modulated and Semantic-aware Token Selection approach

The authors introduce ssToken, a novel token-level data selection method for LLM fine-tuning that combines self-modulated selection using history models and semantic-aware selection based on attention mechanisms, eliminating the need for additional reference models while preserving semantically important tokens.

10 retrieved papers (1 can refute)
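As a rough illustration of how a loss-based signal and an attention-based signal might be fused, the sketch below min-max normalizes each signal and mixes them with a weight `alpha`. Both the normalization and the linear mixing rule are assumptions for illustration, not the paper's actual fusion scheme.

```python
import torch

def combined_score(loss_diff, attn_imp, alpha=0.5):
    """Mix a loss-difference signal with an attention-importance signal.

    alpha: assumed mixing weight in [0, 1]; the paper may fuse
    the two signals differently.
    """
    def minmax(x):
        # Map each signal onto [0, 1] so the two are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * minmax(loss_diff) + (1 - alpha) * minmax(attn_imp)

loss_diff = torch.tensor([0.0, 2.0, 1.0])
attn_imp = torch.tensor([0.5, 0.1, 0.4])
print(combined_score(loss_diff, attn_imp))
```

A token scoring low on the loss signal can still survive selection through a high attention score, which is how a combination like this could preserve semantically important tokens that loss-only filtering would drop.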
Self-modulated token selection using history models

The method uses per-token loss differences between history models and the current model to create an adaptive selection signal along the optimization trajectory, replacing the need for offline-trained reference models used in prior work.

2 retrieved papers (none can refute)
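A minimal sketch of this idea, assuming tokens with the largest history-minus-current loss gap are kept via a top-k rule; the paper's exact sign convention, thresholding, and choice of history checkpoint may differ.

```python
import torch
import torch.nn.functional as F

def self_modulated_mask(history_logits, current_logits, labels,
                        keep_ratio=0.5):
    """Select tokens by the per-token loss gap between a history
    checkpoint and the current model (assumed selection rule).

    history_logits, current_logits: (batch, seq, vocab)
    labels: (batch, seq) target token ids
    Returns a (batch, seq) 0/1 selection mask.
    """
    # Per-token cross-entropy under each model (no reduction).
    hist_loss = F.cross_entropy(history_logits.transpose(1, 2), labels,
                                reduction="none")
    curr_loss = F.cross_entropy(current_logits.transpose(1, 2), labels,
                                reduction="none")
    diff = hist_loss - curr_loss  # signal along the optimization trajectory
    k = max(1, int(keep_ratio * diff.numel()))
    threshold = torch.topk(diff.flatten(), k).values.min()
    return (diff >= threshold).float()

torch.manual_seed(0)
hist = torch.randn(1, 6, 10)   # frozen earlier checkpoint's logits
curr = torch.randn(1, 6, 10)   # current model's logits
labels = torch.randint(0, 10, (1, 6))
print(self_modulated_mask(hist, curr, labels))
```

Because the "reference" here is just an earlier checkpoint of the same run, the signal adapts as training progresses, which is the claimed advantage over a fixed, offline-trained reference model.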
Semantic-aware attention-based token importance estimation metric

The authors develop an attention-based metric for estimating token importance that captures semantic information complementary to loss-based selection, enabling more effective filtering of tokens during fine-tuning.

8 retrieved papers (none can refute)
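One plausible instantiation of such a metric scores each token by the attention mass it receives, pooled over layers, heads, and query positions. This pooling choice is an assumption for illustration, not the authors' exact formulation.

```python
import torch

def attention_importance(attn):
    """Score each token by the attention it receives.

    attn: attention weights (layers, heads, seq, seq), where
    attn[l, h, i, j] is how strongly query position i attends to key j.
    Pools over layers, heads, and query positions, then normalizes
    so the scores sum to 1.
    """
    received = attn.mean(dim=(0, 1, 2))  # -> (seq,)
    return received / received.sum()

# Sanity check: under uniform attention, every token is equally important.
uniform = torch.full((2, 4, 5, 5), 0.2)
print(attention_importance(uniform))
```

Since this score is independent of the training loss, it can flag tokens that anchor a sequence's meaning even when their loss-based value is unremarkable, which is the complementarity the contribution claims.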

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ssToken: Self-modulated and Semantic-aware Token Selection approach

Contribution

Self-modulated token selection using history models

Contribution

Semantic-aware attention-based token importance estimation metric
