When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: LLM ensemble, probability-level ensemble, speculative decoding
Abstract:

Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models’ next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining the ensembling positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we apply a probability sharpening strategy when the ensemble distribution becomes overly smooth, enabling the selection of more confident tokens during ensembling. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
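As background for the probability-level ensembling the abstract refers to, the following is a minimal sketch of fusing next-token distributions by weighted averaging. It is illustrative only, not the paper's implementation, and it assumes the models' vocabularies have already been aligned to a shared index space:

```python
import numpy as np

def ensemble_next_token(distributions, weights=None):
    """Average per-model next-token distributions (aligned to a shared
    vocabulary) and pick the most probable token under the average."""
    probs = np.asarray(distributions, dtype=float)  # shape: (n_models, vocab)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    avg = np.average(probs, axis=0, weights=weights)
    avg /= avg.sum()  # renormalize for numerical safety
    return int(np.argmax(avg)), avg

# Two toy 4-token distributions: model A strongly favors token 1, model B
# narrowly favors token 2, but both place substantial mass on token 1,
# so the averaged distribution selects it.
tok, avg = ensemble_next_token([[0.1, 0.6, 0.2, 0.1],
                                [0.1, 0.4, 0.45, 0.05]])
```

As the paper argues, applying this fusion at every position of a long generation can hurt quality, which motivates choosing *where* to ensemble rather than fusing uniformly.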

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SAFE, a framework for selective token-level ensembling in long-form LLM generation, addressing when and where to combine multiple models' probability distributions. It resides in the 'Selective and Adaptive Ensembling' leaf under 'Probability Distribution Fusion', which contains only three papers total including this work. This represents a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific problem of position-aware ensembling for long-form generation has received limited prior attention compared to other token-level techniques.

The taxonomy reveals that neighboring leaves address related but distinct challenges: 'Vocabulary Alignment for Heterogeneous Models' tackles tokenization discrepancies across models, while 'Full-Sequence Ensembling' applies uniform aggregation without selective positioning. The parent branch 'Token-Level Ensemble Mechanisms' sits alongside 'Multi-Token Prediction and Speculative Decoding' and 'Token-Level Generation Optimization', indicating the field has explored acceleration and control strategies separately from ensemble fusion. SAFE's focus on consensus-based selection and tokenization mismatch bridges these areas by incorporating distribution characteristics into ensemble decisions.

Among the twenty-eight candidates examined, the contribution identifying two key factors for ensembling positions has one refutable candidate among its eight retrieved papers, while the SAFE framework itself and the probability sharpening strategy were each compared against ten candidates with zero refutations. Because the search scope is limited, these statistics reflect top-K semantic matches rather than exhaustive coverage. The framework and sharpening contributions appear more novel within this bounded search, whereas the factor identification has at least one overlapping prior work among the candidates reviewed.

Based on the limited literature search of twenty-eight candidates, the work appears to occupy a sparsely populated niche within token-level ensembling. The taxonomy structure confirms that selective ensembling for long-form generation has fewer dedicated papers than adjacent areas like multi-token prediction or decoding strategies. However, the analysis cannot rule out relevant work outside the top-K semantic neighborhood or in related branches not captured by the search methodology.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: token-level ensembling for large language model generation.

The field encompasses diverse strategies for combining or refining token-level outputs from large language models, organized into several major branches. Token-Level Ensemble Mechanisms and Architectures explores how to fuse probability distributions or adaptively select among multiple models, including methods like Top-k Union[4] and M-Ped[19] that merge predictions at each generation step. Multi-Token Prediction and Speculative Decoding focuses on predicting multiple tokens simultaneously or using draft models to accelerate inference, as seen in SpecInfer[15] and Multi-token Prediction[11]. Token-Level Generation Optimization and Control addresses fine-grained steering of outputs through feedback mechanisms such as Token Feedback RL[8] or gradient-based adjustments like Gradient Recomposition[10]. Meanwhile, Token-Level Analysis and Evaluation investigates uncertainty quantification and hallucination detection at the token level, exemplified by Token Hallucination Benchmark[7] and Reasoning Uncertainty[28]. Token Representation and Processing examines alternative tokenization schemes and cross-vocabulary techniques, while Specialized Applications and Domains applies token-level methods to areas like sequential recommendation or dialogue systems.

Within the ensemble mechanisms branch, a particularly active line of work centers on selective and adaptive ensembling, where systems dynamically choose which models or distributions to combine based on context or confidence. Token Ensemble[0] falls squarely into this cluster, emphasizing adaptive fusion strategies that adjust ensemble weights or selection criteria at each token position. This contrasts with simpler uniform averaging approaches and aligns closely with methods like EnsemW2S[5], which also adapts ensemble behavior during generation.
Compared to Top-k Union[4], which merges candidate sets from multiple models, Token Ensemble[0] appears to focus more on probability-level fusion with learned or heuristic selection rules. A key open question across these works is how to balance computational overhead against the quality gains from adaptive ensembling, especially when scaling to many models or long sequences. The interplay between token-level uncertainty estimation and ensemble selection remains an area of ongoing exploration, as researchers seek principled ways to determine when and how to combine diverse model outputs.
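For concreteness, candidate-set merging in the spirit of Top-k Union[4] might look like the sketch below. This is an assumption about the general idea (pool each model's top-k token ids, then rescore the union), not the cited method's actual algorithm:

```python
def topk_union(dist_a, dist_b, k=2):
    """Take the union of each model's top-k token ids, then rescore the
    pooled candidates by the sum of the two models' probabilities."""
    top = lambda d: sorted(range(len(d)), key=lambda i: d[i], reverse=True)[:k]
    candidates = set(top(dist_a)) | set(top(dist_b))
    return max(candidates, key=lambda i: dist_a[i] + dist_b[i])

# Model A's top-2 is {0, 1} and model B's top-2 is {1, 2}; rescoring the
# union selects token 1, which both models rank highly.
a = [0.40, 0.35, 0.15, 0.10]
b = [0.05, 0.40, 0.45, 0.10]
choice = topk_union(a, b, k=2)
```

Restricting the ensemble to a pooled candidate set keeps the per-step cost bounded even as the vocabulary grows, which speaks to the overhead-versus-quality trade-off noted above.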

Claimed Contributions

SAFE framework for selective token-level ensembling

The authors introduce SAFE, a generate-verify-ensemble framework that determines optimal token positions for ensembling by considering tokenization mismatch and consensus in next-token probability distributions. This selective approach improves both stability and efficiency in long-form generation.

10 retrieved papers
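One decoding step of such a generate-verify-ensemble loop could be sketched as follows. This is a hypothetical illustration using top-1 agreement as a stand-in consensus rule; SAFE's actual criteria, including tokenization-mismatch detection, are defined in the paper:

```python
import numpy as np

def selective_ensemble_step(dists):
    """One decoding step of a generate-verify-ensemble loop: fuse the
    models' distributions only when they agree on the top-1 token;
    otherwise keep the primary model's own prediction."""
    top1 = [int(np.argmax(d)) for d in dists]
    if len(set(top1)) == 1:                 # consensus -> ensemble here
        return int(np.argmax(np.mean(dists, axis=0))), True
    return top1[0], False                   # disagreement -> skip ensembling

# Toy next-token distributions over a 3-token vocabulary.
agree = [np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])]
disagree = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
```

Skipping the fusion at disagreement points is what allows the reported efficiency gains: most tokens are produced by a single model, and the ensemble is consulted only at the selected positions.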
Probability sharpening strategy for ensemble distributions

The authors propose a probability sharpening method that consolidates diffused probability mass when the ensemble distribution becomes overly smooth due to different tokenization schemes. This enables more confident token selection during ensembling.

10 retrieved papers
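A common way to realize such sharpening is temperature exponentiation; the sketch below is an assumption about the general mechanism (raise probabilities to a power greater than one and renormalize), not the paper's exact rule:

```python
import numpy as np

def sharpen(probs, tau=0.5):
    """Concentrate a diffuse distribution by raising it to the power 1/tau
    (tau < 1) and renormalizing, so the mode gains probability mass."""
    p = np.asarray(probs, dtype=float) ** (1.0 / tau)
    return p / p.sum()

flat = np.array([0.30, 0.25, 0.25, 0.20])   # overly smooth ensemble output
sharp = sharpen(flat, tau=0.5)               # mass shifts toward the mode
```

The mode's identity is unchanged; sharpening only increases its margin, which makes the subsequent token selection more confident when the fused distribution has been flattened by vocabulary differences.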
Identification of two key factors for ensembling positions

The authors identify and formalize two critical factors that determine when ensembling should occur: whether tokenization mismatch introduces OOV-like tokens and whether models exhibit sufficient consensus in their next-token distributions. These factors guide the selective ensembling strategy.

8 retrieved papers
Can Refute
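The consensus factor could be operationalized, for illustration, with top-1 agreement plus a Jensen-Shannon divergence threshold. The names `sufficient_consensus` and `max_jsd` below are hypothetical, not from the paper:

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two next-token distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sufficient_consensus(p, q, max_jsd=0.2):
    """Treat a position as ensemble-worthy when the models share a top-1
    token and their full distributions are close."""
    return int(np.argmax(p)) == int(np.argmax(q)) and jensen_shannon(p, q) <= max_jsd

close = sufficient_consensus([0.6, 0.3, 0.1], [0.5, 0.35, 0.15])
far = sufficient_consensus([0.6, 0.3, 0.1], [0.1, 0.8, 0.1])
```

A distribution-level divergence catches cases where the models agree on the top token but disagree about everything else, which a bare top-1 check would miss.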

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: SAFE framework for selective token-level ensembling

Contribution: Probability sharpening strategy for ensemble distributions

Contribution: Identification of two key factors for ensembling positions