Closing the Gap Between Text and Speech Understanding in LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: speech language models, large language models, multimodal language models, modality alignment, cross-modal alignment, cross-modal transfer, cross-modal distillation, modality gap, speech processing
Abstract:

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts—and even cascaded pipelines—on language understanding tasks. We term this shortfall the text–speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text–speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD—Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation—which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from publicly available corpora.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper addresses the text–speech understanding gap in speech-adapted LLMs through a dual-factor analysis (forgetting and cross-modal misalignment) and proposes SALAD, a sample-efficient alignment method. It resides in the 'Encoder-LLM Connection Mechanisms' leaf, which contains only three papers total, including this one. This leaf sits within the broader 'Speech-to-Text Integration Architectures' branch, indicating a relatively focused research direction concerned with how speech encoders connect to LLM decoders. The small sibling count suggests this specific framing—connector design and alignment strategies—is not yet densely populated, though neighboring leaves address related decoder-only architectures and tokenization approaches.

The taxonomy reveals that the paper's immediate neighbors explore architectural bridging (e.g., adapters, projectors) and prompt-based recognition techniques, while parallel branches investigate parameter-efficient tuning, multi-task training, and rescoring methods. The 'Speech-Augmented LLM Training and Adaptation' branch, for instance, contains work on low-rank adaptation and curriculum learning, which share the goal of preserving text capabilities while adding speech modality. The taxonomy's scope and exclude notes clarify that this leaf focuses on explicit encoder-connector-decoder separation, distinguishing it from unified decoder-only models and from broader multimodal frameworks that integrate vision or audio events alongside speech.

Among fourteen candidates examined, the analysis found three refutable pairs across the paper's contributions. For the first contribution (gap quantification via forgetting and misalignment), four candidates were examined, two of which appear to provide overlapping prior work. For the second contribution (analysis of training objectives and data regimes), ten candidates were examined, yielding one refutable match. For the third contribution (the SALAD method), no candidates were retrieved, suggesting that no direct prior work was identified within this limited search scope. These statistics indicate that the conceptual framing of the gap may have precedent, while the specific SALAD approach appears less directly anticipated among the top-fourteen semantic matches.

Given the limited search scope of fourteen candidates, the analysis captures nearby work but cannot claim exhaustive coverage. The small leaf size and moderate refutation counts suggest the paper operates in a moderately explored niche, with some conceptual overlap on gap analysis but potentially novel methodological contributions in the SALAD framework. The taxonomy context indicates that while encoder-LLM connection is an active concern, this specific combination of distillation and active selection may represent a distinct angle within the broader alignment challenge.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 3

Research Landscape Overview

Core task: Adapting large language models to process speech inputs. The field has evolved into several major branches that reflect different strategies for bridging the gap between speech and text-based LLMs. Speech-to-Text Integration Architectures focus on how to connect speech encoders to LLMs, exploring various encoder-LLM connection mechanisms such as adapters, projectors, and cross-attention layers. Speech-Augmented LLM Training and Adaptation addresses how to fine-tune or adapt pretrained LLMs for speech understanding, while Speech Recognition with LLMs leverages LLM capabilities for improved transcription and formatting. Multimodal Speech-Language Models combine speech with other modalities like vision, and Omni-Modal Language Models aim for unified processing across many input types. Speech Generation and Synthesis with LLMs explores the reverse direction, using LLMs to produce speech outputs, while Supporting Methods and Resources provide datasets, benchmarks, and auxiliary techniques. Representative works like Salmonn[3], AudioPaLM[6], and SpeechGPT[7] illustrate different architectural choices across these branches.

A particularly active area involves designing effective connection mechanisms between speech encoders and LLMs, where trade-offs emerge between computational efficiency, information preservation, and adaptation complexity. Works like Connecting Speech Encoder[13] and Prompting Speech Recognition[2] explore how to best align speech representations with LLM token spaces, while Closing Gap Text Speech[0] sits within this encoder-LLM connection landscape, emphasizing strategies to minimize the representational mismatch between modalities.
Compared to neighbors like Prompting Speech Recognition[2], which focuses on prompt-based techniques for recognition tasks, and Connecting Speech Encoder[13], which examines architectural bridging strategies, the original work appears to concentrate on reducing the fundamental gap between speech and text representations. This theme resonates across many studies, as researchers grapple with whether to preserve rich acoustic details or compress speech into text-like tokens, and whether to freeze LLM parameters or allow joint training.

Claimed Contributions

Quantification of text–speech understanding gap via forgetting and cross-modal misalignment

The authors formalize and measure two factors driving the performance gap between speech-adapted and text-based LLMs: forgetting (loss of pretrained text capabilities) and cross-modal misalignment (inconsistent outputs for equivalent speech and text inputs). They demonstrate these measures strongly predict downstream language understanding performance.

4 retrieved papers
Can Refute
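The report does not reproduce the paper's exact formulations, but as a purely illustrative sketch (all function names and numbers below are hypothetical), the two measures could be operationalized as a text-accuracy drop and a cross-modal disagreement rate:

```python
# Hypothetical sketch of the two gap measures; the paper's exact
# definitions may differ.

def forgetting(text_acc_before, text_acc_after):
    """Forgetting: drop in text-task accuracy after speech adaptation."""
    return max(0.0, text_acc_before - text_acc_after)

def misalignment(text_answers, speech_answers):
    """Cross-modal misalignment: fraction of paired prompts where the
    adapted model answers differently for text vs. equivalent speech."""
    pairs = list(zip(text_answers, speech_answers))
    return sum(t != s for t, s in pairs) / len(pairs)

# Toy numbers, for illustration only:
f = forgetting(0.72, 0.61)                                    # 0.11 drop
m = misalignment(["A", "B", "C", "D"], ["A", "B", "D", "D"])  # 0.25
```

Under this reading, a well-aligned adaptation drives both quantities toward zero, which is consistent with the claim that these measures predict downstream understanding performance.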
Analysis of training objectives and data regimes on forgetting and misalignment

The authors analyze how different training objectives (maximum likelihood vs. cross-modal distillation) and data domains (narrow vs. broad) affect forgetting and misalignment. They find that cross-modal distillation is more effective than standard maximum likelihood training for reducing both issues.

10 retrieved papers
Can Refute
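For intuition, a cross-modal distillation objective of this kind is commonly written as a divergence between the text teacher's and the speech student's next-token distributions. The NumPy sketch below shows a per-position KL term; it is an assumption-laden illustration, not the paper's actual loss:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions: the speech-adapted
    student is pushed toward the text teacher's token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

t = np.array([[2.0, 0.5, -1.0]])         # teacher logits, one position
loss_same = distill_loss(t, t)           # identical distributions: 0.0
loss_diff = distill_loss(t, t[:, ::-1])  # mismatched distributions: > 0
```

Unlike maximum-likelihood training on transcripts, such an objective supervises the full output distribution rather than a single target token, which is one plausible reason it would reduce both forgetting and misalignment.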
SALAD method combining cross-modal distillation with active data selection

The authors propose SALAD, a two-stage training method that first applies cross-modal distillation on natural speech, then uses active learning to select text samples for synthesis based on model-detected misalignment. This approach achieves competitive performance while using over an order of magnitude less training data than existing methods.

0 retrieved papers
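As described, the second SALAD stage is an active-learning loop: score candidate text samples by model-detected misalignment and synthesize speech only for the highest-scoring ones. A minimal selection sketch, with invented identifiers and scores:

```python
def select_for_synthesis(candidates, score, budget):
    """Keep the `budget` text samples with the highest misalignment
    scores; only these are sent to speech synthesis."""
    return sorted(candidates, key=score, reverse=True)[:budget]

# Invented per-sample misalignment estimates:
scores = {"sample_1": 0.9, "sample_2": 0.1, "sample_3": 0.6}
picked = select_for_synthesis(list(scores), scores.get, budget=2)
# picked == ["sample_1", "sample_3"]
```

Concentrating synthesis on samples where the adapted model already disagrees with its text counterpart is what would make the approach sample-efficient relative to synthesizing entire text corpora.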

Core Task Comparisons

Comparisons with papers in the same taxonomy category
