Closing the Gap Between Text and Speech Understanding in LLMs

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: speech language models, large language models, multimodal language models, modality alignment, cross-modal alignment, cross-modal transfer, cross-modal distillation, modality gap, speech processing
Abstract:

Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts—and even cascaded pipelines—on language understanding tasks. We term this shortfall the text–speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text–speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD—Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation—which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from publicly available corpora.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper addresses the text–speech understanding gap in speech-adapted LLMs through a dual-factor analysis (forgetting and cross-modal misalignment) and proposes SALAD, a sample-efficient alignment method. It resides in the 'Encoder-LLM Connection Mechanisms' leaf, which contains only three papers total, including this one. This leaf sits within the broader 'Speech-to-Text Integration Architectures' branch, indicating a relatively focused research direction concerned with how speech encoders connect to LLM decoders. The small sibling count suggests this specific framing—connector design and alignment strategies—is not yet densely populated, though neighboring leaves address related decoder-only architectures and tokenization approaches.

The taxonomy reveals that the paper's immediate neighbors explore architectural bridging (e.g., adapters, projectors) and prompt-based recognition techniques, while parallel branches investigate parameter-efficient tuning, multi-task training, and rescoring methods. The 'Speech-Augmented LLM Training and Adaptation' branch, for instance, contains work on low-rank adaptation and curriculum learning, which share the goal of preserving text capabilities while adding speech modality. The taxonomy's scope and exclude notes clarify that this leaf focuses on explicit encoder-connector-decoder separation, distinguishing it from unified decoder-only models and from broader multimodal frameworks that integrate vision or audio events alongside speech.

Among fourteen candidates examined, the analysis found three refutable pairs across the paper's contributions. For the first contribution (gap quantification via forgetting and misalignment), four candidates were examined, two of which appear to provide overlapping prior work. For the second contribution (analysis of training objectives and data regimes), ten candidates were examined, yielding one refutable match. For the third contribution (the SALAD method), no candidates were retrieved, suggesting that no direct prior work was identified within this limited search scope. These statistics indicate that the conceptual framing of the gap may have precedent, while the specific SALAD approach appears less directly anticipated among the top-fourteen semantic matches.

Given the limited search scope of fourteen candidates, the analysis captures nearby work but cannot claim exhaustive coverage. The small leaf size and moderate refutation counts suggest the paper operates in a moderately explored niche, with some conceptual overlap on gap analysis but potentially novel methodological contributions in the SALAD framework. The taxonomy context indicates that while encoder-LLM connection is an active concern, this specific combination of distillation and active selection may represent a distinct angle within the broader alignment challenge.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 3

Research Landscape Overview

Core task: Adapting large language models to process speech inputs. The field has evolved into several major branches that reflect different strategies for bridging the gap between speech and text-based LLMs. Speech-to-Text Integration Architectures focus on how to connect speech encoders to LLMs, exploring various encoder-LLM connection mechanisms such as adapters, projectors, and cross-attention layers. Speech-Augmented LLM Training and Adaptation addresses how to fine-tune or adapt pretrained LLMs for speech understanding, while Speech Recognition with LLMs leverages LLM capabilities for improved transcription and formatting. Multimodal Speech-Language Models combine speech with other modalities like vision, and Omni-Modal Language Models aim for unified processing across many input types. Speech Generation and Synthesis with LLMs explores the reverse direction, using LLMs to produce speech outputs, while Supporting Methods and Resources provide datasets, benchmarks, and auxiliary techniques. Representative works like Salmonn[3], AudioPaLM[6], and SpeechGPT[7] illustrate different architectural choices across these branches.

A particularly active area involves designing effective connection mechanisms between speech encoders and LLMs, where trade-offs emerge between computational efficiency, information preservation, and adaptation complexity. Works like Connecting Speech Encoder[13] and Prompting Speech Recognition[2] explore how to best align speech representations with LLM token spaces, while Closing Gap Text Speech[0] sits within this encoder-LLM connection landscape, emphasizing strategies to minimize the representational mismatch between modalities.
Compared to neighbors like Prompting Speech Recognition[2], which focuses on prompt-based techniques for recognition tasks, and Connecting Speech Encoder[13], which examines architectural bridging strategies, the original work appears to concentrate on reducing the fundamental gap between speech and text representations. This theme resonates across many studies, as researchers grapple with whether to preserve rich acoustic details or compress speech into text-like tokens, and whether to freeze LLM parameters or allow joint training.

Claimed Contributions

Quantification of text–speech understanding gap via forgetting and cross-modal misalignment

The authors formalize and measure two factors driving the performance gap between speech-adapted and text-based LLMs: forgetting (loss of pretrained text capabilities) and cross-modal misalignment (inconsistent outputs for equivalent speech and text inputs). They demonstrate these measures strongly predict downstream language understanding performance.

4 retrieved papers
Can Refute
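The report does not reproduce the paper's exact formulations, but as a purely illustrative sketch (all function names and numbers below are hypothetical), the two measures could be operationalized as a text-accuracy drop and a cross-modal disagreement rate:

```python
# Hypothetical sketch of the two gap measures; the paper's exact
# definitions may differ.

def forgetting(text_acc_before, text_acc_after):
    """Forgetting: drop in text-task accuracy after speech adaptation."""
    return max(0.0, text_acc_before - text_acc_after)

def misalignment(text_answers, speech_answers):
    """Cross-modal misalignment: fraction of paired prompts where the
    adapted model answers differently for text vs. equivalent speech."""
    pairs = list(zip(text_answers, speech_answers))
    return sum(t != s for t, s in pairs) / len(pairs)

# Toy numbers, for illustration only:
f = forgetting(0.72, 0.61)                                    # 0.11 drop
m = misalignment(["A", "B", "C", "D"], ["A", "B", "D", "D"])  # 0.25
```

Under this reading, a well-aligned adaptation drives both quantities toward zero, which is consistent with the claim that these measures predict downstream understanding performance.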
Analysis of training objectives and data regimes on forgetting and misalignment

The authors analyze how different training objectives (maximum likelihood vs. cross-modal distillation) and data domains (narrow vs. broad) affect forgetting and misalignment. They find that cross-modal distillation is more effective than standard maximum likelihood training for reducing both issues.

10 retrieved papers
Can Refute
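For intuition, a cross-modal distillation objective of this kind is commonly written as a divergence between the text teacher's and the speech student's next-token distributions. The NumPy sketch below shows a per-position KL term; it is an assumption-laden illustration, not the paper's actual loss:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions: the speech-adapted
    student is pushed toward the text teacher's token distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

t = np.array([[2.0, 0.5, -1.0]])         # teacher logits, one position
loss_same = distill_loss(t, t)           # identical distributions: 0.0
loss_diff = distill_loss(t, t[:, ::-1])  # mismatched distributions: > 0
```

Unlike maximum-likelihood training on transcripts, such an objective supervises the full output distribution rather than a single target token, which is one plausible reason it would reduce both forgetting and misalignment.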
SALAD method combining cross-modal distillation with active data selection

The authors propose SALAD, a two-stage training method that first applies cross-modal distillation on natural speech, then uses active learning to select text samples for synthesis based on model-detected misalignment. This approach achieves competitive performance while using over an order of magnitude less training data than existing methods.

0 retrieved papers
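As described, the second SALAD stage is an active-learning loop: score candidate text samples by model-detected misalignment and synthesize speech only for the highest-scoring ones. A minimal selection sketch, with invented identifiers and scores:

```python
def select_for_synthesis(candidates, score, budget):
    """Keep the `budget` text samples with the highest misalignment
    scores; only these are sent to speech synthesis."""
    return sorted(candidates, key=score, reverse=True)[:budget]

# Invented per-sample misalignment estimates:
scores = {"sample_1": 0.9, "sample_2": 0.1, "sample_3": 0.6}
picked = select_for_synthesis(list(scores), scores.get, budget=2)
# picked == ["sample_1", "sample_3"]
```

Concentrating synthesis on samples where the adapted model already disagrees with its text counterpart is what would make the approach sample-efficient relative to synthesizing entire text corpora.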

Core Task Comparisons

Comparisons with papers in the same taxonomy category
