Closing the Gap Between Text and Speech Understanding in LLMs
Overview
Overall Novelty Assessment
The paper addresses the text–speech understanding gap in speech-adapted LLMs through a dual-factor analysis (forgetting and cross-modal misalignment) and proposes SALAD, a sample-efficient alignment method. It resides in the 'Encoder-LLM Connection Mechanisms' leaf, which contains only three papers total, including this one. This leaf sits within the broader 'Speech-to-Text Integration Architectures' branch, indicating a relatively focused research direction concerned with how speech encoders connect to LLM decoders. The small sibling count suggests this specific framing—connector design and alignment strategies—is not yet densely populated, though neighboring leaves address related decoder-only architectures and tokenization approaches.
The taxonomy reveals that the paper's immediate neighbors explore architectural bridging (e.g., adapters, projectors) and prompt-based recognition techniques, while parallel branches investigate parameter-efficient tuning, multi-task training, and rescoring methods. The 'Speech-Augmented LLM Training and Adaptation' branch, for instance, contains work on low-rank adaptation and curriculum learning, which shares the goal of preserving text capabilities while adding a speech modality. The taxonomy's scope and exclusion notes clarify that this leaf focuses on explicit encoder-connector-decoder separation, distinguishing it from unified decoder-only models and from broader multimodal frameworks that integrate vision or audio events alongside speech.
Among fourteen candidates examined, the analysis found three refutable pairs across the paper's contributions. The first contribution (gap quantification via forgetting and misalignment) was compared against four candidates, two of which appeared to cover overlapping prior work. The second contribution (analysis of training objectives and data regimes) was compared against ten candidates, yielding one refutable match. The third contribution (the SALAD method) had no candidates matched against it, suggesting no direct prior work was identified within this limited search scope. These statistics indicate that the conceptual framing of the gap may have precedent, while the specific SALAD approach appears less directly anticipated among the top fourteen semantic matches.
Given the limited search scope of fourteen candidates, the analysis captures nearby work but cannot claim exhaustive coverage. The small leaf size and moderate refutation counts suggest the paper operates in a moderately explored niche, with some conceptual overlap on gap analysis but potentially novel methodological contributions in the SALAD framework. The taxonomy context indicates that while encoder-LLM connection is an active concern, this specific combination of distillation and active selection may represent a distinct angle within the broader alignment challenge.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors formalize and measure two factors driving the performance gap between speech-adapted and text-based LLMs: forgetting (loss of pretrained text capabilities) and cross-modal misalignment (inconsistent outputs for equivalent speech and text inputs). They demonstrate these measures strongly predict downstream language understanding performance.
The authors analyze how different training objectives (maximum likelihood vs. cross-modal distillation) and data domains (narrow vs. broad) affect forgetting and misalignment. They find that cross-modal distillation is more effective than standard maximum likelihood training for reducing both issues.
The authors propose SALAD, a two-stage training method that first applies cross-modal distillation on natural speech, then uses active learning to select text samples for synthesis based on model-detected misalignment. This approach achieves competitive performance while using over an order of magnitude less training data than existing methods.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Prompting large language models with speech recognition abilities
[13] Connecting speech encoder and large language model for ASR
Contribution Analysis
Detailed comparisons for each claimed contribution
Quantification of text–speech understanding gap via forgetting and cross-modal misalignment
The authors formalize and measure two factors driving the performance gap between speech-adapted and text-based LLMs: forgetting (loss of pretrained text capabilities) and cross-modal misalignment (inconsistent outputs for equivalent speech and text inputs). They demonstrate these measures strongly predict downstream language understanding performance.
[51] SSR: Alignment-aware modality connector for speech language models
[52] Cross-modal knowledge distillation for speech large language models
[53] Wings: Learning multimodal LLMs without text-only forgetting
[54] DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment
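As a concrete illustration of the two factors this contribution formalizes, the sketch below measures each one as a divergence between next-token output distributions. This is a hedged reconstruction, not the paper's definitions: the KL-based formulation, the function names, and the toy logits are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), averaged over a batch of next-token distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def misalignment(speech_logits, text_logits):
    """Cross-modal misalignment: divergence between the adapted model's
    outputs for a spoken utterance and for its transcript (0 = consistent)."""
    return kl_divergence(softmax(text_logits), softmax(speech_logits))

def forgetting(adapted_logits, base_logits):
    """Forgetting: divergence of the adapted model from the frozen base
    LLM on the same text-only input (0 = text capability preserved)."""
    return kl_divergence(softmax(base_logits), softmax(adapted_logits))
```

Both quantities are zero exactly when the adapted model answers identically across modalities and identically to its base model, matching the report's framing that the two factors independently widen the text-speech gap.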
Analysis of training objectives and data regimes on forgetting and misalignment
The authors analyze how different training objectives (maximum likelihood vs. cross-modal distillation) and data domains (narrow vs. broad) affect forgetting and misalignment. They find that cross-modal distillation is more effective than standard maximum likelihood training for reducing both issues.
[52] Cross-modal knowledge distillation for speech large language models
[7] SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities
[55] Decoding knowledge transfer for neural text-to-speech training
[56] Cross-modal distillation for speaker recognition
[57] Linguistic knowledge transfer learning for speech enhancement
[58] End-to-end voice conversion via cross-modal knowledge distillation for dysarthric speech reconstruction
[59] Cross-modal distillation for widely differing modalities
[60] Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition
[61] DM-Codec: Distilling multimodal representations for speech tokenization
[62] Cross-layer similarity knowledge distillation for speech enhancement
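The two training objectives this contribution compares can be contrasted in a few lines. A hedged sketch, assuming token-level losses over next-token logits; the temperature `tau` and the function names are illustrative choices, not the paper's notation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mle_loss(speech_logits, target_ids):
    """Standard maximum likelihood: cross-entropy of the speech-conditioned
    model against the reference tokens only."""
    probs = softmax(speech_logits)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked)))

def distillation_loss(speech_logits, text_teacher_logits, tau=1.0):
    """Cross-modal distillation: match the speech-conditioned student's
    distribution to the text-conditioned teacher's, token by token."""
    p = softmax(text_teacher_logits / tau)
    q = softmax(speech_logits / tau)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

Unlike maximum likelihood, which rewards only the single reference token, distillation transfers the teacher's full next-token distribution; that richer signal is one plausible mechanism behind the report's finding that distillation reduces both forgetting and misalignment.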
SALAD method combining cross-modal distillation with active data selection
The authors propose SALAD, a two-stage training method that first applies cross-modal distillation on natural speech, then uses active learning to select text samples for synthesis based on model-detected misalignment. This approach achieves competitive performance while using over an order of magnitude less training data than existing methods.
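The active-selection stage described above can be sketched as a simple ranking under a synthesis budget. This is a hedged sketch of the idea, not the paper's algorithm: the score-then-top-k scheme and the names (`select_for_synthesis`, `budget`) are assumptions.

```python
import numpy as np

def select_for_synthesis(misalignment_scores, budget):
    """Stage-2 selection in a SALAD-style loop (sketch): given a
    per-sample misalignment score computed by the model on text data,
    pick the `budget` most misaligned samples to send to TTS synthesis,
    so distillation continues only where the modalities still disagree."""
    scores = np.asarray(misalignment_scores)
    budget = min(budget, len(scores))
    top = np.argsort(scores)[::-1][:budget]  # indices of highest scores
    return sorted(top.tolist())
```

Spending the synthesis budget only on model-detected disagreements is what would make such a scheme sample-efficient: most text samples never need a spoken counterpart, consistent with the order-of-magnitude data reduction the report attributes to SALAD.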