Entropy-Monitored Kernelized Token Distillation for Audio-Visual Compression

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Audio-Visual Learning, Multimodal Learning, Efficient Machine Learning, Knowledge Distillation, Audio-Visual Classification, Audio-Visual Segmentation
Abstract:

We propose a method for audio-visual knowledge distillation. Existing methods typically distill from latent embeddings or from model outputs. The former requires matching feature dimensions, if not identical architectures, between teacher and student models, while the latter supports any teacher-student pairing but tends to be less performant. Unlike both, we do not explicitly distill from the latent embeddings or outputs, but from the pairwise relationships between embeddings across samples for each modality; this is realized as a kernel, which is the crux of our method, ``Kernelized Token Distillation (KTD)''. Specifically, we tokenize and embed the input for a given modality and compute the Gram matrix across tokens, from which we distill. Because the audio and visual modalities afford different information for a task, we adaptively modulate distillation by measuring the entropy of each modality, yielding an Entropy-Monitored Kernelized Token Distillation (EM-KTD) scheme. Our method allows flexibility in the complexity of the kernel function used to model relationships across tokens, which are selectively distilled to ensure high-fidelity supervision for the student. We evaluate EM-KTD on VGGSound and AVS-Bench, where the student uses 94% fewer parameters than the teacher while preserving 96.9% of its performance on audio-visual event recognition and 96.5% on audio-visual segmentation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Kernelized Token Distillation (KTD) and its entropy-monitored variant (EM-KTD) for audio-visual model compression. It resides in the Token and Representation-Level Distillation leaf of the taxonomy, which currently contains only this work and no siblings. This leaf represents a relatively sparse research direction focused on compressing audio-visual representations through token-level mechanisms rather than full model parameters or dataset synthesis, distinguishing it from the more populated Audio-Visual Model Compression branch.

The taxonomy reveals several neighboring directions. Audio-Visual Model Compression via Knowledge Distillation contains multiple application-specific leaves (speech recognition, video captioning, synchronization) with 2-4 papers each, totaling around 13 works. Cross-Modal and Vision-Language Model Compression addresses broader multimodal architectures beyond audio-visual pairs. The original paper's focus on token-level pairwise relationships and entropy-based modulation diverges from these branches, which typically distill from latent embeddings or outputs in task-specific contexts rather than modeling cross-sample token relationships.

Among 30 candidates examined, none clearly refute the three core contributions. Kernelized Token Distillation examined 10 candidates with 0 refutable matches; Entropy-Monitored distillation examined 10 with 0 refutable; demonstration on audio-visual tasks examined 10 with 0 refutable. This suggests that within the limited search scope, the specific combination of Gram-matrix-based token distillation and entropy-driven adaptive modulation appears relatively unexplored. However, the search scale is modest and does not cover the full landscape of token-level or kernel-based distillation methods.

Based on top-30 semantic matches, the work appears to occupy a niche intersection of token-level representation learning and audio-visual distillation. The taxonomy structure indicates this is an emerging rather than crowded area, though the limited search scope means potentially relevant kernel-based or token-centric methods in adjacent fields (e.g., vision-language models) may not have been fully captured. The analysis reflects what was examined, not an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 27
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: audio-visual knowledge distillation for model compression. The field organizes around several complementary directions. At the data level, Audio-Visual Dataset and Data Distillation explores how to synthesize or condense training corpora for efficient learning, as seen in Audio-Visual Dataset Distillation[1]. A central branch, Audio-Visual Model Compression via Knowledge Distillation, focuses on transferring knowledge from large teacher models to compact student networks in multimodal settings, exemplified by works such as Decoupled Audio-Visual Distillation[2] and Audio Knowledge Visual Speech[3]. Token and Representation-Level Distillation examines finer-grained transfer mechanisms at intermediate layers or token embeddings. Cross-Modal and Vision-Language Model Compression extends these ideas to broader multimodal architectures, including vision-language models like those in Adaptive Matryoshka Multimodal[4] and Self-Adapting Visual-Language Edge[6]. Audio-Only Model Compression and Representation Learning addresses purely acoustic scenarios, such as Lightweight Wake Word Spotting[5] and DistilALHuBERT[14]. Finally, Surveys and Methodological Overviews provide meta-analyses and unifying frameworks across these branches.

Recent work highlights trade-offs between modality-specific versus joint distillation strategies, and between task-agnostic representation learning and task-specific compression. Entropy Kernelized Token Distillation[0] sits squarely within the Token and Representation-Level Distillation branch, emphasizing entropy-based alignment of token-level features to preserve fine-grained multimodal structure during compression. This contrasts with approaches like Decoupled Audio-Visual Distillation[2], which separates audio and visual streams before recombining them, and Audio Knowledge Visual Speech[3], which leverages cross-modal supervision from audio to guide visual speech models.
By focusing on token-level entropy kernels, Entropy Kernelized Token Distillation[0] offers a middle ground between holistic feature matching and modality-decoupled pipelines, aiming to retain rich representational detail while achieving efficient inference. Open questions remain about how such token-centric methods scale to very large multimodal transformers and whether they generalize across diverse audio-visual tasks beyond the settings explored so far.

Claimed Contributions

Kernelized Token Distillation (KTD)

KTD is a novel knowledge distillation method that distills pairwise relationships between token embeddings via kernel functions (Gram matrices) rather than directly distilling latent embeddings or outputs. This approach is architecture-agnostic, enabling flexible teacher-student pairings without requiring matching feature dimensions or architectures.
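The architecture-agnostic property follows from the shape of the Gram matrix: for N tokens it is N x N regardless of embedding dimension, so teacher and student relationship structures can be compared directly. A minimal sketch of such a loss, assuming a linear kernel over L2-normalized tokens (the paper may use richer kernel functions, and the function names here are illustrative):

```python
import numpy as np

def gram_matrix(tokens):
    """Pairwise similarities between L2-normalized token embeddings.

    tokens: (num_tokens, dim) -> (num_tokens, num_tokens)
    """
    norms = np.linalg.norm(tokens, axis=-1, keepdims=True)
    t = tokens / np.maximum(norms, 1e-8)
    return t @ t.T

def ktd_loss(teacher_tokens, student_tokens):
    """Match token-token relationship structure rather than raw embeddings.

    Teacher and student embedding dims may differ freely; only the token
    count must agree, since both Gram matrices are (num_tokens, num_tokens).
    """
    g_t = gram_matrix(teacher_tokens)
    g_s = gram_matrix(student_tokens)
    return float(np.mean((g_s - g_t) ** 2))

# Example: teacher dim 768, student dim 192 -- no projection layer needed.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(16, 768))
student = rng.normal(size=(16, 192))
loss = ktd_loss(teacher, student)
```

Swapping the linear kernel for an RBF or polynomial kernel changes only `gram_matrix`, which is one way the claimed "flexibility in kernel complexity" could be realized.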

10 retrieved papers
Entropy-Monitored distillation scheme (EM-KTD)

An adaptive distillation strategy that measures the entropy of each modality's output distribution to selectively weight distillation contributions. This allows the student model to prioritize informative modalities while suppressing uninformative ones, ensuring high-fidelity supervision signals during training.
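One plausible reading of this scheme: a low-entropy (confident) modality output is treated as more informative and receives a larger distillation weight. The sketch below, a hypothetical instantiation using a softmax over negative entropies (the paper's exact modulation rule may differ), illustrates the idea:

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy (nats) of a categorical distribution."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def modality_weights(audio_probs, visual_probs):
    """Weight each modality's distillation term by its confidence.

    Lower entropy -> more peaked, informative prediction -> larger weight.
    The softmax over negative entropies is an illustrative choice.
    """
    h = np.array([entropy(audio_probs), entropy(visual_probs)])
    w = np.exp(-h)
    return w / w.sum()  # (w_audio, w_visual), sums to 1

# A confident audio prediction outweighs a near-uniform visual one.
audio = np.array([0.97, 0.01, 0.01, 0.01])
visual = np.array([0.25, 0.25, 0.25, 0.25])
w_audio, w_visual = modality_weights(audio, visual)
```

These weights could then scale the per-modality KTD losses, letting the student lean on whichever modality carries more task-relevant signal for a given sample.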

10 retrieved papers
Demonstration of EM-KTD on audio-visual tasks

The authors evaluate EM-KTD on VGGSound and AVS-Bench datasets for audio-visual event classification and segmentation tasks, achieving state-of-the-art performance while using only 6% of the teacher model parameters and preserving 96.9% of teacher performance.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Kernelized Token Distillation (KTD)


Contribution

Entropy-Monitored distillation scheme (EM-KTD)


Contribution

Demonstration of EM-KTD on audio-visual tasks
