AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Video-to-Audio Generation; Audio Generation
Abstract:

Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data (e.g., conflating acoustically distinct sounds, such as different dog barks, under coarse labels) and textual ambiguity in describing microacoustic features (e.g., "metallic clang" failing to distinguish between impact transients and resonance decay). These bottlenecks make fine-grained sound synthesis difficult under text-based control. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise, fine-grained control over generated sounds. This approach enables fine-grained sound synthesis (e.g., footsteps with distinct timbres on wood, marble, or gravel), timbre transfer (e.g., transforming a violin’s melody into the bright, piercing tone of a suona), zero-shot sound generation (e.g., creating unique weapon sound effects without training on firearm datasets), and improved audio quality. By conditioning directly on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, and remains competitive with SOTA video-to-audio methods even without audio conditioning.
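To make the audio-conditioning idea concrete, the following is a minimal, hypothetical sketch of how per-frame video features might be fused with a time-pooled reference-audio embedding before being fed to an audio generator. The module name ReferenceAudioConditioner, the feature dimensions, and the additive fusion scheme are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of reference-audio conditioning for V2A generation.
# All names, shapes, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class ReferenceAudioConditioner(nn.Module):
    """Fuses per-frame video features with a time-pooled reference-audio
    embedding, so a downstream generator sees both what happens when
    (video) and how it should sound (reference timbre)."""

    def __init__(self, video_dim=768, audio_dim=512, cond_dim=1024):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, cond_dim)
        self.audio_proj = nn.Linear(audio_dim, cond_dim)

    def forward(self, video_feats, ref_audio_feats):
        # video_feats:     (B, T_v, video_dim)  frame-aligned visual features
        # ref_audio_feats: (B, T_a, audio_dim)  reference-audio encoder output
        # Mean-pool the reference over time: keeps timbre/acoustic character,
        # discards the reference clip's own event timing (one simple way to
        # decouple "style" from "when").
        style = ref_audio_feats.mean(dim=1)          # (B, audio_dim)
        style = self.audio_proj(style).unsqueeze(1)  # (B, 1, cond_dim)
        frames = self.video_proj(video_feats)        # (B, T_v, cond_dim)
        return frames + style                        # broadcast over frames

# Usage: feed the fused sequence to a diffusion/flow audio generator.
cond = ReferenceAudioConditioner()
video = torch.randn(2, 40, 768)   # e.g., 40 video frames
ref = torch.randn(2, 200, 512)    # e.g., 200 frames of the reference clip
print(cond(video, ref).shape)     # torch.Size([2, 40, 1024])
```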

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes AC-Foley, a reference-audio-guided video-to-audio synthesis framework that uses audio signals as direct conditioning inputs to achieve fine-grained control over timbre, acoustic attributes, and sound event characteristics. According to the taxonomy, this work resides in the 'Reference-Audio-Guided Foley Synthesis' leaf under 'Audio-Conditioned Video-to-Audio Generation'. This leaf contains only two papers total, including the original work, indicating a relatively sparse and emerging research direction within the broader field of audio-conditioned synthesis.

The taxonomy reveals that AC-Foley sits within a broader ecosystem of audio-conditioned approaches. Its immediate sibling is Negative Audio Guidance, which explores steering generation away from undesired features rather than toward reference characteristics. Nearby, the 'Multimodal Controllable Audio-Video Synthesis' leaf contains frameworks that combine text, reference audio, and reference images for flexible generation. The taxonomy explicitly excludes text-only conditioning methods (housed under 'Text-Conditioned Video-to-Audio Generation') and joint audio-visual synthesis approaches, clarifying that AC-Foley's focus on direct audio conditioning distinguishes it from multimodal or language-driven alternatives.

The contribution-level analysis examined three candidate papers and identified two that appear to provide overlapping prior work, suggesting that, within this limited search scope, the core AC-Foley framework has some precedent. The analysis does not indicate which specific aspects (timbre transfer, zero-shot generation, or fine-grained control) are most affected by prior work. Given the small candidate pool of only three examined papers, this assessment reflects only the most semantically proximate literature and cannot rule out additional relevant work in adjacent research areas or under different terminology.

Based on the limited search scope, the work appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. However, the presence of two refutable candidates among three examined suggests that the core technical approach may have meaningful precedent within the immediate neighborhood of reference-audio-guided synthesis. The analysis covers top-K semantic matches and does not extend to exhaustive citation networks or alternative formulations of the problem.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 1
Contribution Candidate Papers Compared: 3
Refutable Papers: 2

Research Landscape Overview

Core task: Reference-audio-guided video-to-audio synthesis. The field centers on generating or manipulating audio tracks that align with visual content, often leveraging reference audio to guide style, timbre, or semantic characteristics. The taxonomy reveals several main branches: Audio-Conditioned Video-to-Audio Generation focuses on using reference audio signals to steer synthesis, enabling tasks like Foley generation where sound effects match both visual events and a desired acoustic style. Text-Conditioned Video-to-Audio Generation relies on language prompts to specify audio properties, while Joint Audio-Visual Generation and Editing addresses simultaneous creation or modification of both modalities. Audio-Driven Visual Speech and Motion Synthesis reverses the direction, using audio to animate faces or bodies, and Audio-Reactive Visual Synthesis explores how sound can drive abstract or artistic visual outputs.

Representative works such as Multimodal Foley[6] and Negative Audio Guidance[2] illustrate how reference audio can shape the timbral and semantic qualities of synthesized soundtracks. Within the audio-conditioned branch, a particularly active line of work explores reference-guided Foley synthesis, where the challenge is to produce realistic sound effects that respect both the timing of visual events and the acoustic character of a reference clip. AC-Foley[0] sits squarely in this cluster, emphasizing fine-grained control over timbre and event synchronization through reference audio. Nearby, Negative Audio Guidance[2] tackles a complementary problem by steering generation away from undesired acoustic features, highlighting trade-offs between positive style transfer and negative constraints. Other works like Multimodal Foley[6] integrate text and audio cues, broadening the conditioning space.

A key open question across these methods is how to balance fidelity to reference audio with flexibility for novel visual contexts, and how to handle temporal misalignment between reference and target sequences. AC-Foley[0] addresses this by proposing mechanisms that decouple style from timing, positioning it as a step toward more versatile reference-guided synthesis.
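The contrast between pulling generation toward a reference (as in AC-Foley) and pushing it away from undesired audio (as in Negative Audio Guidance) can be sketched as a classifier-free-guidance-style combination of denoiser outputs. The sketch below is a generic illustration under assumed guidance weights; the function name and the exact update rule are not either paper's published formulation.

```python
# Hypothetical positive/negative audio guidance at sampling time, in the
# spirit of classifier-free guidance. Weights and shapes are assumptions.
import torch

def guided_noise_estimate(eps_uncond, eps_pos, eps_neg, w_pos=3.0, w_neg=1.0):
    """Combine denoiser outputs: pull toward the reference-audio condition
    (eps_pos) and push away from an undesired-audio condition (eps_neg)."""
    return (eps_uncond
            + w_pos * (eps_pos - eps_uncond)
            - w_neg * (eps_neg - eps_uncond))

# Toy tensors standing in for three conditional denoiser passes.
eps_u = torch.randn(1, 8, 128)
eps_p = torch.randn(1, 8, 128)
eps_n = torch.randn(1, 8, 128)
print(guided_noise_estimate(eps_u, eps_p, eps_n).shape)  # torch.Size([1, 8, 128])
```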

Claimed Contributions

AC-Foley framework for reference-audio-guided video-to-audio synthesis

The authors introduce AC-Foley, a framework that synthesizes audio from video by leveraging a reference audio clip to guide the generation process through acoustic transfer. This enables controllable audio generation that matches both the visual content and desired acoustic characteristics.

3 retrieved papers (status: Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

AC-Foley framework for reference-audio-guided video-to-audio synthesis

