AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer
Overview
Overall Novelty Assessment
The paper proposes AC-Foley, a reference-audio-guided video-to-audio synthesis framework that uses audio signals as direct conditioning inputs to achieve fine-grained control over timbre, acoustic attributes, and sound event characteristics. According to the taxonomy, this work resides in the 'Reference-Audio-Guided Foley Synthesis' leaf under 'Audio-Conditioned Video-to-Audio Generation'. This leaf contains only two papers in total, AC-Foley included, indicating a relatively sparse and emerging research direction within the broader field of audio-conditioned synthesis.
The taxonomy reveals that AC-Foley sits within a broader ecosystem of audio-conditioned approaches. Its immediate sibling is Negative Audio Guidance, which explores steering generation away from undesired features rather than toward reference characteristics. Nearby, the 'Multimodal Controllable Audio-Video Synthesis' leaf contains frameworks that combine text, reference audio, and reference images for flexible generation. The taxonomy explicitly excludes text-only conditioning methods (housed under 'Text-Conditioned Video-to-Audio Generation') and joint audio-visual synthesis approaches, clarifying that AC-Foley's focus on direct audio conditioning distinguishes it from multimodal or language-driven alternatives.
The contribution-level analysis examined three candidate papers and identified two that appear to overlap with the claimed contribution, suggesting that, within this limited search scope, the core AC-Foley framework has some precedent. The analysis does not indicate which specific aspects (timbre transfer, zero-shot generation, or fine-grained control) are most affected by prior work. Because the candidate pool is small (only three papers), this assessment reflects only the most semantically proximate literature and cannot rule out additional relevant work in adjacent research areas or under different terminology.
Based on the limited search scope, the work appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. However, two of the three candidates examined were flagged as potentially refuting the claimed contribution, which suggests that the core technical approach may have meaningful precedent within the immediate neighborhood of reference-audio-guided synthesis. The analysis covers only top-K semantic matches and does not extend to exhaustive citation networks or alternative formulations of the problem.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce AC-Foley, a framework that synthesizes audio from video by leveraging a reference audio clip to guide the generation process through acoustic transfer. This enables controllable audio generation that matches both the visual content and desired acoustic characteristics.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
Contribution Analysis
Detailed comparisons for each claimed contribution
AC-Foley framework for reference-audio-guided video-to-audio synthesis
The authors introduce AC-Foley, a framework that synthesizes audio from video by leveraging a reference audio clip to guide the generation process through acoustic transfer. This enables controllable audio generation that matches both the visual content and desired acoustic characteristics.