Multimodal · arXiv cs.AI · 15 d ago

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

The paper presents a hybrid two-stage diffusion transformer for instruction-guided audio editing, trained with rectified flow matching to improve editing quality while remaining efficient. The model applies joint attention over audio and text tokens: a low-resolution stage first establishes coarse semantic alignment, and a high-resolution stage then refines the details. This design targets the limitations of existing convolutional U-Net approaches. The authors report improvements on edits that involve overlapping audio events and complex instructions, which makes the approach relevant to practitioners who need efficient, precise audio content manipulation.
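To make the rectified flow idea concrete, here is a minimal sketch of the general technique, not the paper's implementation. It assumes the common convention that t = 0 is noise and t = 1 is the target sample, that training regresses a velocity along the straight line between them, and that sampling uses plain Euler steps. The function names (`rf_training_pair`, `rf_loss`, `euler_sample`) and the toy array shapes are hypothetical; the summary does not say how the paper conditions on the instruction or on the source audio, so that is left out.

```python
import numpy as np


def rf_training_pair(x0, x1, t):
    """Build one rectified-flow training example.

    x0: noise samples, shape (batch, dim)
    x1: target (edited) latents, shape (batch, dim)
    t:  times in [0, 1], scalar or shape (batch,)
    Returns the point on the straight path and the velocity target.
    """
    # Reshape t so it broadcasts over the feature dimension.
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1   # linear interpolation between noise and data
    v_target = x1 - x0              # constant velocity along the straight path
    return x_t, v_target


def rf_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=(4, 16))    # stand-in for Gaussian noise
    target = rng.normal(size=(4, 16))   # stand-in for edited audio latents

    x_t, v = rf_training_pair(noise, target, t=rng.uniform(size=4))
    print("loss of a zero predictor:", rf_loss(np.zeros_like(v), v))

    # With the exact (oracle) velocity, Euler integration lands on the target.
    out = euler_sample(lambda x, t: target - noise, noise)
    print("max abs error:", np.abs(out - target).max())
```

In a real model, `velocity_fn` would be the trained network (here, the transformer with joint audio-text attention), and the straight-line paths are what allow sampling with comparatively few integration steps.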

audio editing · diffusion models · instruction-guided
