Multimodal · arXiv cs.AI · 21 h ago

AuRA: Internalizing Audio Understanding into LLMs as LoRA

AuRA integrates audio understanding directly into a large language model (LLM) using a lightweight audio embedding layer and layer-wise distillation from an ASR encoder into a LoRA-adapted LLM. The approach enables tighter joint speech-language modeling and efficient parallel inference, and it reportedly outperforms traditional cascaded systems and large-scale multimodal models on several benchmarks. Practitioners can use AuRA to add audio input to an LLM without the cost of extensive multimodal training.
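
To make the two ingredients named in the summary more concrete, here is a minimal sketch in plain Python of a LoRA-adapted linear layer and a layer-wise distillation loss. It is an illustration under stated assumptions, not AuRA's actual implementation: the per-layer mean-squared-error objective, the one-to-one layer pairing, the matching hidden sizes, the random "teacher" states, and every name and dimension (`lora_linear`, `layerwise_distill_loss`, `embed`, `rank`, and so on) are hypothetical and do not come from the paper.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def rand_matrix(rng, rows, cols, scale=0.1):
    """Gaussian random matrix, used here only to fabricate toy inputs and weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def lora_linear(x, W, A, B, alpha=8.0):
    """Standard LoRA forward pass: y = x (W + (alpha / r) * B A)^T.

    W (out x in) is the frozen base weight; A (r x in) and B (out x r)
    are the small trainable factors of rank r.
    """
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul(x, transpose(W_eff))

def mse(S, T):
    """Mean squared error between two equally shaped matrices."""
    n = len(S) * len(S[0])
    return sum((s - t) ** 2
               for s_row, t_row in zip(S, T)
               for s, t in zip(s_row, t_row)) / n

def layerwise_distill_loss(student_states, teacher_states):
    """Average of per-layer MSE between paired student and teacher hidden states.

    ASSUMPTION: an MSE objective with one-to-one layer pairing; the paper's
    actual loss and layer mapping may differ.
    """
    losses = [mse(s, t) for s, t in zip(student_states, teacher_states)]
    return sum(losses) / len(losses)

# Toy demo with hypothetical sizes.
rng = random.Random(0)
frames, feat_dim, d_model, rank, n_layers = 6, 5, 4, 2, 3

audio_feats = rand_matrix(rng, frames, feat_dim, scale=1.0)  # stand-in for acoustic frames
embed = rand_matrix(rng, feat_dim, d_model)                  # "lightweight audio embedding layer"
h = matmul(audio_feats, embed)                               # frames x d_model

student_states = []
for _ in range(n_layers):
    W = rand_matrix(rng, d_model, d_model)       # frozen base weight
    A = rand_matrix(rng, rank, d_model)          # trainable LoRA factor
    B = [[0.0] * rank for _ in range(d_model)]   # common LoRA init: B = 0, so the update starts at zero
    h = lora_linear(h, W, A, B)
    student_states.append(h)

# Stand-in for ASR-encoder hidden states (random here; ASSUMED to share d_model).
teacher_states = [rand_matrix(rng, frames, d_model, scale=1.0) for _ in range(n_layers)]

loss = layerwise_distill_loss(student_states, teacher_states)
print(f"layer-wise distillation loss: {loss:.4f}")
```

In a real training loop only the embedding layer and the LoRA factors `A` and `B` would receive gradients while `W` stays frozen, which is what keeps this kind of adaptation cheap relative to full multimodal training.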

llm · audio understanding · lora
