Multimodal
AuRA: Internalizing Audio Understanding into LLMs as LoRA
AuRA introduces a novel method for integrating audio understanding directly into large language models (LLMs) via a lightweight audio embedding layer and layer-wise distillation from an ASR encoder to a LoRA-adapted LLM. This approach allows for tighter speech-language joint modeling and efficient parallel inference, outperforming traditional cascaded systems and large-scale multimodal models on various benchmarks. Practitioners can leverage AuRA to enhance LLM capabilities with audio inputs without incurring the costs of extensive multimodal training.
llmaudio understandinglora