Inference · arXiv cs.AI · 9 d ago

Communication-Efficient Verifiable Attention for LLM Inference

The paper introduces VeriAttn, a communication-efficient TEE-GPU attention scheme that aims to improve both the integrity and the efficiency of LLM inference by pairing a Trusted Execution Environment (TEE) with GPU processing. VeriAttn offloads both the linear and the non-linear parts of attention computation to the GPU and uses the TEE solely for verification. On an Intel TDX platform, the reported speedups are 2.60-3.38× during prefill and 3.86-5.42× during decoding. This matters for practitioners because it targets two obstacles to deploying Transformer-based models in untrusted environments: guaranteeing computational integrity and keeping resource overhead low.
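To illustrate why "compute on the untrusted device, verify in the trusted one" can pay off, here is a minimal sketch of Freivalds' algorithm, a classic probabilistic check for a matrix product. It is not taken from the paper and is not claimed to be VeriAttn's verification method; the function names (`freivalds_check`, `matvec`, `matmul`) are hypothetical, and integer matrices are assumed so that equality tests are exact (floating-point data would need a tolerance).

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Plain matrix product, standing in for the untrusted accelerator."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def freivalds_check(A, B, C, rounds=20, rng=None):
    """Probabilistically check that A @ B == C.

    Each round costs three matrix-vector products instead of a full
    matrix-matrix product. A wrong C escapes a single round with
    probability at most 1/2, so at most 2**-rounds overall.
    """
    rng = rng or random.Random()
    n_cols = len(C[0])
    for _ in range(rounds):
        # Random 0/1 challenge vector chosen by the verifier.
        r = [rng.randint(0, 1) for _ in range(n_cols)]
        # Compare A(Br) with Cr; a mismatch proves C is wrong.
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False
    return True

# Illustrative data (assumption: small integer matrices).
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_ok = matmul(A, B)                 # honest result
C_bad = [row[:] for row in C_ok]
C_bad[0][0] += 1                    # tampered result

ok_accepted = freivalds_check(A, B, C_ok)
bad_accepted = freivalds_check(A, B, C_bad, rounds=40, rng=random.Random(0))
```

A correct product is always accepted; a tampered one is rejected except with negligible probability. The verifier's work per round is quadratic rather than cubic in the matrix dimension, which is the general reason a verifier can be much cheaper than the computation it checks.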

