Inference
Sentinel: Decoding Context Utilization via Attention Probing for Efficient LLM Context Compression
Sentinel is a lightweight sentence-level compression framework for retrieval-augmented generation (RAG). Rather than training a dedicated compressor, it decodes how a large language model (LLM) uses its context during inference by probing head-wise attention patterns. With a 0.5B proxy model, it achieves up to 5× compression while remaining competitive with methods that rely on 7B models, and it requires neither dedicated compression training nor autoregressive scoring. For practitioners, this offers an efficient way to improve context handling in LLM applications, particularly in multilingual and out-of-domain scenarios.
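To make the idea of sentence-level compression concrete, here is a minimal sketch of the selection step only. It is not Sentinel's implementation. It assumes each sentence already comes with a small feature vector (a hypothetical stand-in for head-wise attention signals), that a simple logistic probe turns those features into a relevance score, and that whitespace-separated words are an acceptable proxy for tokens. The names `probe_score` and `compress`, the weights, and the feature values are all illustrative.

```python
import math


def probe_score(features, weights, bias):
    """Logistic probe (assumed form): map one sentence's features to a relevance score in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def compress(sentences, features, weights, bias, ratio=5.0):
    """Keep the highest-scoring sentences within a budget of total_length / ratio.

    Lengths are counted in whitespace-separated words, a crude proxy for tokens.
    The top-scoring sentence is always kept, even if it alone exceeds the budget.
    Kept sentences are returned in their original order.
    """
    lengths = [len(s.split()) for s in sentences]
    budget = sum(lengths) / ratio
    scores = [probe_score(f, weights, bias) for f in features]

    # Visit sentences from most to least relevant, adding each one that still fits.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        if kept and used + lengths[i] > budget:
            continue
        kept.append(i)
        used += lengths[i]

    return [sentences[i] for i in sorted(kept)]


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "The weather was mild that year.",
        "France borders Spain and Italy.",
        "Tickets sold out quickly.",
        "The Seine runs through Paris.",
    ]
    # Hypothetical two-dimensional features, one vector per sentence.
    feats = [[0.9, 0.8], [0.1, 0.0], [0.4, 0.3], [0.0, 0.1], [0.7, 0.6]]
    print(compress(docs, feats, weights=[2.0, 1.0], bias=-1.0, ratio=2.0))
```

In a real pipeline the features would be extracted from the proxy model's attention during a forward pass, and the probe would be fitted on labeled data; both steps are outside the scope of this sketch.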
context-compression, llm