Research
Dual Dimensionality for Local and Global Attention
The paper introduces Distance-Adaptive Representation (DAR), an approach for decoder-only Transformers that makes attention dimensionality asymmetric: local tokens use full-dimensional representations, while distant tokens use reduced-dimensional ones (e.g., 1/4 of the original size). DAR was tested on models ranging from 70M to 410M parameters and performed comparably to full-dimensional baselines, which points to room for allocating representational capacity by distance in attention architectures. The approach could also reduce KV cache usage during inference, giving practitioners a way to improve efficiency while maintaining performance.
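To make the idea concrete, here is a minimal NumPy sketch of attention with a distance-dependent width, plus a rough count of cached values. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how the reduced representation is obtained, so the sketch simply truncates queries and keys to their first `reduced_dim` dimensions, and it assumes both keys and values shrink in the cache. The function names, the `window` parameter, and the example sizes are all hypothetical.

```python
import numpy as np


def dar_attention_weights(q, k, window, reduced_dim):
    """Causal attention weights with a distance-dependent query/key width.

    q, k: arrays of shape (seq_len, d) for a single head.
    window: keys at distance < window from the query use all d dimensions.
    reduced_dim: farther keys use only the first reduced_dim dimensions
        (plain truncation is an assumption of this sketch).
    """
    seq_len, d = q.shape

    # Scores from the full-width vectors, scaled by sqrt(d).
    full = (q @ k.T) / np.sqrt(d)
    # Scores from the truncated vectors, scaled by sqrt(reduced_dim).
    low = (q[:, :reduced_dim] @ k[:, :reduced_dim].T) / np.sqrt(reduced_dim)

    idx = np.arange(seq_len)
    dist = idx[:, None] - idx[None, :]  # query position minus key position
    causal = dist >= 0                  # a query may only see earlier keys
    local = causal & (dist < window)    # nearby keys get the full width

    scores = np.where(local, full, low)
    scores = np.where(causal, scores, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def kv_cache_floats(seq_len, d, window, reduced_dim):
    """Cached key and value entries per head and layer for one sequence,
    assuming distant positions keep only reduced_dim dimensions."""
    n_local = min(seq_len, window)
    n_distant = seq_len - n_local
    return 2 * (n_local * d + n_distant * reduced_dim)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 16))
    k = rng.standard_normal((8, 16))
    w = dar_attention_weights(q, k, window=3, reduced_dim=4)
    print(w.shape)  # (8, 8)

    # Hypothetical sizes: 1024 tokens, head width 64, 128-token local window.
    reduced = kv_cache_floats(1024, 64, window=128, reduced_dim=16)
    full = kv_cache_floats(1024, 64, window=1024, reduced_dim=16)
    print(reduced, full, reduced / full)  # 45056 131072 0.34375
```

With these illustrative sizes, the cache holds about a third of the values a full-dimensional cache would. The actual saving depends on the window size and sequence length, neither of which the summary specifies.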
transformers, attention, local, global