Training
Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts
The paper frames embedding model routing in recommendation systems as an adversarial contextual linear bandit problem with low-rank experts. It introduces the Hypentropy Policy Gradient (HPG) algorithm, which achieves a linearized policy regret of $\tilde{\mathcal{O}}(s\sqrt{MT})$ while accounting for incomplete information and structural misspecification. For practitioners, it offers a structured and efficient method for dynamically routing queries across multiple models, with the aim of improving recommendation performance in realistic settings.
embedding, bandits, online learning