Inference
Maybe dumb question, but how do you serve multiple users with the full context length?
The post asks how to serve multiple users with a large language model (LLM) that has a 128k-token context length, specifically using llama.cpp. According to the post, llama.cpp currently shares the 128k context across parallel users instead of giving each user the full length, which raises the question of how to manage context when several users are served at once. For practitioners, this is a practical limit of multi-user serving: giving every user the full context may require a different configuration or an alternative serving implementation.
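The trade-off can be made concrete with some arithmetic. The sketch below is illustrative only. It assumes the total context is divided evenly among parallel slots, which is how the post describes llama.cpp's behavior, and it uses the common estimate that the KV cache stores one key and one value vector per layer per token. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are hypothetical placeholders, not figures from the post, and all function names are made up for this example.

```python
def per_slot_context(total_ctx: int, n_slots: int) -> int:
    """Context each user gets when one window is split evenly across slots."""
    return total_ctx // n_slots


def total_context_for(per_user_ctx: int, n_users: int) -> int:
    """Total context to allocate so every user gets per_user_ctx tokens."""
    return per_user_ctx * n_users


def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


if __name__ == "__main__":
    full_ctx = 128 * 1024   # 131072 tokens, the "128k" window
    users = 4               # hypothetical number of parallel users

    # Sharing one 128k window: each user sees only a fraction of it.
    print("per-user context when shared:", per_slot_context(full_ctx, users))

    # Giving each user the full window: total allocation scales with users.
    needed = total_context_for(full_ctx, users)
    print("total context for full length per user:", needed)

    # Hypothetical model dimensions (assumptions, not from the post).
    gib = kv_cache_bytes(needed, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"approx. KV cache for that allocation: {gib:.0f} GiB")
```

Under these assumptions, memory for the cache grows linearly with the number of users, which is why serving every user at full context length quickly becomes a hardware question rather than a configuration one.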
context_length, llm_serving