Inference
How Inference Compute Shapes Frontier LLM Evaluation
The paper evaluates 12 frontier language models on seven challenging benchmarks to measure how inference compute affects performance. The key findings are that larger token budgets significantly improve performance, and that fixed-budget evaluations may increasingly underestimate capabilities as models advance. The authors suggest that evaluations report performance as a function of inference compute and state their protocol choices explicitly, so that results better reflect model capabilities, particularly in critical applications.
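To make "performance as a function of inference compute" concrete, here is a minimal Python sketch that turns per-item results into an accuracy-versus-budget curve. It is an illustration only, not the paper's protocol: the function name `accuracy_by_budget`, the token counts, and the budget grid are all hypothetical. It also assumes that an item solved with `t` tokens counts as solved under any budget of at least `t`.

```python
from typing import Optional, Sequence


def accuracy_by_budget(
    tokens_to_solve: Sequence[Optional[int]],
    budgets: Sequence[int],
) -> dict[int, float]:
    """Return the fraction of items solved within each token budget.

    tokens_to_solve[i] is the number of tokens the model used when it
    solved item i, or None if it never solved that item.
    Assumption (for illustration): an item solved with t tokens counts
    as solved under every budget >= t.
    """
    n = len(tokens_to_solve)
    if n == 0:
        raise ValueError("need at least one evaluated item")
    return {
        budget: sum(
            1 for t in tokens_to_solve if t is not None and t <= budget
        ) / n
        for budget in budgets
    }


# Hypothetical per-item results for one model on one benchmark.
runs = [1200, 3500, None, 9000, 15000, 700, None, 42000]
curve = accuracy_by_budget(runs, budgets=[1000, 4000, 16000, 64000])

for budget, acc in curve.items():
    print(f"{budget:>6} tokens: accuracy {acc:.3f}")
```

With this made-up data, a single evaluation capped at 4,000 tokens would report 0.375, while the same model reaches 0.750 at 64,000 tokens. A fixed-budget number hides exactly this kind of gap, which is why it helps to report the whole curve together with the budget grid used.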
evaluation, compute, language models