Research
Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference
The paper presents a statistical framework for using large language models (LLMs) as surrogates in A/B testing, asking when treatment effects estimated from LLM outcomes accurately reflect those that would be estimated from human subjects. It notes that calibrating LLM outcomes to human outcomes can identify average treatment effects under weaker conditions than distributional equivalence, but that the stochastic nature of LLM outputs introduces bias and variance, both of which can be mitigated by averaging multiple draws. The work emphasizes that human experiments remain necessary for validating novel interventions, while providing diagnostics for assessing how well LLM surrogates perform in historical settings.
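The point about averaging multiple draws can be illustrated with a small simulation. This is only a sketch under assumptions that are not taken from the paper: each unit has a latent score, the human outcome and every LLM draw are that score plus independent Gaussian noise, and calibration is a simple least-squares regression of the human outcome on the per-unit mean of the LLM draws. The function names, noise levels, and sample sizes are all hypothetical. Under this classical measurement-error setup, noise in a single draw pulls the calibration slope toward zero, and averaging k draws shrinks that attenuation.

```python
import random
import statistics


def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    mx = statistics.fmean(x)
    my = statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def calibration_slope(n_units, k_draws, noise_sd, rng):
    """Slope of human outcome on the mean of k noisy LLM draws per unit.

    Assumed toy model (not the paper's): both outcomes equal a shared
    latent score plus independent Gaussian noise, so the slope on the
    noise-free latent score would be 1.
    """
    latent = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    human = [s + rng.gauss(0.0, 0.5) for s in latent]
    llm_mean = [
        statistics.fmean(s + rng.gauss(0.0, noise_sd) for _ in range(k_draws))
        for s in latent
    ]
    return ols_slope(llm_mean, human)


rng = random.Random(0)  # fixed seed so the run is repeatable
slope_single = calibration_slope(5000, 1, 1.0, rng)
slope_averaged = calibration_slope(5000, 16, 1.0, rng)
print("slope with 1 draw per unit:  ", round(slope_single, 3))
print("slope with 16 draws per unit:", round(slope_averaged, 3))
```

In this toy model the slope should sit near 1 / (1 + noise_sd**2 / k), so roughly 0.5 with one draw and roughly 0.94 with sixteen. Whether the paper's estimators behave the same way depends on its actual assumptions, which this listing does not spell out.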
ab-testing, llm, framework, causal-inference