Agents · arXiv cs.CL · 15 d ago

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

The paper introduces GrowthHacker, a benchmark for evaluating large language models (LLMs) and LLM-based agents on automating off-policy evaluation (OPE), which estimates how a new policy would perform in an A/B-testing setting using only previously logged data. It also presents a two-agent framework that reports a 98.1%-100% success rate on reliability and a 78% positive-outcome rate, outperforming the CrewAI, AutoGen, and Default baselines on both reliability and improvement metrics. For practitioners, the work suggests that LLM agents can automate and optimize OPE pipelines, reducing the resource-intensive manual effort that data-driven decision-making otherwise requires.

off-policy-evaluation · llm-agents · automation · relevance 0.00 · engagement 0.00
