Agents
How can we assess human-agent interactions? Case studies in software agent design
The paper introduces PULSE, a framework for evaluating human-agent interactions that combines user feedback, machine learning predictions of user satisfaction, and model-generated pseudo-labels. Applied to the software engineering domain with the open-source agent OpenHands, PULSE is used to evaluate agent design decisions in case studies covering 15,000 users, and it yields confidence intervals 40% narrower than those of traditional A/B testing. The findings reveal significant discrepancies between benchmark performance and real-world user satisfaction, underscoring the need for more robust evaluation methods for LLM-powered agents.
llm, human-agent-interaction, evaluation, software-engineering
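
To illustrate why combining a small set of human satisfaction ratings with model-generated pseudo-labels can narrow a confidence interval, here is a minimal sketch. The summary does not say which estimator PULSE uses, so this swaps in a generic prediction-powered mean estimator as an assumption. The function names, the 1-to-5 rating scale, the noise levels, and the sample sizes are all hypothetical, and the data is synthetic.

```python
import math
import random
import statistics


def classical_mean_ci(labels, z=1.96):
    """Mean and CI half-width using only human-provided ratings."""
    n = len(labels)
    mean = statistics.fmean(labels)
    half_width = z * math.sqrt(statistics.variance(labels) / n)
    return mean, half_width


def pseudo_label_mean_ci(labels, preds_labeled, preds_unlabeled, z=1.96):
    """Prediction-powered estimate of the mean rating (assumed estimator).

    The mean of the pseudo-labels on unrated sessions is corrected by the
    average residual (human rating minus prediction) on rated sessions.
    """
    n = len(labels)
    big_n = len(preds_unlabeled)
    residuals = [y - p for y, p in zip(labels, preds_labeled)]
    estimate = statistics.fmean(preds_unlabeled) + statistics.fmean(residuals)
    variance = (
        statistics.variance(preds_unlabeled) / big_n
        + statistics.variance(residuals) / n
    )
    return estimate, z * math.sqrt(variance)


def simulate(seed=0, n_rated=300, n_unrated=5000):
    """Synthetic comparison; every number here is made up for illustration."""
    rng = random.Random(seed)

    def session():
        rating = rng.gauss(3.5, 1.0)               # hypothetical true satisfaction
        prediction = rating + rng.gauss(0.0, 0.4)  # hypothetical model noise
        return rating, prediction

    rated = [session() for _ in range(n_rated)]
    unrated_preds = [session()[1] for _ in range(n_unrated)]
    labels = [r for r, _ in rated]
    preds_labeled = [p for _, p in rated]

    _, classical_half = classical_mean_ci(labels)
    _, combined_half = pseudo_label_mean_ci(labels, preds_labeled, unrated_preds)
    return classical_half, combined_half


if __name__ == "__main__":
    classical_half, combined_half = simulate()
    print(f"classical half-width: {classical_half:.3f}")
    print(f"combined half-width:  {combined_half:.3f}")
```

In this sketch the interval narrows only to the extent that the predictor tracks the human ratings: the residual variance replaces the raw rating variance in the dominant term, so a weak predictor would yield little or no gain.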