Agents · arXiv cs.AI · 47 d ago

How can we assess human-agent interactions? Case studies in software agent design

The paper introduces PULSE, a framework for evaluating human-agent interactions that collects user feedback, uses machine learning to predict user satisfaction, and combines the human ratings with the resulting model-generated pseudo-labels. Implemented in the software engineering domain on the open-source agent OpenHands, PULSE evaluates agent design decisions using data from 15,000 users and produces confidence intervals 40% narrower than those of traditional A/B testing. The findings reveal significant discrepancies between benchmark performance and real-world user satisfaction, underscoring the need for more robust evaluation methods for LLM-powered agents.
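To illustrate why combining a small set of human ratings with model pseudo-labels can tighten a confidence interval, here is a minimal sketch using a prediction-powered-style mean estimator on synthetic data. This is an assumption for illustration only: the summary does not state which estimator PULSE uses, and the function names, sample sizes, and data-generating process below are hypothetical.

```python
import math
import random
import statistics


def human_only_estimate(human_labels, z=1.96):
    """Classic estimate: mean of human ratings with a normal-approximation CI."""
    n = len(human_labels)
    half_width = z * math.sqrt(statistics.variance(human_labels) / n)
    return statistics.fmean(human_labels), half_width


def combined_estimate(human_labels, preds_on_labeled, preds_on_unlabeled, z=1.96):
    """Hypothetical combined estimate (prediction-powered style, an assumption).

    estimate = mean(pseudo-labels on unrated users)
             + mean(human rating - prediction on rated users)
    The second term corrects the bias of the satisfaction model.
    """
    n = len(human_labels)
    big_n = len(preds_on_unlabeled)
    residuals = [y - f for y, f in zip(human_labels, preds_on_labeled)]
    estimate = statistics.fmean(preds_on_unlabeled) + statistics.fmean(residuals)
    variance = (statistics.variance(preds_on_unlabeled) / big_n
                + statistics.variance(residuals) / n)
    return estimate, z * math.sqrt(variance)


def demo(seed=0, n_rated=500, n_unrated=15000):
    """Synthetic data: each user has a latent satisfaction probability."""
    rng = random.Random(seed)
    # Rated users: we observe a binary human rating and a model prediction.
    latent_rated = [rng.random() for _ in range(n_rated)]
    human = [1.0 if rng.random() < p else 0.0 for p in latent_rated]
    preds_rated = list(latent_rated)  # assume a well-calibrated model
    # Unrated users: only the model prediction (pseudo-label) is available.
    preds_unrated = [rng.random() for _ in range(n_unrated)]
    return (human_only_estimate(human),
            combined_estimate(human, preds_rated, preds_unrated))


if __name__ == "__main__":
    (est_h, hw_h), (est_c, hw_c) = demo()
    print(f"human only: {est_h:.3f} +/- {hw_h:.3f}")
    print(f"combined:   {est_c:.3f} +/- {hw_c:.3f}")
```

In this sketch the interval narrows only to the extent that the predictions track the human ratings; a poorly calibrated model would shrink the gain, and the size of the improvement here says nothing about the 40% figure reported for PULSE.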

llm · human-agent-interaction · evaluation · software-engineering · relevance 0.70 · engagement 0.00
