ai-digest.dev
Agents · arXiv cs.AI · 21 h ago

How can we assess human-agent interactions? Case studies in software agent design

The paper introduces PULSE, a framework for evaluating human-agent interactions that combines direct user feedback, machine learning predictions of user satisfaction, and model-generated pseudo-labels. The authors apply it to software engineering using the open-source agent OpenHands, evaluating agent design decisions on data from 15,000 users and reporting confidence intervals 40% narrower than those from traditional A/B testing. The findings show significant discrepancies between benchmark performance and real-world user satisfaction, which the authors take as evidence that LLM-powered agents need more robust evaluation methods.
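
The summary does not say how PULSE combines human ratings with pseudo-labels. As a rough illustration of why such a combination can narrow a confidence interval, the sketch below assumes a prediction-powered-inference-style estimator: average the model's pseudo-labels over many unrated sessions, then correct for model bias using the smaller set of sessions that also have a human rating. The function names, the simulated ratings, and the noise levels are all hypothetical and are not taken from the paper.

```python
import math
import random
from statistics import mean, variance

Z_95 = 1.96  # normal critical value for a two-sided 95% interval


def classical_estimate(ratings):
    """Mean and 95% CI half-width from human ratings alone."""
    half_width = Z_95 * math.sqrt(variance(ratings) / len(ratings))
    return mean(ratings), half_width


def combined_estimate(ratings, preds_labeled, preds_unlabeled):
    """Assumed estimator (not confirmed as the paper's method):
    pseudo-label mean on unrated sessions plus a bias correction
    measured on the sessions that have both a rating and a prediction."""
    residuals = [y - p for y, p in zip(ratings, preds_labeled)]
    estimate = mean(preds_unlabeled) + mean(residuals)
    var = (variance(preds_unlabeled) / len(preds_unlabeled)
           + variance(residuals) / len(residuals))
    return estimate, Z_95 * math.sqrt(var)


def simulate(n_labeled=200, n_unlabeled=5000, seed=0):
    """Hypothetical data: true satisfaction plus a noisy model prediction."""
    rng = random.Random(seed)

    def session():
        y = rng.gauss(3.5, 1.0)           # simulated human rating
        return y, y + rng.gauss(0.0, 0.3)  # simulated model prediction

    labeled = [session() for _ in range(n_labeled)]
    preds_unlabeled = [session()[1] for _ in range(n_unlabeled)]
    ratings = [y for y, _ in labeled]
    preds_labeled = [p for _, p in labeled]
    return ratings, preds_labeled, preds_unlabeled


ratings, preds_labeled, preds_unlabeled = simulate()
print("ratings only:", classical_estimate(ratings))
print("with pseudo-labels:", combined_estimate(ratings, preds_labeled, preds_unlabeled))
```

How much the interval narrows in this sketch depends on how closely the predictions track the human ratings; the 40% figure above is the paper's reported result and is not something this simulation reproduces.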

llm · human-agent-interaction · evaluation · software-engineering
relevance 0.00 · engagement 0.00