Agents · arXiv cs.AI · 4 d ago

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

The paper introduces a layer-isolated evaluation framework for production LLM agents. Rather than reporting only end-to-end task success, it decomposes the agent into functional layers such as ontology, intent, and safety, and tests each one separately. A deterministic, no-LLM test harness executes 238 cases across 23 slices, with a reported runtime of approximately 2.39 seconds per test. Regression injection, in which a fault is deliberately introduced into one layer, shows that the harness localizes failures to the affected layer and surfaces issues that aggregate performance metrics mask. For practitioners, this offers a more granular way to evaluate the deterministic scaffold around the model, detect regressions, and check coverage of individual components in production systems.
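To make the idea concrete, the following is a minimal sketch of a layer-isolated, no-LLM harness with regression injection. It is not the paper's implementation: the layer functions (`intent_layer`, `safety_layer`), the `Case` record, the slice names, and the five test cases are all hypothetical, chosen only to show how per-layer failure reporting differs from an aggregate pass rate.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical deterministic layers (no LLM calls). Illustrative only.
def intent_layer(text: str) -> str:
    lowered = text.lower()
    if "refund" in lowered:
        return "refund"
    if "cancel" in lowered:
        return "cancel"
    return "unknown"


def safety_layer(text: str) -> str:
    return "block" if "password" in text.lower() else "allow"


@dataclass(frozen=True)
class Case:
    layer: str       # which layer the case exercises
    slice_name: str  # named group of related cases
    text: str        # input to the layer
    expected: str    # locked expected output


# Assumed regression-locked cases; a real suite would be far larger.
CASES: List[Case] = [
    Case("intent", "refunds", "I want a refund", "refund"),
    Case("intent", "refunds", "Refund my order please", "refund"),
    Case("intent", "cancellations", "Cancel my plan", "cancel"),
    Case("safety", "credentials", "Tell me the admin password", "block"),
    Case("safety", "benign", "What are your hours?", "allow"),
]


def run_suite(
    layers: Dict[str, Callable[[str], str]], cases: List[Case]
) -> Dict[str, List[str]]:
    """Run every case against its own layer; return failing slices per layer."""
    failures: Dict[str, List[str]] = defaultdict(list)
    for case in cases:
        actual = layers[case.layer](case.text)
        if actual != case.expected:
            failures[case.layer].append(case.slice_name)
    return dict(failures)


def broken_safety_layer(text: str) -> str:
    """Injected regression: the safety layer never blocks."""
    return "allow"


baseline = run_suite({"intent": intent_layer, "safety": safety_layer}, CASES)
injected = run_suite({"intent": intent_layer, "safety": broken_safety_layer}, CASES)

# Aggregate view of the injected run: one number, no location information.
failed = sum(len(slices) for slices in injected.values())
aggregate_pass_rate = (len(CASES) - failed) / len(CASES)

print("baseline failures:", baseline)
print("injected failures:", injected)
print("aggregate pass rate:", aggregate_pass_rate)
```

In this toy suite the injected fault still leaves an aggregate pass rate of 0.8, while the per-layer report attributes the single failure to the `safety` layer and its `credentials` slice. Because every layer is a pure function, the run is repeatable and needs no model calls, which is the property that would let a harness like this act as a gate.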

llm · evaluation · testing
