LongWebBench: Evaluating Structural and Functional Webpage Generation in Long-Horizon Settings
LongWebBench is a benchmark for evaluating long-horizon webpage generation from both structural and functional perspectives. It comprises 490 real-world long webpages for assessing structural fidelity and 507 goal-oriented interaction tasks spanning 129 webpages for assessing functionality. Structural coherence is scored with a multi-dimensional metric based on vision-language models (VLMs), and functional correctness is verified with a DOM-augmented, agent-based pipeline. The evaluation reveals that structural fidelity declines as webpage length increases and that visually plausible generations often fail to support multi-step interactions. These findings indicate that evaluation must go beyond visual similarity to include executable interactions, a point of direct relevance to practitioners developing VLMs for complex webpage generation.
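To illustrate why support for multi-step interactions is a stricter criterion than visual plausibility, the following minimal sketch aggregates functional results under one assumption: a goal-oriented task counts as successful only if every one of its interaction steps is verified. The names `TaskResult`, `step_passed`, `success_rate`, and `per_page_success` are hypothetical and are not taken from LongWebBench; the benchmark's actual pipeline and scoring rules may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one goal-oriented task on one generated webpage (hypothetical schema)."""
    page_id: str
    step_passed: list[bool]  # one entry per interaction step, in order

    @property
    def success(self) -> bool:
        # Assumption: a task succeeds only if it has steps and all of them pass.
        return bool(self.step_passed) and all(self.step_passed)


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that succeeded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.success for r in results) / len(results)


def per_page_success(results: list[TaskResult]) -> dict[str, float]:
    """Task success rate grouped by webpage."""
    grouped: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        grouped[r.page_id].append(r)
    return {page: success_rate(rs) for page, rs in grouped.items()}


if __name__ == "__main__":
    # Illustrative data only, not benchmark results.
    demo = [
        TaskResult("page_a", [True, True]),
        TaskResult("page_a", [True, False]),  # fails at the second step
        TaskResult("page_b", [True]),
    ]
    print(success_rate(demo))
    print(per_page_success(demo))
```

Under this all-steps-must-pass assumption, a page whose first interaction works but whose second does not is counted as a failure for that task, which is one way a visually convincing page can still score poorly on functionality.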