Coding
Co-Scraper: Query-Aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction
Co-Scraper is a proposed two-stage framework for automated web data extraction that combines query-aware DOM pruning with the induction of stable extraction strategies. Built on a fine-tuned Qwen3-8B model, it reports a state-of-the-art F1 score of 94.78% and a reuse success rate of 90.39% on the SWDE test set. For practitioners, this means a robust and efficient way to generate scrapers that can be reused across diverse web pages while coping with the complexity of hierarchical HTML documents.
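The two stages can be illustrated with a minimal sketch. This is not Co-Scraper's implementation: simple keyword overlap stands in for the fine-tuned model, and every name here (`collect_nodes`, `prune`, `synthesize_rule`, `apply_rule`, the sample pages) is hypothetical. It only shows the general shape of the idea: prune the DOM down to query-relevant nodes, derive a path rule from one page, then reuse that rule on another page with the same template.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextNodeCollector(HTMLParser):
    """Collect (path, text) pairs; a path looks like 'html/body/span.price'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def collect_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def prune(nodes, query, keep=3):
    """Stage 1 (assumed stand-in): keep nodes whose path or text shares a query term."""
    terms = set(query.lower().split())
    scored = []
    for path, text in nodes:
        raw = f"{path} {text}".lower()
        for sep in "/.-":
            raw = raw.replace(sep, " ")
        score = len(terms & set(raw.split()))
        if score:
            scored.append((score, path, text))
    scored.sort(key=lambda item: -item[0])  # stable: ties keep document order
    return [(path, text) for _, path, text in scored[:keep]]


def synthesize_rule(html, query):
    """Stage 2 (assumed stand-in): use the best surviving path as the reusable rule."""
    kept = prune(collect_nodes(html), query)
    return kept[0][0] if kept else None


def apply_rule(html, rule):
    """Reuse a rule on another page without re-running the pruning step."""
    return [text for path, text in collect_nodes(html) if path == rule]


page_a = ('<html><body><div class="nav">Home</div>'
          '<h1 class="title">Blue Kettle</h1>'
          '<span class="price">$19.99</span></body></html>')
page_b = ('<html><body><div class="nav">Home</div>'
          '<h1 class="title">Red Toaster</h1>'
          '<span class="price">$34.50</span></body></html>')

rule = synthesize_rule(page_a, "price")
print(rule)
print(apply_rule(page_b, rule))
```

The rule is derived once from `page_a` and then applied to `page_b` unchanged, which is the kind of reuse the reuse success rate above measures. A real system would need a far more robust relevance signal and rule format than this exact-path match.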
web-scraping, data-extraction, scraper