ai-digest.dev
Coding · arXiv cs.AI · 9 d ago

Co-Scraper: Query-Aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction

Co-Scraper is a proposed two-stage framework for automated web data extraction that combines query-aware DOM pruning with the induction of a stable extraction strategy. Using a fine-tuned Qwen3-8B model, it reports a state-of-the-art F1 score of 94.78% and a reuse success rate of 90.39% on the SWDE test set. For practitioners, the appeal is a lightweight way to generate scrapers that can be reused across diverse web pages despite the complexity of hierarchical HTML documents.
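To make the two ideas concrete, here is a minimal sketch in Python. It is not the paper's implementation: a toy token-overlap heuristic stands in for the fine-tuned model, and every name in it (`prune`, `apply_rule`, `TextPathParser`, `PAGE_A`, `PAGE_B`) is hypothetical. It only illustrates the general shape of pruning a page down to query-relevant nodes and then reusing an induced tag/class path on a second page.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def tokens(s):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


class TextPathParser(HTMLParser):
    """Collect (path, text) pairs; each path step is 'tag' or 'tag.class'."""

    def __init__(self):
        super().__init__()
        self.stack = []  # open elements from the root down
        self.nodes = []  # (path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text
        cls = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def text_nodes(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def prune(html, query):
    """Toy pruning: keep text nodes whose own element or text mentions a query token."""
    q = tokens(query)
    return [(path, text) for path, text in text_nodes(html)
            if path and q & (tokens(path[-1]) | tokens(text))]


def apply_rule(html, rule):
    """Toy reuse: apply a path induced on one page to extract from another."""
    return [text for path, text in text_nodes(html) if path == rule]


# Two made-up pages that share a template but differ in content.
PAGE_A = (
    '<html><body><div class="nav"><a>Home</a></div>'
    '<div class="product"><h1 class="title">Blue Kettle</h1>'
    '<span class="price">$24.99</span></div>'
    '<div class="footer">Contact us</div></body></html>'
)
PAGE_B = (
    '<html><body><div class="nav"><a>Home</a><a>Deals</a></div>'
    '<div class="product"><h1 class="title">Red Teapot</h1>'
    '<span class="price">$12.50</span></div></body></html>'
)

pruned = prune(PAGE_A, "price")      # only the price node survives pruning
rule = pruned[0][0]                  # its tag/class path becomes the reusable rule
print(pruned)
print(apply_rule(PAGE_B, rule))      # the same rule extracts from the second page
```

The exact-path rule here is deliberately brittle; a scraper meant to survive template changes would need a more tolerant selector, which is presumably where a learned strategy earns its keep.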

web-scraping · data-extraction · scraper