Coding
Source-Grounded Data Generation for Text-to-JSON Learning
The article introduces STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline designed to create structured JSON outputs from unstructured text using large language models (LLMs). Evaluations on the STAGE-Eval benchmark, which includes 851 examples, demonstrate significant performance improvements for the Qwen3-4B model, with exact match rates increasing from 31.37% to 74.27% and value accuracy rising from 45.46% to 90.69%. This advancement is crucial for practitioners as it enhances the reliability and scalability of training data for text-to-JSON tasks, facilitating better integration of unstructured data into automated systems.
text-to-jsondata generationllm