CodingarXiv cs.CL — 12 d ago

A large-scale pipeline for LLM-assisted corpus annotation: variation and change in the English consider construction

The article presents a scalable pipeline for automating grammatical annotation in large natural language corpora using large language models (LLMs), specifically through a four-phase workflow that includes prompt engineering and automated batch processing. The authors demonstrate the pipeline by annotating 143,933 'consider' concordance lines from the Corpus of Historical American English (COHA) using the OpenAI API, achieving over 98% accuracy in under 60 hours. This approach highlights the potential for LLMs to facilitate large-scale data preparation tasks with minimal human intervention, opening new avenues for research while emphasizing the need to consider implementation costs and ethical implications.

corpus-annotationllmautomationrelevance 0.00 · engagement 0.00

Read at source ↗← all news