Research · arXiv cs.CL · 16 d ago

Characterizing Narrative Content in Web-scale LLM Pretraining Data

The study characterizes narrative content in Dolma, a 3-trillion-token open pretraining corpus, using a framework that operationalizes narrative elements along 11 dimensions. The authors fine-tune NarraBERT, a RoBERTa-based model, to predict narrative features and apply it to 3 million passages to build a new dataset, NarraDolma. The work shows that narrative structure in large-scale data is measurable and that narrative qualities are unevenly distributed across sources, which is relevant to practitioners weighing how data composition affects narrative reasoning in LLMs.

narrative · LLM · pretraining
