Research
The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
The Stanford EDGAR Filings Dataset (SEFD) is an open resource for training large language models (LLMs) on financial data. It consists of layout-faithful reconstructions of SEC filings in MultiMarkdown format. The initial public snapshot, SEFD-v1, contains 152 billion tokens, and a larger archive of 18.5 million filings is estimated at 550 billion tokens. The reconstruction is designed to be token-efficient, and the dataset has minimal overlap with existing corpora. SEFD is intended to support advanced financial reasoning and includes two benchmarks aimed at practitioners developing LLMs for the financial domain: EDGAR-Forecast for numerical forecasting and EDGAR-OCR for complex table transcription.
financial data, pretraining, llm