Inference · arXiv cs.CL · 8 d ago

Benchmarking Large Language Models for Safety Data Extraction

This study benchmarks four Large Language Models (LLMs) for automated extraction of structured information from Safety Data Sheets (SDS): Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B. Each model is evaluated under zero-shot, few-shot, and chain-of-thought prompting. Text-based extraction consistently outperformed multimodal approaches, and Gemini 1.5 Pro achieved the highest accuracy at 84%, but no model met the 90% accuracy threshold needed for reliable industrial deployment. The shortfall points to the current limitations of general-purpose LLMs in safety-critical applications and to the need for task-specific fine-tuning and improved training methodologies.
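To make the accuracy comparison concrete, here is a minimal sketch of how extracted SDS fields might be scored against a reference set and checked against the 90% threshold. The summary does not say how the study computes accuracy, so the exact-match field metric, the field names, and the toy records below are all assumptions for illustration only.

```python
from typing import Dict, List

DEPLOYMENT_THRESHOLD = 0.90  # threshold cited in the summary


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(value.lower().split())


def field_accuracy(predicted: List[Dict[str, str]], gold: List[Dict[str, str]]) -> float:
    """Fraction of gold fields whose predicted value matches after normalization.

    Assumed metric: exact match per field; the study may use a different one.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of documents")
    total = 0
    correct = 0
    for pred_doc, gold_doc in zip(predicted, gold):
        for field, gold_value in gold_doc.items():
            total += 1
            # A field missing from the prediction counts as an error.
            if normalize(pred_doc.get(field, "")) == normalize(gold_value):
                correct += 1
    return correct / total if total else 0.0


def meets_threshold(accuracy: float, threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """True when accuracy reaches the deployment threshold."""
    return accuracy >= threshold


# Hypothetical toy records, not taken from the study.
gold = [
    {"product_name": "Acetone", "cas_number": "67-64-1", "signal_word": "Danger"},
    {"product_name": "Ethanol", "cas_number": "64-17-5", "signal_word": "Danger"},
]
predicted = [
    {"product_name": "acetone", "cas_number": "67-64-1", "signal_word": "Warning"},
    {"product_name": "Ethanol", "cas_number": "64-17-5"},
]

accuracy = field_accuracy(predicted, gold)
print(f"accuracy={accuracy:.2f}, deployable={meets_threshold(accuracy)}")
```

In this toy case four of the six reference fields match, so the score falls well short of the threshold. A real evaluation would also need a policy for partial matches, units, and multi-valued fields.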

data extraction · safety data sheets · benchmarking
