Research
Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
The study introduces a hop-count taxonomy for predicting large language model (LLM) performance on electronic health record (EHR) question answering, and finds that accuracy declines consistently as the number of required inferential steps increases. Three models are evaluated (Claude Sonnet, GPT-4o, and GPT-5.4). For Claude Sonnet, accuracy falls from 30.6% to 17.6% as hop count rises from 1 to 4, a pattern consistent with limits on transformer compositionality. Because reasoning depth predicts failure, the taxonomy can inform risk stratification and deployment strategies for LLMs in healthcare settings.
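As a rough illustration of the stratified analysis summarized above, the following minimal sketch groups question-level results by hop count and computes per-hop accuracy, then works out the absolute and relative size of the reported Claude Sonnet decline. The function names (`accuracy_by_hop`, `accuracy_drop`) and the toy records are hypothetical and are not taken from the study; only the 30.6% and 17.6% figures come from the summary.

```python
from collections import defaultdict


def accuracy_by_hop(records):
    """Return {hop_count: accuracy} for (hop_count, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for hops, is_correct in records:
        totals[hops] += 1
        correct[hops] += int(is_correct)
    return {h: correct[h] / totals[h] for h in sorted(totals)}


def accuracy_drop(acc_start, acc_end):
    """Return (absolute drop, relative drop as a fraction of acc_start)."""
    absolute = acc_start - acc_end
    relative = absolute / acc_start
    return absolute, relative


# Toy, made-up records for illustration only (not data from the study).
toy = [(1, True), (1, False), (2, True), (2, False), (2, False), (2, False)]
print(accuracy_by_hop(toy))  # {1: 0.5, 2: 0.25}

# Accuracy figures reported for Claude Sonnet at hop counts 1 and 4.
abs_drop, rel_drop = accuracy_drop(30.6, 17.6)
print(round(abs_drop, 1), round(rel_drop, 3))  # 13.0 0.425
```

Under these assumptions, the reported decline corresponds to 13.0 percentage points, or roughly 42% of the 1-hop accuracy.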
clinical AI, EHR, transformer limits