Agents
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
DSAEval is a benchmark for evaluating LLM-based data agents on 641 real-world data science problems built on 285 diverse datasets, covering both structured and unstructured data. Its key features are Multimodal Environment Perception, Multi-Query Interactions, and Multi-Dimensional Evaluation, which together give a fuller assessment of agent performance. Results indicate that while agents such as Claude-Sonnet-4.5 perform well overall, unstructured data tasks remain difficult, which points to a need for stronger multimodal capabilities in AI-driven data science tools.
data science, evaluation, llm