Daily digest — 2026-07-03

What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

This study presents a meta-evaluation of table LLMs. It replicates four models by instruction-tuning three foundation models on four datasets, yielding 12 distinct models evaluated across 16 benchmarks. The findings indicate that the choice of base model affects performance more than the training data does, and that generalization and reasoning remain open challenges in table modeling. For practitioners, the results underscore the importance of base-model selection for table-related tasks.

arXiv cs.CL — 24 d ago · found 22 d ago · Research

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring and Feedback

The paper presents CoTAL, an approach that combines Chain-of-Thought prompting and Active Learning for formative assessment scoring with LLMs, specifically GPT-4. It uses Evidence-Centered Design to align assessments with curriculum goals and a human-in-the-loop method to refine prompts and rubrics iteratively, improving scoring performance by up to 38.9% over baseline methods. For practitioners, the framework aims to improve the reliability and quality of automated scoring across educational domains and to support better feedback for both teachers and students.

arXiv cs.CL — 24 d ago · found 22 d ago · Training

From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

The paper presents a causal audit of expert importance in Mixture-of-Experts (MoE) models, examining three architectures: OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, and DeepSeek-V2-Lite. The study finds that traditional observational metrics fail to predict causal expert importance, with effect sizes below Cohen's $d = 0.17$ across 60 combinations. This challenges the validity of current pruning methods that rely on population-level summaries. The work argues for rigorous interventional audits in interpretability practice and may inform more effective expert pruning strategies for MoE architectures.

arXiv cs.CL — 24 d ago · found 22 d ago · Research
