Daily digest — 2026-06-14

RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

RoboGPT-R1 is a two-stage fine-tuning framework for robot task planning that combines supervised training with reinforcement learning (RL). Built on the Qwen2.5-VL-3B model, it reports a 21.33% improvement over GPT-4o-mini and a 20.33% improvement over Qwen2.5-VL-7B on the EmbodiedBench benchmark, targeting long-horizon manipulation tasks. The RL stage uses a rule-based reward function intended to improve visual-spatial reasoning and the consistency of action sequences in real-world robotic applications.

arXiv cs.AI | 50 d ago | Training

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent is a multimodal framework for visually grounded reasoning in chart question answering, aimed specifically at unannotated charts. It iteratively decomposes each query into visual subtasks and carries them out with specialized actions such as annotation and cropping. It reports state-of-the-art results on benchmarks such as ChartBench and ChartX, with up to a 16.07% absolute improvement, and is strongest on unannotated, numerically intensive queries. The gains hold across chart types and reasoning complexities, and the framework works as a plug-and-play addition to existing LLMs.

arXiv cs.AI | 50 d ago | Agents

How can we assess human-agent interactions? Case studies in software agent design

The paper introduces PULSE, a framework for evaluating human-agent interactions that collects user feedback, uses machine learning to predict user satisfaction, and combines the human ratings with model-generated pseudo-labels. Applied to software engineering with the open-source agent OpenHands, PULSE evaluates agent design decisions across 15,000 users and yields confidence intervals 40% narrower than traditional A/B testing. The findings show significant discrepancies between benchmark performance and real-world user satisfaction, which argues for more robust evaluation methods for LLM-powered agents.

arXiv cs.AI | 50 d ago | Agents
