Agents — AI news — AI News Digest

How to Design an OpenHarness Style Agent Runtime with Tools, Memory, Permissions, Skills, and Multi-Agent Coordination

The article presents a tutorial on constructing an OpenHarness-style agent runtime, detailing the implementation of core components such as tool use, typed tool schemas, permissions, lifecycle hooks, memory management, skills, context compaction, retry logic, cost tracking, and multi-agent coordination. It emphasizes an open control flow, allowing practitioners to experiment with the system without reliance on external APIs or infrastructure. This approach provides valuable insights for AI engineers looking to build flexible and scalable agent systems.

MarkTechPost | 32 d ago | found 12 d ago | #agent_runtime #tools #multi-agent
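
To make the tool, schema, and permission ideas in the summary above more concrete, here is a minimal sketch of a runtime that validates typed arguments and checks a permission before dispatching a tool. It is not taken from the tutorial: the names `Tool`, `Runtime`, `granted`, and the permission strings are hypothetical, and the audit list is only a stand-in for lifecycle hooks.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple


@dataclass
class Tool:
    """A callable tool with a typed schema and a required permission (hypothetical shape)."""
    name: str
    params: Dict[str, type]      # parameter name -> expected Python type
    permission: str              # permission the runtime must have been granted
    fn: Callable[..., Any]


@dataclass
class Runtime:
    """Dispatches tool calls after permission and schema checks."""
    granted: Set[str]
    tools: Dict[str, Tool] = field(default_factory=dict)
    audit: List[Tuple[str, Dict[str, Any], Any]] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        tool = self.tools[name]
        # Permission gate: refuse before any side effect can happen.
        if tool.permission not in self.granted:
            raise PermissionError(f"{name} requires permission '{tool.permission}'")
        # Schema check: exact parameter names and matching types.
        if set(kwargs) != set(tool.params):
            raise TypeError(f"{name} expects parameters {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        result = tool.fn(**kwargs)
        # Post-call record, standing in for a lifecycle hook.
        self.audit.append((name, kwargs, result))
        return result


runtime = Runtime(granted={"math"})
runtime.register(Tool("add", {"a": int, "b": int}, "math", lambda a, b: a + b))
runtime.register(Tool("shell", {"cmd": str}, "exec", lambda cmd: cmd))
print(runtime.call("add", a=2, b=3))  # prints 5
```

Calling `runtime.call("shell", cmd="ls")` here raises `PermissionError`, because only the `math` permission was granted.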

Nex-N2-Mini-Ultra-Uncensored-Heretic Is Out Now, an Agentic Model With Agentic Thinking Now Uncensored With 5/100 Refusals and 0.0020 KLD, Available in Safetensors and GGUF Formats!

The Nex-N2-Mini-Ultra-Uncensored-Heretic model has been released, featuring 35 billion parameters and achieving a refusal rate of 5/100 with a Kullback-Leibler Divergence (KLD) of 0.0020. It is available in both Safetensors and GGUF formats, and utilizes Heretic version 1.2.0, which has shown better performance in terms of KLD compared to the newer version. This model's reduced censorship and enhanced performance metrics make it significant for practitioners looking to implement more flexible and capable LLMs in their applications.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #nex-n2 #agent #model

Qwen-AgentWorld-35B-A3B for Coding?

The Qwen-AgentWorld-35B-A3B model has been benchmarked, achieving an overall score of 56.39 and notable performance in specific categories such as Search (36.69) and SWE (65.63). This model, part of the Qwen series, provides important insights for practitioners focused on coding tasks, as it demonstrates competitive capabilities in software engineering contexts compared to other models like Qwen3.5-397B-A17B. Understanding these benchmarks can guide developers in selecting appropriate models for their AI applications.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #qwen #coding #model

Claude Tag embeds Anthropic's AI in Slack, already writes 65 percent of internal code, company says

Anthropic has released Claude Tag, an integration that allows teams to utilize its AI within Slack by tagging @Claude for task assignments. This tool reportedly generates 65% of the internal code for Anthropic's product team, highlighting its potential to enhance productivity and streamline coding workflows in collaborative environments. For practitioners, this integration demonstrates the increasing utility of AI in real-time coding assistance and team collaboration.

The Decoder | 33 d ago | found 12 d ago | #claude #slack #internal_code

Nous Research Adds /learn to Hermes Agent’s Skills System, Capturing Workflows as Slash Commands Without Hand-Writing SKILL.md

Nous Research has introduced the /learn command to the Hermes Agent Skills System, enabling the automatic generation of a standards-compliant SKILL.md from various sources such as local directories, document URLs, and past conversations. This enhancement allows the live agent to autonomously source content and create skills without manual input or a separate ingestion engine, streamlining the workflow for practitioners developing with LLMs by simplifying the skill creation process and reducing potential errors in documentation.

MarkTechPost | 33 d ago | found 12 d ago | #hermes #skills_system #workflows

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments

Qwen has released the Qwen-AgentWorld-35B-A3B, a 35 billion parameter mixture of experts (MoE) model that activates approximately 3 billion parameters per token. This model is designed to simulate various environments, including MCP, terminal, software engineering, Android, web, and OS interactions, by predicting the next state based on an agent's actions. It is particularly relevant for practitioners focused on agent training, offline evaluation, and the development of synthetic environments for tool-use workflows.

Reddit r/LocalLLaMA | 33 d ago | found 12 d ago | #qwen #model #agent

TACTFUL: Tactile-Driven Exploration For Object Localization and Identification in Confined Environments

TACTFUL is a novel tactile exploration framework designed for multi-fingered robots, enabling vision-free object localization and identification in confined environments. It employs a single policy trained on real hardware, achieving a 77% success rate and a 0.015 m average reconstruction error through a dynamic reward schedule that balances global exploration and local refinement. This approach highlights the potential of tactile sensing as a primary modality for object-level reasoning, offering significant implications for practitioners developing autonomous robotic systems.

arXiv cs.AI | 33 d ago | found 10 d ago | #tactile #robotics #exploration

Grading the Grader: Lessons from Evaluating an Agentic Data Analysis System

The article presents an evaluation of LAMBDA, a multi-agent data-analysis system, utilizing a three-layer human-AI grading cascade on 153 numerical QRData tasks from DSGym. The grading system achieved 100% precision with the strict grader and a 97% recall for the lenient grader against human labels, demonstrating effective strategies for distinguishing genuine outputs from grading artifacts. This work is significant for practitioners as it highlights the importance of hybrid grading approaches and the impact of iterative nudging on grading success, which can enhance the reliability of automated assessments in complex data analysis tasks.

arXiv cs.AI | 33 d ago | found 10 d ago | #data analysis #grading #multi-agent systems

SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation

SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation) is a new framework designed to enhance fault attribution in autonomous agents by utilizing a tool-augmented diagnostic loop, which allows for reading and searching trajectory segments alongside a persistent Short-Term Memory (STM). This approach decouples diagnostic accuracy from the limitations of LLM context windows, achieving a 20% improvement on the Who&When dataset and a 19% improvement on the TRAIL GAIA subset within specified token budgets. SAFARI maintains a precision of 0.58 even when diagnosing faults located 5x beyond the model's native context window, addressing a critical challenge in multi-step, multi-agent task execution.

arXiv cs.AI | 33 d ago | found 10 d ago | #multi-agent systems #fault attribution #diagnostics

ASALT: Adaptive State Alignment for Lateral Transfer in Multi-agent Reinforcement Learning

The paper introduces ASALT (Adaptive State Alignment for Lateral Transfer), a novel method in multi-agent reinforcement learning (MARL) that addresses the challenge of transferring knowledge between source and target domains with mismatched state-space dimensionalities. ASALT utilizes observation-level and state-level adapters to map observations and states into a shared embedding space, enhancing sample efficiency and global returns in cooperative environments while reducing negative transfer. This advancement is significant for practitioners as it facilitates more effective policy transfer across heterogeneous domains, potentially improving the performance of MARL systems in diverse applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement #transfer #multi-agent

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

The paper presents a reinforcement learning framework for Computer-Use Agents (CUAs) that utilizes autonomous vision-language evaluation as a scalable supervision signal, addressing the challenge of sparse reward signals in open-ended desktop environments. By modeling the imperfect feedback from a Vision-Language Model as a noisy binary reward channel, the authors implement a noise-corrected reward estimator for Proximal Policy Optimization, resulting in an average improvement of 12.6 percentage points in success rates over zero-shot performance. This approach highlights the potential of autonomous evaluation as a viable reward mechanism for training RL agents in graphical user interfaces, particularly when noise is accounted for in the reward estimation process.

arXiv cs.AI | 33 d ago | found 12 d ago | #reinforcement-learning #gui-agents #evaluation
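
The summary above does not give the paper's estimator, so the following is only a sketch of one standard way to correct a noisy binary reward, under the assumption that the evaluator's false-positive and false-negative rates are known. The function names and the rates used in the check are hypothetical. With observed reward r~ in {0, 1}, the corrected value (r~ - fp) / (1 - fp - fn) has expectation equal to the true reward.

```python
def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased correction of a binary reward seen through a noisy channel.

    fp_rate: assumed P(observed = 1 | true = 0)
    fn_rate: assumed P(observed = 0 | true = 1)
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0.0:
        raise ValueError("evaluator must be better than chance: fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def expected_corrected(true_reward: int, fp_rate: float, fn_rate: float) -> float:
    """Expectation of the corrected reward over the evaluator's noise."""
    # Probability that the noisy evaluator reports success.
    p_one = (1.0 - fn_rate) if true_reward == 1 else fp_rate
    return (p_one * corrected_reward(1, fp_rate, fn_rate)
            + (1.0 - p_one) * corrected_reward(0, fp_rate, fn_rate))


# Sanity check with assumed rates: the expectation recovers the true reward.
for true_reward in (0, 1):
    assert abs(expected_corrected(true_reward, 0.1, 0.2) - true_reward) < 1e-9
```

Individual corrected rewards can fall outside [0, 1]; only their average is unbiased, which is why such a correction is usually applied inside an expectation-based update such as a policy-gradient estimate.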

Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments

The paper introduces Reward-Centered ReST-MCTS, a decision-making framework designed to enhance Monte Carlo tree search (MCTS) for robotic manipulation in uncertain environments. It decomposes feedback into multiple channels—rule, heuristic, neural, and value estimation—allowing for improved search bias and robustness against challenges such as sparse rewards and noisy transitions. This framework is significant for AI practitioners as it provides a structured approach to improving decision-making in high-uncertainty scenarios without necessitating a fully differentiable policy.

arXiv cs.AI | 33 d ago | found 10 d ago | #robotics #decision_making #MCTS

Offline Reinforcement Learning for Warehouse SLAM Throughput Control

The article presents an offline reinforcement learning framework aimed at optimizing SLAM (Scan/Label/Apply/Manifest) throughput control in warehouse environments. Key technical details include the use of a history-informed state representation, action space abstraction for delayed-impact control, and a reward function that incorporates both upstream and downstream metrics. The framework integrates multiple offline RL algorithms, with empirical results showing that the CQL policy improves system health by 22.97% and reduces average throttling duration by 3.18%, highlighting the effectiveness of offline RL in enhancing operational efficiency in warehouse settings.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement_learning #warehouse #slam

A Unified Framework for Runtime Verification and Model-Based Diagnosis in LOLA

The article introduces a unified framework that integrates runtime verification and model-based diagnosis using the stream specification language LOLA. This framework allows for continuous online fault localization and detection by encoding system descriptions, health states, and observations within a single formalism, effectively handling both time-invariant and transient faults alongside nondeterministic observations. This development is significant for practitioners as it streamlines the fault management process in systems, reducing the need for separate toolchains and enhancing real-time diagnostics in AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #runtime verification #diagnosis #lola

Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control

The article presents a hierarchical multi-agent reinforcement learning (MARL) framework that integrates constraint manifold control to enforce hard safety constraints while enabling coordination among agents. This approach provides theoretical safety guarantees and achieves stationary learning dynamics, leading to stable and efficient training. Empirical results demonstrate competitive performance with nearly perfect safety rates, making it significant for practitioners focused on safety-critical applications in multi-agent systems.

arXiv cs.AI | 33 d ago | found 12 d ago | #multi-agent #reinforcement-learning #safety

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

The paper introduces AgenticInterpBench, a benchmark comprising 84 semi-synthetic transformer circuits with 163 component-level annotations, aimed at assessing language model (LM) agents' ability to explain identified circuits in mechanistic interpretability. It presents HyVE (Hypothesize, Validate, Explain), an agentic explainer that utilizes an iterative process to produce detailed explanations, demonstrating that while various LM backbones can generate useful insights, challenges in the validation phase hinder consistent performance. This work is significant for practitioners as it highlights the potential of LMs in circuit explanation while emphasizing the need for robust validation mechanisms to enhance interpretability in AI systems.

arXiv cs.AI | 33 d ago | found 12 d ago | #mechanistic interpretability #language model #agents #explanation

ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents

ATRIA is a multi-agent ECG reporting system designed to enhance clinical ECG report generation by decoupling interpretation and reporting, allowing for iterative context integration and bidirectional editing. It binds report claims to supporting evidence, flags unsupported statements, and enables clinicians to verify and revise findings, thereby reducing error propagation. Its architecture leverages existing ECG analysis models and is available as a cloud-based web service, making it ready for immediate deployment in clinical settings.

arXiv cs.AI | 33 d ago | found 12 d ago | #ecg-reporting #multi-agent-systems

When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments

The article introduces StockAgent, a multi-agent AI system utilizing large language models (LLMs) to simulate stock trading behaviors in response to external factors such as macroeconomic conditions and policy changes. StockAgent addresses the issue of test set leakage common in previous AI trading simulations, allowing for a more accurate analysis of trading behaviors and profitability under realistic market conditions. This framework provides insights that can enhance LLM-based investment strategies and stock recommendations, making it significant for practitioners in finance and AI.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #stock_trading #multi-agent

Maestro Order: A Model-Agnostic Orchestration Harness

Maestro Order is introduced as a model-agnostic orchestration harness designed to enhance the reliability of language models by integrating four structural primitives: decompose, ensemble, verify, and recurse, alongside a budget-aware controller for compute allocation. The architecture operates by treating models as black-box solvers and employs a verifier ensemble to improve reliability, achieving significant improvements in reliability metrics (e.g., from 0.55 to 0.999) through strategic verification and voting mechanisms. This framework is crucial for practitioners as it provides a systematic approach to mitigate hallucinations in AI systems, optimizing resource usage while ensuring high reliability in problem-solving tasks.

arXiv cs.AI | 33 d ago | found 10 d ago | #orchestration #model_agnostic
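
As a rough illustration of why ensembling and voting can raise reliability, the sketch below computes the accuracy of a simple majority vote over independent voters that are each correct with probability `p`. This is the textbook binomial calculation, not Maestro Order's controller. The independence assumption and the function name are assumptions made here, and the sketch does not attempt to reproduce the paper's reported figures.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    p: per-voter probability of being correct (assumed identical and independent)
    n: odd number of voters, so ties cannot occur
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(need, n + 1))


# With a single voter the ensemble is exactly as reliable as that voter.
assert abs(majority_vote_accuracy(0.55, 1) - 0.55) < 1e-12

# Reliability grows with ensemble size when each voter beats chance.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(0.55, n), 4))
```

Under these assumptions the gain from voting alone is slow when individual voters are only slightly better than chance, which is one reason a harness might pair voting with explicit verification rather than rely on ensemble size.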

LemonHarness Technical Report

LemonHarness is a newly announced integrated execution framework designed for long-horizon language model agents, establishing explicit execution boundaries to manage state changes during multi-step tasks. It constrains operations like file writes and artifact generation within a defined workspace, enhancing tracking and execution stability. Benchmark results show that LemonHarness_GPT-5.3-CodeX achieved 84.49% accuracy on Terminal-Bench 2.0, while the framework paired with the more powerful GPT-5.5 increased accuracy to 86.52%, highlighting its potential for improving the reliability of AI agents in complex workflows.

arXiv cs.AI | 33 d ago | found 12 d ago | #LLM #execution framework #workspace

Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

The article proposes a new model for e-commerce that leverages micro-transaction markets for verified product information, shifting the focus from product matching to acquiring trustworthy data. It outlines an architecture for this system where autonomous buyer agents can pay small amounts to access detailed product information, such as service histories and test reports, thus promoting genuine product quality and competitive pricing. The authors highlight key NLP challenges that arise from this model, including cost-optimal information acquisition and privacy-preserving persona modeling, suggesting these areas warrant further research and development in the field.

arXiv cs.AI | 33 d ago | found 10 d ago | #e-commerce #agents #micro-transactions

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis is a self-evolving agent system that uses a hierarchical dual-representation memory to bridge text and code memory, allowing for improved experience reuse. A controlled study shows that text and code representations offer complementary benefits, which motivates an architecture that organizes experiences into execution plans and callable tools. Evaluated on the AppWorld benchmark, Metis shows up to a 20.6% improvement in task accuracy and a 22.8% reduction in execution costs compared to the ReAct system, highlighting its efficiency and effectiveness for practitioners developing interactive agents.

arXiv cs.AI | 33 d ago | found 10 d ago | #self-evolving agents #memory #text and code

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

FlowR2A introduces a novel approach to multimodal driving planning by integrating scoring-based and anchor-based methods through a generative model that learns reward-conditioned action distributions. Utilizing a flow-matching decoder, it leverages dense trajectory-reward pairs to enhance the correlation between actions and their outcomes across multiple dimensions, including safety and compliance. This model achieves state-of-the-art performance on NAVSIM v1 and v2 benchmarks, offering high-quality proposals and improved sampling control, which is crucial for practitioners developing robust AI-driven driving systems.

arXiv cs.AI | 33 d ago | found 12 d ago | #driving planning #reward distribution #multimodal

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

The paper introduces Agentic long-term performance optimization (Agentic-LTPO), a bilevel optimization framework aimed at improving adaptive physical layer configurations in response to changing network policies and real-time constraints. It employs a multi-agent decision process for upper-level configuration generation and a closed-form beamformer for lower-level optimization, achieving a 57.2% improvement in long-term performance over traditional methods in a cell-free MIMO beamforming scenario. This approach is significant for practitioners as it enhances system adaptability and efficiency in dynamic network environments.

arXiv cs.AI | 33 d ago | found 12 d ago | #optimization #policy-driven #agentic-ai

DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects

DeepBD is a novel workflow designed for the prioritization and diagnostic interpretation of genetic variants associated with birth defects. It integrates a pretrained evidence engine that evaluates patient-specific variant scores using structured rule evidence and phenotype-conditioned biological context, achieving Recall@1/3/5/10 scores of 0.658/0.882/0.912/0.929 on a benchmark of 18,622 cases, outperforming existing tools like Exomiser and DeepRare. This approach is significant for practitioners as it enhances the accuracy of variant prioritization by combining various evidence sources and LLM-assisted review, thereby improving diagnostic outcomes in clinical genetics.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #genetic #agents #workflow
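
For readers unfamiliar with the Recall@k figures quoted above, the sketch below shows how the metric is usually computed for ranking tasks with one correct answer per case: the fraction of cases whose true item appears in the top k of the ranked list. The variant identifiers and rankings are made up for illustration and are not from the DeepBD benchmark.

```python
from typing import List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of cases whose true item is among the top-k ranked candidates."""
    if len(rankings) != len(truths) or not truths:
        raise ValueError("need one non-empty ranking per ground-truth label")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical ranked variant lists for three cases, best candidate first.
rankings: List[List[str]] = [
    ["v1", "v2", "v3"],   # true variant ranked 1st
    ["v9", "v4", "v5"],   # true variant ranked 2nd
    ["v7", "v8", "v6"],   # true variant ranked 3rd
]
truths = ["v1", "v4", "v6"]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(rankings, truths, k):.3f}")
```

Recall@k can only stay the same or increase as k grows, which matches the increasing sequence of scores reported in the summary.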

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

The article presents "NoContactNoWorries," a transformer-based multimodal framework designed to estimate binary contact states in robotic manipulation by integrating RGB-D vision with proprioceptive data. This approach addresses the limitations of traditional tactile sensors by enabling robots to infer contact through visual cues, thereby supporting downstream tasks such as in-hand object reorientation. Experimental validation in both simulation and real-world scenarios demonstrates the model's effectiveness and potential for enhancing dexterous manipulation capabilities in robotics.

arXiv cs.AI | 33 d ago | found 10 d ago | #robotics #manipulation #contact estimation

E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis

The article introduces E-MRL (Evidence-driven Multimodal Reinforcement Learning), a novel framework designed to enhance 3D tumor analysis by addressing visual hallucinations in Vision-Language Models. E-MRL operates as a Markov Decision Process focusing on "diagnosis-localization-verification" and incorporates a cross-view consistency reward to ensure semantic alignment between diagnostic reports and visual evidence from 3D CT data. Experimental results on large-scale datasets show that E-MRL outperforms traditional Supervised Fine-Tuning and Reinforcement Learning approaches, improving diagnostic accuracy and reliability for practitioners in medical imaging and AI-driven diagnostics.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement_learning #multimodal #medical

LLM-MINE: Large Language Model based Alzheimer's Disease and Related Dementias Phenotypes Mining from Clinical Notes

LLM-MINE is a proposed framework leveraging Large Language Models for the automatic extraction of Alzheimer's Disease and Related Dementias (ADRD) phenotypes from unstructured clinical notes. The framework demonstrated superior performance in phenotype clustering, achieving an Adjusted Rand Index (ARI) of 0.290 and Normalized Mutual Information (NMI) of 0.232, significantly surpassing traditional biomedical Named Entity Recognition (NER) and dictionary-based methods. This advancement is crucial for practitioners as it enhances the ability to mine clinically relevant signals from electronic health records, facilitating early detection and staging of ADRD.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #phenotype #extraction

SP-Mind: An Autonomous Reasoning Agent for Spatial Proteomics Analysis

SP-Mind is introduced as the first autonomous AI agent specifically designed for spatial proteomics analysis, streamlining the process from raw multiplexed tissue imaging to phenotype discovery without requiring task-specific fine-tuning. It utilizes expert-curated biological analysis skills and specialized computational tools, and its performance is rigorously evaluated using SP-Bench, a benchmark consisting of 102 tasks across 18 categories, where SP-Mind demonstrates state-of-the-art results compared to existing biomedical agent baselines. This development is significant for practitioners as it enhances scalability and reproducibility in spatial proteomics research, facilitating more efficient analysis workflows in precision medicine.

arXiv cs.AI | 33 d ago | found 12 d ago | #proteomics #AI agent #workflow

Bayesian control for coding agents

A new approach to orchestration in coding agents using Bayesian control has been proposed, where a Bayesian controller dynamically manages tool-use decisions based on a belief over candidate correctness. This method was evaluated across six LLM generators and nine coding benchmarks, demonstrating superior performance in scenarios where verification is costly and critics provide informative yet imperfect feedback. The belief state produced by this controller offers an interpretable correctness score that surpasses traditional metrics like token probability and raw tool success, enhancing uncertainty quantification for practitioners in AI development.

arXiv cs.AI | 33 d ago | found 10 d ago | #coding #bayesian #agents
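
The idea of a belief over candidate correctness that is updated by an imperfect critic can be sketched with a single application of Bayes' rule. The summary does not describe the paper's controller in detail, so the function name, the critic's assumed pass rates, and the numbers below are illustrative assumptions only.

```python
def update_belief(prior: float, critic_passed: bool,
                  pass_if_correct: float, pass_if_wrong: float) -> float:
    """Posterior probability that a candidate is correct after one critic verdict.

    pass_if_correct: assumed P(critic passes | candidate correct)
    pass_if_wrong:   assumed P(critic passes | candidate wrong)
    """
    if critic_passed:
        like_correct, like_wrong = pass_if_correct, pass_if_wrong
    else:
        like_correct, like_wrong = 1.0 - pass_if_correct, 1.0 - pass_if_wrong
    evidence = prior * like_correct + (1.0 - prior) * like_wrong
    if evidence == 0.0:
        raise ValueError("verdict has zero probability under the assumed critic model")
    return prior * like_correct / evidence


# An informative but imperfect critic: passes 90% of correct and 30% of wrong candidates.
belief = 0.5
belief = update_belief(belief, True, 0.9, 0.3)   # one pass raises the belief
print(round(belief, 3))  # prints 0.75
belief = update_belief(belief, False, 0.9, 0.3)  # a later fail pulls it back down
print(round(belief, 3))  # prints 0.3
```

A controller built on this could, for example, call a costly verifier only while the belief sits between two thresholds, although the stopping rule actually used in the paper is not stated in the summary.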

Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines

The article presents a framework for evolving expert "LLM + script" workflows into adaptable systems through a reversible migration path, termed the Strangler-Fig approach. This framework introduces a three-tier convertibility taxonomy (A/B/C) that assesses and routes legacy workflows into composable, typed, and auditable stages, addressing the need for dynamic adaptation based on feedback. This development is significant for practitioners as it provides a structured method to modernize existing workflows, enhancing their flexibility and responsiveness in AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #workflow #llm #migration

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

The study evaluates whether exact-match retrieval recall is a reliable measure of policy utility in long-horizon tool-use agents, using Qwen2.5-3B/7B classifiers within the tau-bench framework. It shows that a compact structured state improves macro-F1 scores by 0.13-0.17, whereas classification performance with retrieved policy clauses does not differ significantly from performance with gold clauses, suggesting that exact-match recall may misrepresent the utility of retrieved policies. The finding suggests that practitioners should evaluate retrievers with the retrieved policies in the classification loop rather than relying solely on recall metrics.

arXiv cs.AI | 33 d ago | found 10 d ago | #policy_signal #tool_use

Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

Themis is a newly announced explainable AI framework designed for Reinforcement Learning (RL) that integrates human feedback to enhance safety and transparency. It supports over 200 environments and allows for easy configuration of experiments, demonstrating the ability to train reward models that align closely with true reward signals based on human preferences. This framework is significant for practitioners as it provides a scalable, user-friendly platform for conducting RL experiments with large participant groups while ensuring robust alignment and explainability, addressing critical challenges in RL safety.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement learning #human feedback #explainable ai

OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility

OmniPath is a newly announced framework designed to enhance wheelchair accessibility by integrating OpenStreetMap's network topology with high-density aerial LiDAR data to produce a detailed 3D model of pedestrian environments. The system analyzes surfaces in 0.5 meter increments to quantify physical friction points and assess compliance with ADA standards, categorizing hazards based on severity with an F1-score of 0.60 for severe and 0.58 for critical issues. This proactive auditing approach allows for the identification of accessibility challenges, transforming static maps into dynamic, actionable data for wheelchair users.

arXiv cs.AI | 33 d ago | found 12 d ago | #auditing #accessibility #environmental analysis

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

This article presents a reinforcement learning approach for optimizing real-time event triggering at the Large Hadron Collider (LHC), addressing the limitations of static, hand-tuned trigger menus. The authors adapt Group-Filtered Policy Optimization (GFPO) for streaming control, achieving significant improvements in signal efficiency and in-tolerance rates for both total transverse energy and anomaly-detection triggers, with gains of up to 56% in real collision data without fine-tuning. This work is significant as it demonstrates the first application of RL for trigger control in real LHC data, potentially enhancing the efficiency of data collection in high-energy physics experiments.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement_learning #large_hadron_collider

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

BioMedArena is an open-source toolkit designed to enhance the reproducibility of deep research agents in biomedical applications by standardizing evaluation processes across 166 benchmarks and 75 tools. It decouples six layers of evaluation, allowing for easy integration of new models with minimal engineering effort, and includes six agent harnesses and context-management strategies that improve performance across 12 backbones, achieving an average improvement of 15.01 percentage points over prior state-of-the-art results on eight benchmarks. This toolkit is significant for practitioners as it streamlines the evaluation and comparison of biomedical AI models, facilitating more consistent and reliable research outcomes.

arXiv cs.AI | 33 d ago | found 10 d ago | #open_source #agents #biomedical

From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes

The paper introduces the Goal-Oriented Dialogue Runtime (GODR), a conceptual framework designed to enhance conversational continuity in complex, multi-domain interactions involving interdependent objectives. GODR treats goals, task frames, and lifecycle states as first-class runtime objects, enabling better management of suspended, resumed, or invalidated goals, and is intended to work alongside existing orchestration frameworks rather than replace them. This framework is significant for practitioners as it addresses the challenges of maintaining conversational coherence in sophisticated dialogue systems, paving the way for more robust multi-agent interactions.

arXiv cs.AI | 33 d ago | found 10 d ago | #dialogue_systems #goal_oriented #llm

The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents

The paper introduces the Latent Bridge, a novel continuous communication channel that enhances the interaction between a slow reasoning VLM (Qwen3-VL-8B-Thinking) and a fast reactive VLM (MiniCPM-o 4.5) by projecting the slow model's residuals into the fast model's input-embedding space, eliminating the need for text round-trips. Evaluated on 7 Atari games and a driving domain (MetaDrive), the Latent Bridge outperforms the traditional Text Bridge in several cases, notably improving performance in MsPacman by 57% and RoadRunner by 28%. This development is significant for practitioners as it offers a method to optimize real-time decision-making in AI agents, particularly in environments where latency and planning quality are critical.

arXiv cs.AI | 33 d ago | found 12 d ago | #game-agents #real-time #planning

PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation

PixJail is a newly proposed framework designed for reproducible evaluation of Text-to-Image (T2I) jailbreak techniques, addressing the challenges of pipeline-level testing across multiple stages such as prompt transformation and safety filtering. The framework constructs paper-specific attack modules and evaluation pipelines, achieving an average reproduction error of 2.1% across eleven T2I jailbreak methods. This tool is significant for AI practitioners as it streamlines the reproduction process and enhances the reliability of benchmark comparisons in the rapidly evolving field of T2I jailbreaks.

arXiv cs.AI | 33 d ago | found 10 d ago | #text-to-image #jailbreak #evaluation #agents

Governed Shared Memory for Multi-Agent LLM Systems

The paper introduces a framework for governed shared memory in multi-agent LLM systems, addressing key issues such as unauthorized leakage and stale data propagation through defined primitives like scoped retrieval and provenance tracking. Implemented in MemClaw and evaluated with ArgusFleet, the system achieved 100% accuracy in provenance reconstruction and optimized write-to-visible latency to a single search round-trip, while revealing architectural challenges like asymmetric scope enforcement and pipeline ordering conflicts. This work underscores the necessity of explicit systems-level abstractions for effective multi-agent memory management in production environments, highlighting the importance of real-world evaluations to identify potential failures.

arXiv cs.AI · 33 d ago · found 12 d ago · #multi-agent-systems #memory-management
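
The two primitives named in the summary above, scoped retrieval and provenance tracking, can be illustrated with a small in-memory store. This is a minimal sketch under stated assumptions, not MemClaw's API: the `MemoryRecord` and `SharedMemory` names, the string-valued scopes, and the keyword match are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    record_id: str
    content: str
    scope: str                          # who may read it, e.g. "team:billing" (hypothetical)
    written_by: str                     # agent that produced the record
    derived_from: tuple[str, ...] = ()  # ids of the records this one was built from


class SharedMemory:
    def __init__(self) -> None:
        self._records: dict[str, MemoryRecord] = {}

    def write(self, record: MemoryRecord) -> None:
        # Refuse dangling provenance links so lineage can always be reconstructed.
        missing = [p for p in record.derived_from if p not in self._records]
        if missing:
            raise ValueError(f"unknown parent records: {missing}")
        self._records[record.record_id] = record

    def retrieve(self, agent_scopes: set[str], keyword: str) -> list[MemoryRecord]:
        # Scoped retrieval: the scope check is applied to every candidate record.
        return [
            r for r in self._records.values()
            if r.scope in agent_scopes and keyword.lower() in r.content.lower()
        ]

    def provenance(self, record_id: str) -> list[str]:
        # Walk derived_from links back to the original sources.
        chain: list[str] = []
        stack = [record_id]
        while stack:
            current = stack.pop()
            if current in chain:
                continue
            chain.append(current)
            stack.extend(self._records[current].derived_from)
        return chain


memory = SharedMemory()
memory.write(MemoryRecord("r1", "Invoice 17 is overdue", "team:billing", "agent-a"))
memory.write(MemoryRecord("r2", "Customer flagged for overdue invoice",
                          "team:support", "agent-b", derived_from=("r1",)))

visible = memory.retrieve({"team:support"}, "overdue")  # r1 is filtered out by scope
lineage = memory.provenance("r2")                       # r2 first, then its source r1
```

Filtering on scope inside `retrieve` keeps the access decision in one place, which is the kind of explicit systems-level abstraction the paper argues for.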

Emergent Relational Order in LLM Agent Societies: From Collective Affect to Authority Stratification

The article introduces CAREB-MAS, a multi-agent framework designed to explore long-term social structures in agent societies using principles from Affect Control Theory and Social Identity Theory. The framework enables agents to develop egocentric identities and interact based on minimal protocols, leading to the emergence of five key phenomena associated with Differential Order, including stable labor specialization and emergent relational authority. This research highlights the potential of LLM-based simulations to provide insights into social dynamics and structures, which is crucial for practitioners aiming to model complex social interactions in AI systems.

arXiv cs.AI · 33 d ago · found 10 d ago · #multi_agent_systems #social_dynamics #llm

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

A new benchmark study compares graphical user interface (GUI) agents with command-line interface (CLI) agents on 440 desktop software tasks. The strongest GUI agent achieved a 59.1% full pass rate, while the best original-skill CLI agent reached 48.2%, with skill augmentation improving CLI success to 69.3%. The study finds that GUI agents struggle with long-horizon workflows because of grounded-interaction limitations, whereas CLI agents are held back by skill coverage and scalability, which gives practitioners concrete guidance when choosing an execution environment for their agents.

arXiv cs.AI · 33 d ago · found 12 d ago · #gui #cli #benchmark

Evolving Programmatic Skill Networks

The article introduces the Programmatic Skill Network (PSN), a framework for continual skill acquisition in embodied environments that utilizes large language models to create executable symbolic programs. Key mechanisms include structured fault localization, maturity-aware optimization, and canonical structural refactoring, which enhance skill stability and adaptability. Experiments conducted in MineDojo and Crafter show that PSN achieves effective skill reuse and generalization, highlighting its potential for advancing AI agents in dynamic task environments.

arXiv cs.AI · 33 d ago · found 10 d ago · #skill-acquisition #agents

MuTRAP: Multi-trigger Trojans Attacking Robot Task Planning Systems

MuTRAP is introduced as the first multi-trigger Trojan attack targeting LLM-assisted robot task planning systems. It injects backdoors through a small set of task-specific parameters while optimizing multiple trigger words for various robotic applications, exposing vulnerabilities in current LLM-based planners. This research highlights critical security implications for practitioners working with LLMs in robotics, emphasizing the need for enhanced security measures in AI-driven task planning.

arXiv cs.AI · 33 d ago · found 10 d ago · #robotics #task_planning #security

Engineering Reliable Autonomous Systems: Challenges and Solutions

The report from the "Engineering Reliable Autonomous Systems" (ERAS) workshop, held in June 2024, outlines key challenges and solutions in autonomous systems engineering. It identifies critical areas such as verification and validation techniques, real-world engineering practices, and safe software architectures, culminating in a catalogue of challenges and proposed pathways for addressing them. This roadmap is significant for practitioners as it bridges the gap between academic techniques and practical implementation, fostering collaboration and advancing research in reliable autonomous systems.

arXiv cs.AI · 33 d ago · found 10 d ago · #autonomous_systems #engineering #reliability

Subjective-Graph LLM Agents for Simulating Uncertainty in Classroom Social Perception

The article presents a framework for multi-agent LLMs utilizing subjective graphs to simulate uncertainty in social perception within educational settings. The agents employ individualized graphs to manage peer visibility and communication, updating Gaussian belief states through Bayesian fusion. Evaluated across 12 middle-school classrooms, the framework showed a significant increase in collective ranking error, indicating persistent distortions in perceived academic standing, and preserved opinion diversity better than traditional DeGroot configurations. These results make it relevant to practitioners interested in modeling social dynamics in AI systems.

arXiv cs.AI · 33 d ago · found 10 d ago · #social-perception #uncertainty
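
The summary above mentions Gaussian belief states updated through Bayesian fusion and contrasts them with DeGroot averaging. The paper's exact update rule is not given here, so the sketch below uses the textbook precision-weighted fusion of two independent Gaussian estimates and a plain DeGroot step for comparison. The class and function names and the example numbers are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianBelief:
    mean: float
    variance: float


def fuse(prior: GaussianBelief, observation: GaussianBelief) -> GaussianBelief:
    """Precision-weighted fusion of two independent Gaussian estimates."""
    prior_precision = 1.0 / prior.variance
    obs_precision = 1.0 / observation.variance
    post_variance = 1.0 / (prior_precision + obs_precision)
    post_mean = post_variance * (
        prior_precision * prior.mean + obs_precision * observation.mean
    )
    return GaussianBelief(post_mean, post_variance)


def degroot_step(own: float, peers: list[float], self_weight: float = 0.5) -> float:
    """DeGroot update: a fixed-weight average that carries no uncertainty."""
    peer_average = sum(peers) / len(peers)
    return self_weight * own + (1.0 - self_weight) * peer_average


# An agent's belief about a peer's standing, then a more confident report.
prior = GaussianBelief(mean=0.50, variance=0.04)
report = GaussianBelief(mean=0.80, variance=0.01)
posterior = fuse(prior, report)            # mean about 0.74, variance about 0.008

averaged = degroot_step(0.50, [0.80, 0.60])  # about 0.60, regardless of confidence
```

The fused mean sits closer to the lower-variance report, whereas the DeGroot step weights opinions identically no matter how certain each source is.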

World Models in Pieces: Structural Certification for General Agents

The paper introduces a novel approach called structural certification for general agents, addressing the limitations of standard worst-case analysis in the big-world regime. By formalizing the concept that general agents cannot be universally capable, the authors present algorithms that use deep compositional goals to filter transitions, achieving an error bound of $\mathcal{O}(1/n) + \mathcal{O}(\delta)$ for goal-conditioned performance. This framework allows practitioners to certify the reliability of long-horizon planning in specific transitions, enhancing the deployment of general agents in complex environments.

arXiv cs.AI · 33 d ago · found 10 d ago · #agents #world models #certification

2.5-D Decomposition for LLM-Based Spatial Construction

The paper introduces a neuro-symbolic pipeline utilizing 2.5-D decomposition, which enables large language models (LLMs) to plan in a two-dimensional space while a deterministic executor handles vertical placements, significantly reducing systematic coordinate errors in spatial reasoning for autonomous construction. On the Build What I Mean benchmark, the GPT-4o-mini model integrated with this pipeline achieved a mean structural accuracy of 94.6%, outperforming GPT-4o and other competing systems, while demonstrating the ability to run on edge hardware like the Nemotron-3 120B with similar results. This approach is relevant for practitioners as it enhances LLM performance in tasks constrained by physical dimensions, potentially improving reliability in various autonomous construction applications.

arXiv cs.AI · 33 d ago · found 10 d ago · #llm #spatial_reasoning #2.5D

India’s MoEngage bets that the future of marketing is millions of AI agents

MoEngage has acquired technology that enables the deployment of individual AI agents for personalized customer interactions. This move signifies a shift towards leveraging AI for targeted marketing strategies, potentially enhancing customer engagement and retention through tailored experiences. Practitioners in AI and marketing will find implications for developing scalable, agent-based systems that can efficiently manage customer relationships.

TechCrunch AI · 33 d ago · found 12 d ago · #moengage #ai-agents #marketing

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Ai2 has released Tmax-27B, a terminal-agent LLM built on Qwen3.6 and trained with DPPO reinforcement learning, which scores approximately 43% on Terminal Bench 2.0 and 69% on TB Lite. The original model is 54 GB at FP16, but quantized versions at 2 to 5 bits per weight range from approximately 8.47 GB to 14.05 GB and fit on consumer GPUs. This lets developers run a capable terminal agent without high-end hardware.

Reddit r/LocalLLaMA · 33 d ago · found 21 d ago · #tmax-27b #terminal-agent #dppo
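
The sizes quoted for Tmax-27B follow from simple arithmetic: weight storage is roughly parameters times bits per weight, divided by 8. The sketch below checks the figures under a few assumptions that are not in the post: exactly 27 billion parameters (read off the model name), decimal gigabytes, and no allowance for file metadata or layers kept at higher precision.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits_per_weight(size_gb: float, n_params: float) -> float:
    """Invert the formula: average bits per weight implied by a file size."""
    return size_gb * 1e9 * 8 / n_params


N_PARAMS = 27e9  # assumption: taken from the "27B" in the model name

fp16_size = model_size_gb(N_PARAMS, 16)                 # 54.0 GB, matching the post
smallest_bpw = effective_bits_per_weight(8.47, N_PARAMS)   # roughly 2.5 bits per weight
largest_bpw = effective_bits_per_weight(14.05, N_PARAMS)   # roughly 4.2 bits per weight
```

Both implied averages fall inside the 2 to 5 bits-per-weight range the post describes, so the reported file sizes are consistent with the stated quantization levels.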

Anthropic’s Claude Tag is learning your company, one Slack message at a time

Anthropic has released Claude Tag, an AI assistant integrated into Slack, designed to continuously learn from organizational communications. This feature aims to enhance productivity by capturing contextual and institutional knowledge, which could streamline enterprise workflows. Its implementation may significantly impact how teams leverage AI to optimize collaboration and decision-making processes.

TechCrunch AI · 33 d ago · found 21 d ago · #anthropic #claude #slack

The Low-Tech AI of Elden Ring

The article discusses the AI techniques used in the game Elden Ring, highlighting its reliance on low-tech methods rather than advanced machine learning models. It emphasizes the use of finite state machines and behavior trees for NPC decision-making, which allows for complex interactions without the computational overhead typically associated with modern AI approaches. This insight is valuable for practitioners interested in efficient game AI design that prioritizes performance and resource management over cutting-edge techniques.

Hacker News · 34 d ago · found 12 d ago · #ai #elden-ring
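
To show why finite state machines are cheap to run, here is a minimal NPC sketch. It is not Elden Ring's actual logic: the three states, the distance thresholds, and the `next_state` function are invented for illustration. The point is that each decision is a handful of comparisons against the current state.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CHASE = auto()
    ATTACK = auto()


# Hypothetical tuning values, in arbitrary distance units.
AGGRO_RANGE = 15.0   # an idle NPC notices the player inside this range
LEASH_RANGE = 25.0   # a chasing NPC gives up beyond this range
ATTACK_RANGE = 2.0   # close enough to swing


def next_state(state: State, distance: float) -> State:
    """Pick the next state from the current state and the distance to the player."""
    if state is State.IDLE:
        return State.CHASE if distance <= AGGRO_RANGE else State.IDLE
    if state is State.CHASE:
        if distance <= ATTACK_RANGE:
            return State.ATTACK
        if distance > LEASH_RANGE:
            return State.IDLE
        return State.CHASE
    # state is ATTACK: keep attacking while in reach, otherwise resume the chase
    return State.ATTACK if distance <= ATTACK_RANGE else State.CHASE


# Replay a short encounter: the player approaches, gets hit, then runs away.
state = State.IDLE
trace = []
for distance in [30.0, 12.0, 1.5, 5.0, 20.0, 26.0]:
    state = next_state(state, distance)
    trace.append(state.name)
```

Because the leash range is larger than the aggro range, the same distance of 20 leaves an idle NPC idle but keeps a chasing NPC chasing, which is the kind of state-dependent behavior a stateless rule cannot express.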

Infrastructure for the Agentic Web: Gap Analysis and Architecture from the Agentverse Platform

The paper presents a comprehensive analysis of the Agentverse platform by Fetch.ai, which serves as a foundational infrastructure for autonomous AI agents. It catalogs 204 API endpoints and identifies 62 missing capabilities across eight categories, proposing a seven-layer Agent Cloud Stack as a reference architecture for the future. This work is significant for AI practitioners as it outlines essential infrastructure improvements necessary for scaling agent-native applications, aiming to support the development of the agentic web by 2030.

arXiv cs.AI · 34 d ago · found 20 d ago · #ai-agents #infrastructure #web

Harnessing Agent Skills: Architectural Patterns and a Reference Architecture for Skill-Mediated LLM Agents

The paper presents a reference architecture for skill-mediated LLM agents, detailing ten architectural patterns that facilitate the transition from static skill artefacts to dynamic skill-in-use. It introduces a framework comprising four responsibility layers: Supply Chain, Mediation, Execution Control, and Evidence & Feedback, evaluated through cross-instantiation across eight systems. This work is significant for practitioners as it offers a structured approach to integrating reusable agent skills, enhancing the robustness and adaptability of LLM applications.

arXiv cs.AI · 34 d ago · found 20 d ago · #agent skills #architecture #llm

MemoryVAM: Integrating Memory into Video Action Model for Robot Manipulation

MemoryVAM introduces an episodic memory mechanism for video-world-model policies, enhancing long-horizon manipulation tasks by integrating a Recap-Cue module that compresses per-frame CLIP embeddings into memory tokens. This model employs a lightweight Cue Gate for task completion estimation and can be applied to various backbones, including UNet and Diffusion Transformer, with significant improvements in performance on the LIBERO-Mem benchmark, raising average success rates from 5% to 42.5%. For practitioners, these advancements in memory integration are crucial for developing more robust AI systems capable of handling complex, temporally extended tasks in real-world robot manipulation scenarios.

arXiv cs.AI · 34 d ago · found 15 d ago · #robot manipulation #memory #video action

SPARC: A Multi-Agent System for Electrical Circuit Question Answering

SPARC is a newly introduced multi-agent system designed for electrical circuit question answering, leveraging executable physics-based simulations for enhanced reasoning. It employs LLM agents to create, execute, and analyze simulation programs, achieving an accuracy of 83%, which represents up to a 58% absolute improvement over existing baselines. This advancement is significant for practitioners as it facilitates more reliable and accurate responses to complex circuit-related queries, while also enabling systematic error diagnosis.

arXiv cs.AI · 34 d ago · found 20 d ago · #qa #circuit diagrams #llm

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

MacAgentBench is a new benchmark for evaluating computer use agents (CUAs) on macOS, featuring 676 tasks across 25 applications, with a focus on both GUI and CLI interactions. It employs deterministic rule-based evaluation and fine-grained multi-checkpoint scoring, revealing that Claude Opus 4.6 on OpenClaw achieves a 73.7% Pass@1 score, primarily due to its skill library. This benchmark is significant for practitioners as it provides a more comprehensive assessment of agent performance in real-world scenarios, highlighting the importance of framework capabilities in long-horizon, multi-application tasks.

arXiv cs.AI · 34 d ago · found 20 d ago · #benchmark #macos #desktop_agents

Role-Based Agentic AI for Intent-Driven Network and Service Orchestration

The paper introduces a role-based multi-agent system (MAS) architecture designed for end-to-end intent orchestration in telecommunications, addressing the challenges of integrating Business Support Systems (BSS) and Operations Support Systems (OSS). The architecture features a hierarchical four-layer system that includes leadership, service, and resource agents, which are dynamically instantiated to meet intent requirements. This framework enhances autonomous network management by enabling coordinated planning and service delivery, thereby facilitating scalable and accountable intent-driven orchestration in complex network environments.

arXiv cs.AI · 34 d ago · found 20 d ago · #agentic #network #orchestration

CFAgentBench: A Reproducible Environment and Benchmark for Autonomous Construction-Finance Agents

CFAgentBench has been introduced as a reproducible environment and benchmark for autonomous construction-finance agents, featuring 1,014 machine-gradeable task specifications across 8 domains. The benchmark utilizes 35 mock applications and employs functional correctness grading with a money-movement guard to ensure tasks requiring human approval are not executed automatically. Initial evaluations show that the strongest agent achieves a pass rate of 0.67 for single attempts, but drops to 0.38 under repeated conditions, highlighting the challenges in deploying reliable construction-finance AI systems.

arXiv cs.AI · 34 d ago · found 20 d ago · #autonomous agents #benchmark #construction

ARCO: Adaptive Rubric with Co-Evolution for Multi-Step LLM-Based Agents

The article introduces ARCO (Adaptive Rubric CO-evolution), a novel framework for improving credit assignment in multi-step LLM-based agents through adaptive rubric-based rewards. ARCO features a shared backbone model with a generation head for producing per-step criteria and a scoring head for predicting step-level rewards, enabling dynamic co-evolution of rubric content and scoring functions without requiring step-level labels. The framework demonstrates significant performance improvements on benchmarks like HotpotQA and MuSiQue, providing practitioners with a more interpretable and effective approach to reinforcement learning in LLMs.

arXiv cs.AI · 34 d ago · found 20 d ago · #multi-step agents #llm #adaptive rubric

How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs

This study evaluates the effectiveness of hierarchical organization of action logs in Programming by Demonstration (PbD) for LLM agents. By grouping actions into labeled subgoals, the researchers found that this structure improved task completion rates in ambiguous natural-language tasks from 76.7% to 90.7%, while flat action logs showed minimal improvement. The findings suggest that for practitioners, implementing a hierarchical approach in PbD pipelines can significantly enhance the quality of procedural knowledge conveyed to LLMs, particularly in scenarios where task descriptions are vague.

arXiv cs.AI · 34 d ago · found 20 d ago · #programming by demonstration #llm #agents
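
The difference between a flat action log and one grouped into labeled subgoals is easiest to see side by side. The sketch below is an illustration only: the `Subgoal` type, the rendering functions, and the example demonstration are assumptions, not the study's data format or prompt template.

```python
from dataclasses import dataclass, field


@dataclass
class Subgoal:
    label: str
    actions: list[str] = field(default_factory=list)


def render_flat(actions: list[str]) -> str:
    """One numbered line per recorded action, with no grouping."""
    return "\n".join(f"{i}. {action}" for i, action in enumerate(actions, 1))


def render_hierarchical(subgoals: list[Subgoal]) -> str:
    """The same actions, nested under the subgoal each one serves."""
    lines = []
    step = 1
    for subgoal in subgoals:
        lines.append(f"Subgoal: {subgoal.label}")
        for action in subgoal.actions:
            lines.append(f"  {step}. {action}")
            step += 1
    return "\n".join(lines)


# Hypothetical demonstration of exporting a report and emailing it.
flat_log = [
    "click File menu", "click Export", "choose CSV",
    "click Compose", "attach report.csv", "click Send",
]
demo = [
    Subgoal("Export the report as CSV", flat_log[:3]),
    Subgoal("Email the exported file", flat_log[3:]),
]

flat_prompt = render_flat(flat_log)
hierarchical_prompt = render_hierarchical(demo)
```

Both renderings carry identical actions in identical order; the hierarchical one only adds the intent labels, which is the extra signal the study credits for the improvement on ambiguous tasks.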

AlphaMemo: Structured Search-Process Memory for Self-Evolving Alpha Mining Agents

AlphaMemo is introduced as a self-evolving alpha mining agent that utilizes Structured Search-Process Memory to enhance the efficiency and effectiveness of financial model development. It records reusable evidence of successful and unsuccessful edit motifs derived from Abstract Syntax Tree (AST) differences, employing techniques like confidence-gated residual memory and asymmetric veto control to mitigate overfitting and redundancy. Experimental results on the CSI 500 and S&P 500 datasets demonstrate improved out-of-sample performance and discovery efficiency, making it a valuable tool for practitioners in financial AI modeling.

arXiv cs.AI · 34 d ago · found 20 d ago · #alpha mining #llm #agents

ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

ChainWorld introduces a framework for composing long-horizon desktop workloads from atomic OSWorld tasks, addressing the gap in evaluating computer-use agents beyond single atomic tasks. It features 347 task chains of varying lengths and compares single-turn versus multi-turn evaluation methods, revealing a maximum chain completion rate of 31%. This work is significant for practitioners as it highlights the complexities of task management and state preservation in AI agents, informing the design of more robust systems for long-duration user interactions.

arXiv cs.AI · 34 d ago · found 20 d ago · #llm #agents #task-composition

MetaPS: Adaptive Programmatic Strategy Selection for Market Agents

MetaPS is a new framework for adaptive programmatic strategy selection in financial markets, utilizing a library of executable strategies that respond to changing market conditions. It employs a simulation-guided approach to identify optimal strategy-state pairs, which are then used for supervised fine-tuning, enhancing performance across models ranging from 0.8B to 9B parameters. The framework shows significant improvements over fixed-strategy baselines and direct decision-making agents, indicating that market simulations can effectively provide targeted supervision for developing interpretable and adaptable trading strategies.

arXiv cs.AI · 34 d ago · found 20 d ago · #market agents #strategy selection #LLM

An LLM-Explainable DRL Framework for Passenger-Directed Autonomous Driving

This article presents a novel framework that integrates deep reinforcement learning (DRL) with large language model (LLM) explainability for autonomous driving systems. The DRL agents, trained using a Dueling Double Deep Q-Network, effectively adapt to driving requests such as "fast," "comfort," and "stop," while LLM modules provide real-time explanations of the agents' behaviors to passengers. This approach enhances public trust in autonomous vehicles by improving transparency and safety, making it significant for practitioners focused on developing explainable AI in transportation.

arXiv cs.AI · 34 d ago · found 20 d ago · #autonomous driving #explainability #llm

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows

VADAOrchestra is a neurosymbolic framework designed to enhance decision-making workflows by integrating Large Language Models (LLMs) with logical reasoning. It utilizes a hybrid architecture where an LLM-based orchestrator incrementally adapts workflows based on user queries and data sources, encoding these workflows as Datalog+/- logic programs. This approach facilitates verifiable reasoning and scalability, allowing for complex reasoning over large datasets while maintaining auditability and reproducibility, as evidenced by evaluations on real-world financial use cases.

arXiv cs.AI · 34 d ago · found 20 d ago · #neurosymbolic #reasoning workflows #LLM

The Ratchet Effect in Silico: How Interaction Drives Cumulative Intelligence in Large Language Models

The article introduces POLIS (Population Orchestrated Learning and Inference Society), a framework designed to enhance cumulative intelligence in large language models through interaction among heterogeneous agents. It reports that populations of models with 1-4 billion parameters achieved significant improvements of 8.8-18.9 points on mathematical reasoning benchmarks compared to base models, effectively narrowing the performance gap with larger 70 billion parameter models. This research highlights the importance of structured social interaction and peer verification as mechanisms for knowledge retention and growth, suggesting a new avenue for scaling LLM performance beyond mere parameter increases.

arXiv cs.AI · 34 d ago · found 14 d ago · #cumulative-intelligence #large-language-models

When Agents Commit Too Soon: Diagnosing Premature Commitment in LLM Agents

This paper introduces a diagnostic for identifying premature commitment in long-horizon LLM agents, where agents settle on a single interpretation of evidence too early in the reasoning process. The authors define representational commitment through cross-run hidden-state convergence and demonstrate its predictive power for behavioral consistency across models like Llama-3.1-70B, Qwen-2.5-72B, and Phi-3-14B, with high AUROC scores for detecting inconsistent trajectories. This research highlights a critical failure mode that impacts reasoning reliability, emphasizing the need for runtime monitoring and intervention strategies to mitigate variance without sacrificing accuracy.

arXiv cs.AI · 34 d ago · found 20 d ago · #premature commitment #llm #diagnosis

PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

PaperClaw is a multi-agent system designed to automate the research process from literature curation to paper writing. It operates through an iterative propose-test-reflect loop and uses a full-lifecycle memory to maintain context, so projects can be paused and resumed. The system integrates a human-in-the-loop mechanism for refinement and has been evaluated against an LLM judge. The evaluation indicates it can produce high-quality research papers both autonomously and with human input, which matters for practitioners aiming to improve research efficiency and output quality in AI.

arXiv cs.AI · 34 d ago · found 20 d ago · #multi_agent_system #autonomous_research #paper_generation

Optimization-as-a-Service via Multi-Agent Large Language Model for Radio Access Networks

The article presents a novel approach to physical resource block (PRB) allocation in Radio Access Networks (RANs) through an Optimization-as-a-Service (OaaS) framework utilizing a multi-agent large language model (LLM-MA). This system features a closed-loop architecture with agents that dynamically formulate optimization problems and objectives, incorporating a one-shot reflection distillation mechanism to minimize computational latency. Experimental results indicate that the proposed framework achieves near-optimal resource allocation with significantly reduced inference latency, addressing the challenges posed by the dynamic conditions of sixth-generation (6G) environments.

arXiv cs.AI · 34 d ago · found 20 d ago · #optimization #radio #networks

Democratizing and accelerating AI-driven pathology research through agentic intelligence

PathLab is a newly introduced autonomous agentic framework designed to simplify computational pathology by translating natural-language research objectives into executable workflows. It organizes workflow generation using reusable modules for tasks such as data preprocessing and model evaluation, achieving performance comparable to expert implementations across 12 public datasets in tasks like image classification and segmentation. This framework significantly reduces the time needed to create analytical pipelines, enabling domain experts without programming skills to independently conduct computational pathology studies, thereby democratizing access to advanced AI methodologies in this field.

arXiv cs.AI · 34 d ago · found 20 d ago · #computational pathology #workflow #autonomous

From Question Answering to Task Completion: A Survey on Agent System and Harness Design

This survey presents an analysis of LLM-based agents, emphasizing their evolution from passive question answering to active task completion through the integration of execution harnesses. It decomposes agent systems into six runtime responsibilities—observation, context, control, action, state, and verification—and explores how these interact with foundational models, highlighting the importance of harness configurations on task efficiency and reliability. The findings underscore the need for a holistic approach to agent design that considers model-harness co-evolution, open challenges in evaluation, and safety, which are critical for practitioners developing advanced AI systems.

arXiv cs.AI · 34 d ago · found 20 d ago · #llm #task completion #agent systems

Task-Differentiated Atomic Skill Expansion and Routing for Continual Learning Across Highly Heterogeneous Tasks

The article presents Task-Differentiated Atomic Skill Expansion and Routing (TASER), a continual learning framework designed to tackle challenges in heterogeneous task environments by dynamically expanding atomic skills based on task divergence and model uncertainty. TASER employs orthogonality-enhanced skill detection and a skill dynamic routing mechanism to ensure skills are semantically distinct and task-relevant. The introduction of the HeteroCLBench benchmark, which includes 19 diverse tasks across 9 cognitive dimensions, demonstrates TASER's superior performance in enhancing model plasticity and mitigating catastrophic forgetting compared to existing methods.

arXiv cs.AI · 34 d ago · found 16 d ago · #continual_learning #task_differentiation

GRAG: Generic Response-Augmented Generation Framework for Personalized Conversational Systems

The Generic Response-Augmented Generation (GRAG) framework has been introduced to enhance personalized conversational agents by decoupling content grounding from personalization, addressing computational challenges in resource-constrained settings. GRAG utilizes offline, generic responses from large language models (LLMs) to guide the fine-tuning of smaller, task-specific models, resulting in significant performance improvements on benchmark datasets, achieving up to 47% higher ROUGE-2 and 36% higher BLEU scores compared to existing methods. This framework provides a scalable approach for developing grounding-aware conversational systems that maintain contextual relevance while ensuring personalized interactions.

arXiv cs.CL · 34 d ago · found 13 d ago · #conversational #personalization #framework

Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

The paper presents a study on the impact of unreliable feedback on tool-using LLM agents, specifically evaluating how misleading information can lead to a value inversion where agents perform worse than without feedback. Using the Qwen2.5-7B model on the HotpotQA benchmark, the research shows significant performance discrepancies: 44.8 F1 with reliable retrieval drops to 4.7 F1 under misleading conditions. These findings emphasize the need for robust fallback mechanisms and careful evaluation of tool-augmented agents, as relying solely on external feedback may misrepresent their effectiveness.

arXiv cs.AI · 34 d ago · found 20 d ago · #llm #agents #feedback
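
For readers unfamiliar with the metric behind the 44.8 and 4.7 figures, the sketch below computes a token-overlap F1 between a predicted and a gold answer. It assumes the paper reports the SQuAD-style answer F1 that is customary for HotpotQA, and it simplifies normalization to lowercasing and whitespace splitting (the usual script also strips punctuation and articles). The `token_f1` name and the example strings are illustrative.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Paris", "paris")                 # identical after lowercasing
verbose = token_f1("the city of paris", "paris")   # correct but padded answer
misled = token_f1("london", "paris")               # answer copied from a bad retrieval
```

A benchmark score is this value averaged over all questions and scaled to 0-100, so a drop from 44.8 to 4.7 means that under misleading feedback most answers share almost no tokens with the gold answer.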

Building Agent Harnesses for Scientific Curation from Multimodal Sources

The article introduces Beaver, a new agent harness designed for structured scientific curation from multimodal sources, which effectively extracts information from diverse formats such as text, tables, and figures while maintaining provenance. Beaver integrates a frontier agent with multimodal evidence tooling and task scaffolding, achieving a Gold-Referenced Attribute Score (GRAS) of 81.0, surpassing previous frontier agents by over 23 points. This development highlights the importance of harness design in enhancing agent performance for scientific workflows, particularly in tasks requiring cross-modal reasoning.

arXiv cs.AI · 34 d ago · found 20 d ago · #scientific curation #multimodal #agents

ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

The article introduces Environment-Native Verified Search (ENVS), a novel approach for training long-horizon GUI agents that leverages environment feedback to enhance supervision during policy optimization. ENVS utilizes a search-and-filter pipeline in live OSWorld VMs to verify successful GUI actions, achieving a pass rate of 30.3 on the 300-task OSWorld benchmark and 29.0 on the OSWorld-Noisy benchmark, while reducing computational costs from 184-192 GPU hours to 138-153 GPU hours. This method not only improves task performance but also enhances visual-reasoning capabilities in noisy environments, making it a valuable advancement for practitioners developing robust AI agents for real-world applications.

arXiv cs.AI · 34 d ago · found 20 d ago · #gui #agents #training

CoorDex: Coordinating Body and Hand Priors for Continuous Dexterous Humanoid Loco-Manipulation

CoorDex introduces a novel learning pipeline for continuous dexterous humanoid loco-manipulation, enabling a Unitree G1 humanoid equipped with a 20-DoF WUJI hand to perform complex tasks while in motion. The system utilizes coordinated latent residual control, leveraging proprioception-conditioned latent priors distilled from high-dimensional demonstrations to enhance finger-level contact reliability and maintain natural body motion. This approach outperforms traditional methods in joint-space control and monolithic latent prediction, making it a significant advancement for practitioners focusing on high-DoF manipulation tasks in robotics.

arXiv cs.AI · 34 d ago · found 15 d ago · #humanoid #loco-manipulation #dexterous control

Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

The paper introduces a framework for understanding context management in long-horizon LLM agents, particularly focusing on how plans are managed and their impact on performance. It presents the concept of replay pairing to diagnose the decay of plan signals in the hidden state of Llama-3.1-70B, revealing that plans do not persist as state and are heavily dependent on remaining in context. The findings indicate that naive eviction of plans can significantly reduce task success rates, underscoring the importance of effective context management strategies for practitioners developing LLM applications.

arXiv cs.AI · 34 d ago · found 20 d ago · #context management #llm #plans
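
The warning about naive eviction can be made concrete with a toy context window. The sketch below contrasts a keep-the-most-recent policy with one that always retains messages flagged as the plan. It is an illustration under assumptions: the `Message` type, the `pinned` flag, and a budget counted in messages rather than tokens are all hypothetical, and the paper's own mitigation may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str
    pinned: bool = False  # hypothetical flag marking the plan


def evict_naive(history: list[Message], budget: int) -> list[Message]:
    """Keep only the most recent `budget` messages, whatever they contain."""
    return history[-budget:] if budget > 0 else []


def evict_keep_pinned(history: list[Message], budget: int) -> list[Message]:
    """Keep every pinned message, then fill the remaining budget with recent ones."""
    pinned = [m for m in history if m.pinned]
    room = max(budget - len(pinned), 0)
    unpinned = [m for m in history if not m.pinned]
    recent = unpinned[-room:] if room > 0 else []
    keep = {id(m) for m in pinned + recent}
    # Preserve the original ordering of whatever survives.
    return [m for m in history if id(m) in keep]


history = [
    Message("assistant", "PLAN: 1) open report 2) export CSV 3) email it", pinned=True)
] + [Message("tool", f"observation {i}") for i in range(1, 6)]

naive = evict_naive(history, budget=3)        # the plan falls out of context
kept = evict_keep_pinned(history, budget=3)   # the plan stays, two observations remain
```

If plans are not carried in the model's hidden state, as the paper reports, the first policy silently removes the only copy of the plan once enough tool output accumulates.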

You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

The article presents a novel approach to enhancing pretrained generative robot policies by utilizing a constant initial noise vector, termed a "golden ticket," instead of sampling from a Gaussian distribution. This method, applicable to various diffusion and flow matching policies, demonstrated significant performance improvements across 46 out of 51 tasks, with success rates increasing by up to 55% in simulated environments and 28% in real-world scenarios. The approach requires no additional training or infrastructure, making it easily deployable, and the authors have released a codebase containing pretrained policies and golden tickets for further experimentation.

arXiv cs.AI · 34 d ago · found 14 d ago · #robot policies #policy improvement #generative models
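
The core idea, reusing one fixed initial noise vector instead of drawing a fresh Gaussian sample on every call, can be shown with a toy stand-in for a frozen policy. Everything below is an assumption made for illustration: `toy_policy` is not a diffusion or flow-matching model, and the random search used to pick the vector is not necessarily how the authors find their golden tickets.

```python
import random


def toy_policy(obs: float, noise: list[float]) -> float:
    """Stand-in for a frozen generative policy: deterministic given (obs, noise)."""
    return obs + 0.1 * sum(noise)


def mean_reward(noise: list[float], observations: list[float]) -> float:
    # Toy task: the ideal action is obs + 0.05; reward is negative absolute error.
    errors = [abs(toy_policy(o, noise) - (o + 0.05)) for o in observations]
    return -sum(errors) / len(errors)


def search_golden_ticket(rng, observations, n_candidates=64, dim=4):
    """Random search (an assumption): score candidate noise vectors, keep the best."""
    candidates = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_candidates)
    ]
    scores = [mean_reward(c, observations) for c in candidates]
    best = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best], scores


rng = random.Random(0)
observations = [0.0, 0.5, 1.0]
ticket, scores = search_golden_ticket(rng, observations)

# At deployment the policy weights are untouched; only the noise source changes.
ticket_score = mean_reward(ticket, observations)
average_sampled_score = sum(scores) / len(scores)
```

The policy itself is never retrained: the search only evaluates existing behavior under different starting noise, which is why the approach needs no additional training.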

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

AGENTSERVESIM is introduced as a hardware-aware simulator specifically designed for multi-turn LLM agent serving, addressing the complexities of stateful program execution that traditional simulators overlook. It features components such as a Program Orchestrator, Tool Simulator, Session-Aware Router, and KV Residency Model, allowing comprehensive evaluation of serving policies, and is reported to match real-system behavior to within 6% while running on commodity CPUs. This tool is significant for practitioners as it facilitates controlled exploration of agent-serving strategies, optimizing performance without the need for extensive and costly hardware deployments.

arXiv cs.AI · 34 d ago · found 13 d ago · #multi-turn #LLM #simulation

EHR-Complex: Benchmarking Medical Agents for Complex Clinical Reasoning

EHR-Complex is a newly introduced benchmark aimed at evaluating interactive clinical database reasoning, built on the MIMIC-IV dataset, which includes 365K patients and over 500 million records. It consists of approximately 52,000 tasks across six clinical intents, requiring agents to execute SQL queries or Python code in a sandboxed environment, reflecting real-world complexities in EHR analysis. The benchmark reveals significant challenges in SQL task execution, with top models achieving only 62.3% accuracy and exposing failure modes such as SQL logic errors and semantic misunderstandings, underscoring the need for improved reasoning capabilities in clinical AI applications.

arXiv cs.AI | 34 d ago | found 20 d ago | #benchmark #clinical

Latent Goal Prediction from Language for Model-Based Planning

The article introduces Latent Goal Prediction from Language (LAGO), a novel framework designed for model-based planning that predicts intermediate goal states from language instructions and action-conditioned rollouts within a shared latent space. LAGO addresses the limitations of traditional methods by dynamically decomposing instructions into tractable latent subgoals, allowing for coherent long-horizon planning without the degradation seen in prior approaches. This advancement is significant for practitioners as it combines the precision of visual goals with the flexibility of language, enhancing the effectiveness of planning in complex environments.

arXiv cs.AI | 34 d ago | found 20 d ago | #planning #language #goals

Can Reasoning Models Detect Changes to their Chains of Thought?

This study investigates the ability of recent reasoning models to detect modifications in their chains of thought (CoT), assessing their performance under various conditions, including self and external CoT alterations. The findings reveal that models demonstrate only modest detection accuracy and struggle to identify the nature of the changes, indicating that interventions may not significantly alter model behavior. This research highlights the challenges in ensuring model robustness against tampering, which is crucial for practitioners aiming to improve safety and reliability in AI applications.

arXiv cs.AI | 34 d ago | found 20 d ago | #reasoning #intervention #chains of thought

Managing Procedural Memory in LLM Agents: Control, Adaptation, and Evaluation

The paper introduces AFTER, a benchmark comprising 382 enterprise tasks across six professional roles and 22 procedural skills, aimed at evaluating the transferability of skills in LLM agents. Results indicate that procedural memory enhances performance in industrial workflows, with a single refinement round yielding a 3.7-6.7 point improvement and achieving 73.1% cross-model test accuracy through multi-model execution traces. This research offers critical insights for practitioners on effectively implementing and assessing procedural memory systems in AI agent applications.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #memory #benchmark

SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery

SciVerseGym is a new, Gymnasium-compatible environment designed for reinforcement learning and Bayesian optimization in crystal discovery, framing the problem as a Markov decision process. It allows agents to perform various chemically meaningful actions, such as elemental substitution and atomic displacement, and evaluates candidates using machine-learned interatomic potentials or ASE-compatible calculators, facilitating an open and extensible framework for researchers in materials science. This environment supports customizable chemical spaces and rewards, making it a valuable tool for practitioners aiming to streamline and enhance closed-loop crystal search methodologies.

arXiv cs.AI | 34 d ago | found 20 d ago | #reinforcement learning #crystal discovery #environment

Active Inference as the Test-Time Scaling Law for Physical AI Agents

The paper introduces a novel test-time scaling law for physical AI agents based on active inference, allowing for effective reasoning and generalization in unforeseen scenarios. This law dynamically updates the agent's policy via a soft Bayesian inference process that minimizes prediction errors, enabling learning beyond the training distribution. Simulation results indicate that this approach significantly outperforms traditional methods like Q-learning and Bayesian reinforcement learning in autonomous driving tasks, enhancing inference efficiency by over 36%.

arXiv cs.AI | 34 d ago | found 20 d ago | #active inference #ai agents #scaling laws

Dementia-Agents: A Multi-Modal Multi-Agent System for Dementia Staging and Phenotyping

The paper introduces Dementia-Agents, a multi-modal multi-agent system designed for staging and phenotyping dementia by integrating diverse clinical assessments in real-world settings. The framework employs a three-step workflow involving a data agent for processing clinical records, multiple expert agents for generating predictions, and a coordinator agent for aggregating results, demonstrating improved diagnostic performance over existing multi-modal large language models and prior systems. This approach is significant for practitioners as it enhances the interpretability and accuracy of dementia diagnosis, addressing the complexities of syndrome-level assessments beyond traditional Alzheimer's-focused models.

arXiv cs.CL | 34 d ago | found 13 d ago | #dementia #multi-modal #system

MindTailor: Personalized Emotional Support via Post History-Grounded Case Formulation and Collaborative Refinement

MindTailor is a framework designed to provide personalized emotional support by utilizing a seeker's post history for case formulation and refining responses through collaborative critique among counselor agents. It introduces the ReddiSupp dataset, consisting of 798 Reddit posts and corresponding post histories, which facilitates the evaluation of this history-aware approach. The framework demonstrates superior performance in empathy, personalization, and understanding compared to baseline models, making it a significant advancement for practitioners aiming to enhance mental health support systems using LLMs.

arXiv cs.CL | 34 d ago | found 13 d ago | #emotional support #llm #collaboration

LLM-assisted gNB Parameter Configuration for Radio Access Network

This paper presents a framework leveraging a large language model (LLM) for automatic parameter configuration of gNB in radio access networks (RANs). By fine-tuning the LLM using synthetic training data derived from gNB error logs, the system achieves a significant accuracy improvement in correcting misconfigurations, from 13.8% to 85.4%, and up to 92.7% with retrieval-augmented generation (RAG) on an OpenAirInterface testbed. This advancement is crucial for practitioners as it facilitates scalable, autonomous operations in RANs, reducing reliance on manual configurations and enhancing system reliability.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #network-configuration #automation

Process-Reward Tactic Evolution for Long-Horizon Bioinformatics Workflows

The paper introduces Process-Reward Tactic Evolution, a training framework designed for LLM agents to effectively manage long-horizon bioinformatics workflows using Galaxy. This framework utilizes a curriculum-based approach to train agents on workflow execution, incorporating a tactic library derived from verified workflow rollouts, which enhances the agent's ability to construct workflows, monitor execution, and ensure biological correctness. The evaluation demonstrates that this process-supervised tactic accumulation significantly improves workflow completion rates and execution efficiency compared to traditional no-memory and reflection-style approaches, highlighting its relevance for practitioners developing complex bioinformatics applications.

arXiv cs.AI | 34 d ago | found 20 d ago | #bioinformatics #workflow #llm

AutoRAS: Learning Robust Agentic Systems with Primitive Representations

AutoRAS is a newly proposed framework for the automated design of robust agentic systems, focusing on optimizing sequences of symbolic primitives that encode both structural connectivity and behavioral actions. The framework leverages execution-derived safety signals and flow-based objectives, demonstrating superior performance in both standard and adversarial settings with minimal degradation under attacks. This approach is significant for practitioners as it enhances the robustness of large language models in multi-agent environments, addressing vulnerabilities that can arise from external adversaries and internal failures.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #agents #robustness

BioInsight: Multi-Agent Orchestration for Interactive Biomedical Knowledge Discovery

BioInsight is a newly introduced multi-agent system designed for interactive biomedical knowledge discovery, moving beyond static report generation to create interactive evidence-centered interfaces. It organizes disease-specific evidence through various artifacts such as ranked pathways and citation-grounded reports, and evaluates its performance on standardized biomedical QA and protein-function reasoning tasks, achieving superior results. This development is significant for practitioners as it emphasizes the need for dynamic, interactive tools that enhance research decision-making by allowing users to explore and refine hypotheses based on comprehensive evidence.

arXiv cs.AI | 34 d ago | found 20 d ago | #multi-agent #biomedical #knowledge discovery

DART: Draft-Agreement Routing for Training-Free Adaptive Thinking Budgets in Hybrid Reasoning Models

DART is a training-free routing framework for hybrid reasoning models that adapts the thinking budget to query difficulty without requiring labeled training data. It samples two no-think drafts to decide whether to answer directly or allocate additional reasoning tokens, improving accuracy by up to 9.0 points on math reasoning and 22.5 points on code reasoning while reducing token usage by 15-69%. The approach applies across model sizes from 0.6B to 32B and across settings, making it useful for practitioners who want to optimize resource allocation in AI reasoning tasks.

arXiv cs.AI | 34 d ago | found 20 d ago | #routing #reasoning
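
The draft-agreement routing summarized for DART above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `dart_route`, `normalize`, and `scripted` are hypothetical names, "agreement" is taken to mean an exact match after whitespace and case normalization, and the two model calls are replaced by stub functions.

```python
def normalize(answer):
    # Assumed agreement criterion: compare answers after trimming,
    # lowercasing, and collapsing whitespace.
    return " ".join(answer.strip().lower().split())

def dart_route(query, draft_fn, think_fn, n_drafts=2):
    """Sample cheap no-think drafts; answer directly if they agree,
    otherwise spend reasoning tokens via think_fn."""
    drafts = [draft_fn(query) for _ in range(n_drafts)]
    if len({normalize(d) for d in drafts}) == 1:
        return drafts[0], "direct"
    return think_fn(query), "think"

def scripted(outputs):
    # Test helper: returns a stub model that replays fixed outputs in order.
    it = iter(outputs)
    return lambda query: next(it)

# Drafts agree (after normalization), so the first draft is returned as-is.
easy = dart_route("2 + 2?", scripted(["4", " 4 "]), lambda q: "4 (reasoned)")

# Drafts disagree, so the query is routed to the thinking path.
hard = dart_route("hard one?", scripted(["17", "19"]), lambda q: "18 (reasoned)")
```

The appeal of this design is that the routing signal comes from the model's own cheap outputs, so no labeled difficulty data or trained router is needed. The agreement test is the main knob, and a real system would likely need something more tolerant than string equality.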

RIZZ: Routing Interactions to Near Zero-Interference Zones for Continual Adaptation of Black-Box Agents

RIZZ (Routing Interactions to Near Zero-Interference Zones) is a new continual adaptation framework designed for black-box agents, enabling them to adapt to diverse user inputs and tasks without access to model weights. It utilizes a verifier-gated memory system and context-aware routing to dynamically manage input streams and compile prompts from various memory branches, thereby controlling interference from nonstationary feedback. This approach shows improved performance against existing state-of-the-art methods on competitive benchmarks, making it significant for practitioners developing adaptive AI systems that require robust online learning capabilities.

arXiv cs.AI | 34 d ago | found 20 d ago | #language models #adaptation #memory

From Knowing to Acting: Benchmarking Self-Awareness Capability of LLM Agents

The paper introduces KAPRO, a new benchmarking framework designed to evaluate the self-awareness capability of LLM agents by distinguishing between their metacognitive judgment and execution actions. It also presents the KAware dataset, which categorizes tasks into external, internal, and hybrid types to assess agents' cognitive alignment. The findings indicate that self-awareness is crucial for task success, with open-source models showing tendencies for tool overuse, which highlights the need for improved cognitive gating in LLM architectures.

arXiv cs.AI | 34 d ago | found 20 d ago | #self-awareness #llm #benchmarking

Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

The paper introduces Lexical Consensus, a framework for grounded word learning in AI agents, utilizing frozen DINOv2 visual embeddings and Carroll-style nonce words. Key findings reveal a robust perceptual-coherence gradient in lexical acquisition, where native categories are learned most easily, and perceptual distance is a significant predictor of acquisition accuracy, while semantic distance has negligible impact. This research highlights the importance of perceptual geometry in grounding lexical meanings, which is crucial for practitioners developing AI systems that require effective word learning and concept generalization.

arXiv cs.AI | 34 d ago | found 15 d ago | #lexical #word #learning

RaMem: Contextual Reinstatement for Long-term Agentic Memory

The paper introduces RaMem (Contextual Reinstatement for Agentic Memory), a framework designed to enhance long-term memory in LLM agents by addressing context collapse, where retrieved memories lack the contextual information needed to serve as valid evidence. RaMem employs a four-stage process of evidence anchoring, recall condition induction, validity-aware retrieval, and context-preserved synthesis, and reports average F1 score increases exceeding 10% on long-term memory benchmarks across various model backbones. For practitioners, this enables more accurate and contextually relevant memory use, improving the reliability of LLMs in complex, evolving tasks.

arXiv cs.AI | 34 d ago | found 20 d ago | #memory #llm #context

When Web Agents Finish but Still Fail: Reproducible Triggers and Trace Diagnostics for Parallel Web Exploration

The paper introduces Parallel WebBench, a benchmark for evaluating long-horizon web agents, comprising 1,679 verified records for analyzing failure modes in web exploration. The authors train WebExplorer-style agents using GRPO, raising the completion rate from 50.7% to 96.0% and the F1 score from 0.2489 to 0.4529 at 16k context and 16 interaction rounds. Despite the higher completion rate, the study highlights persistent context-bound search loops, premature termination, and synthesis collapse, indicating a need for better evidence-grounded coverage and diagnostics to close the completion-correctness gap.

arXiv cs.AI | 34 d ago | found 20 d ago | #web agents #failure analysis #benchmark

IRumAI: Reinforcement Learning for Indian Rummy

IRumAI is a reinforcement learning agent for Indian Rummy that uses Proximal Policy Optimization (PPO) and a dual-branch convolutional architecture. It achieves a 53.9% win rate against the strongest search-based opponent with an inference time of 0.33 ms per action, over 7,000 times faster than existing heuristic methods. The result offers practitioners a way to handle complex hidden-information games efficiently without relying on explicit search techniques.

arXiv cs.AI | 34 d ago | found 20 d ago | #reinforcement learning #rummy #PPO