
Daily digest № 29 — Thursday, July 16, 2026

The day in AI, distilled.


found 12 d ago · Anthropic News · Products

Introducing Claude Tag

Claude Tag is a new collaborative feature designed for teams using the Claude language model. It allows for enhanced organization and management of conversations and tasks, potentially improving workflow efficiency. This development is significant for practitioners as it introduces a structured approach to leveraging LLM capabilities in team environments.

claude-tag · team-collaboration

32 d ago · found 12 d ago · MarkTechPost · Products

Gradium Launches stt-translate and s2s-translate, Real-Time Speech Translation Models Beating gpt-realtime-translate on Accuracy and Latency

Gradium has launched two real-time speech translation models, stt-translate and s2s-translate, which cover 20 language pairs including English, French, German, Spanish, and Portuguese. They replace the traditional three-model setup with a two-model architecture: transcription and translation happen in a single pass, followed by a text-to-speech (TTS) stage. Gradium reports higher accuracy and lower latency than gpt-realtime-translate and gemini-3.5-live-translate. The models also offer output voice selection and cloning, which may improve the user experience in multilingual applications.

speech_translation · models · gradium

32 d ago · found 12 d ago · The Decoder · Products

OpenAI says ChatGPT Instant now better understands what users actually want

OpenAI has released an update for its GPT-5.5 Instant model, enhancing its ability to understand user intent and manage context over multiple interactions. The improvements focus on better recognition of complex, multi-condition prompts, which is crucial for developers aiming to create more responsive and contextually aware AI applications. This update is significant for practitioners as it allows for more nuanced interactions and potentially increases user satisfaction in conversational AI systems.

chatgpt · updates · openai

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Multimodal

SDXL running locally in the browser on WebGPU, open-source

An open-source browser extension has been released that runs SDXL image generation locally via WebGPU, with no complex installation. It supports two model versions: SDXL-Lightning fp16 (approximately 7 GB storage, requiring around 8 GB VRAM) and a 4-bit variant for lower-spec hardware (about 3.6 GB storage, needing 4-5 GB VRAM). Performance is limited by synchronous WebGPU shader compilation, but the extension lets practitioners run image generation directly in the browser without extensive setup.

sdxl · webgpu · image generation

32 d ago · found 12 d ago · MarkTechPost · Agents

How to Design an OpenHarness Style Agent Runtime with Tools, Memory, Permissions, Skills, and Multi-Agent Coordination

The article presents a tutorial on constructing an OpenHarness-style agent runtime, detailing the implementation of core components such as tool use, typed tool schemas, permissions, lifecycle hooks, memory management, skills, context compaction, retry logic, cost tracking, and multi-agent coordination. It emphasizes an open control flow, allowing practitioners to experiment with the system without reliance on external APIs or infrastructure. This approach provides valuable insights for AI engineers looking to build flexible and scalable agent systems.

agent_runtime · tools · multi-agent
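A minimal sketch of three of the components listed above (a tool registry, a permission gate, and a call log used as memory). The class names, the permission labels, and the two toy tools are hypothetical and are not taken from the tutorial or from OpenHarness itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Tool:
    name: str
    func: Callable[..., str]
    permission: str  # hypothetical label such as "read" or "write"


@dataclass
class Runtime:
    granted: Set[str]
    tools: Dict[str, Tool] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs: str) -> str:
        tool = self.tools[name]
        # Permission gate: refuse before the tool body runs.
        if tool.permission not in self.granted:
            raise PermissionError(f"{name} needs '{tool.permission}'")
        result = tool.func(**kwargs)
        # Memory: keep a plain log of each call and its result.
        self.memory.append(f"{name}({kwargs}) -> {result}")
        return result


runtime = Runtime(granted={"read"})
runtime.register(Tool("echo", lambda text: text, "read"))
runtime.register(Tool("delete", lambda path: "deleted " + path, "write"))
print(runtime.call("echo", text="hello"))
```

Because only "read" is granted here, calling the `delete` tool raises `PermissionError` before its body runs.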

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Coding

I reverse engineered Windows Copilot into a free OpenAI compatible API (GPT-4, no API key, no billing)

A developer has reverse-engineered the Windows Copilot to create an unofficial API that allows users to access GPT-4 without an API key or billing, utilizing their own Microsoft account. The setup exposes a local server at `http://localhost:8000/v1`, enabling compatibility with the OpenAI SDK for streaming and multi-turn conversations, making it a cost-effective solution for lightweight AI workloads and automation. This project offers practitioners a way to leverage GPT-4 capabilities for personal and educational use without incurring costs associated with standard API access.

openai · api · gpt-4

32 d ago · found 12 d ago · TechCrunch AI · Products

Facebook rolls out an AI companion app for creators

Facebook has announced the rollout of an AI companion app for creators, currently in testing with a select group. This app integrates Facebook's recently launched AI creator assistant, aimed at enhancing content creation capabilities. This development is significant for practitioners as it suggests a shift towards more AI-driven tools for content generation within social media platforms.

facebook · ai · creator-assistant

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Coding

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

A user has reported discrepancies in output quality between vLLM and llama.cpp while testing the same model under similar settings and quantizations. Although vLLM demonstrates superior performance and concurrency, it exhibits issues such as formatting errors, context retention failures, and lower quality code outputs. This raises questions about the impact of quantization, configuration, and template parsing on inference quality, which is critical for practitioners optimizing model deployment and performance.

vllm · llama.cpp · comparison

32 d ago · found 12 d ago · The Decoder · Models

Snowflake CEO finds GLM-5.2 competitive with Opus 4.7 at a fraction of the cost

Zhipu AI's GLM-5.2 has demonstrated competitive performance with Anthropic's Claude Opus 4.7 in a benchmark involving 103 coding tasks, achieving similar results at one-fifth the cost per output token. However, GLM-5.2 consumes nearly twice as many tokens per task, highlighting a trade-off between cost efficiency and token usage. This pricing disparity could impact the market dynamics and valuations of Western AI companies.

glms · benchmark · cost
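The trade-off in this item can be made concrete with a small calculation. This sketch assumes cost per task is simply price per output token times output tokens used, and it treats "nearly twice" as exactly 2x, so the result is an upper bound on the relative cost. The function name is illustrative.

```python
def relative_task_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost per task of model A relative to model B.

    price_ratio: A's price per output token divided by B's.
    token_ratio: A's output tokens per task divided by B's.
    """
    return price_ratio * token_ratio


# One-fifth the price per token, about twice the tokens per task.
glm_vs_opus = relative_task_cost(price_ratio=1 / 5, token_ratio=2.0)
print(f"relative cost per task: {glm_vs_opus:.2f}")
```

Under these assumptions the per-task cost comes out at roughly two-fifths rather than one-fifth.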

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Open Source

Sipp - an open-source library for in-browser inference built on llama.cpp

Sipp is an open-source library designed for in-browser inference, built on the llama.cpp framework. This library allows developers to leverage LLaMA models directly in the browser, facilitating real-time AI applications without server dependency. Its release is significant for practitioners looking to implement efficient, client-side AI solutions while maintaining model accessibility and performance.

llama · inference · open_source

32 d ago · found 12 d ago · The Verge — AI · Products

Figma now has AI motion graphics and shader tools

Figma has introduced AI-powered motion graphics and shader tools, enhancing its design platform to support full-stack development. The updates aim to automate repetitive tasks and streamline workflows by integrating AI agents with design tools. This is significant for practitioners as it allows for more efficient design processes and the potential for advanced visual effects without extensive manual coding.

figma · ai · motion-graphics

32 d ago · found 12 d ago · TechCrunch AI · Products

Figma adds code layers, support for animations, more AI features in new update

Figma's latest update introduces a new code layer that allows for the integration of custom scripts, alongside enhanced support for motion and shader effects. Additionally, it enables the creation of custom plugins utilizing AI, which can streamline various design tasks. This update is significant for practitioners as it expands Figma's capabilities for developing interactive and dynamic designs, integrating coding directly into the design workflow.

figma · ai · features

32 d ago · found 12 d ago · TechCrunch AI · Products

OpenAI unveils its first custom chip, built by Broadcom

OpenAI and Broadcom have announced the Jalapeño chip, a custom AI inference chip specifically designed for large language model (LLM) optimization. This chip aims to enhance performance, efficiency, and scalability in AI systems, potentially benefiting practitioners working with LLMs by providing a dedicated hardware solution for inference tasks.

openai · jalapeno · custom-chip

32 d ago · found 12 d ago · The Verge — AI · Products

OpenAI reveals its first AI processor: Jalapeño

OpenAI has announced the Jalapeño, its first AI processor developed in collaboration with Broadcom, designed specifically for AI inference. This ASIC targets the needs of current and future large language models, enhancing performance for AI applications. The introduction of dedicated hardware like Jalapeño is significant for practitioners, as it may optimize model deployment and inference efficiency in production environments.

openai · jalapeno · ai-processor

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Inference

OpenAI and Broadcom unveil LLM-optimized inference chip

openai · broadcom · llm · chip

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Research

The Swiss Federal Supreme Court is evaluating Heretic

The Swiss Federal Supreme Court is assessing the Heretic model for its own applications, particularly to address issues with LLMs denying legitimate requests. A paper titled “Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts” explores solutions to this problem, highlighting Heretic's effectiveness in its analysis. This evaluation is significant for practitioners as it indicates potential advancements in handling LLM alignment issues in legal contexts.

llm · alignment · court

32 d ago · found 12 d ago · The Decoder · Products

OpenAI and Broadcom unveil "Jalapeño," a custom chip built for LLM inference

llm · inference · openai

32 d ago · found 12 d ago · Reddit r/LocalLLaMA · Training

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

The author modified GLM-5.2 to raise throughput from approximately 2.5 tokens/second to over 50 tokens/second on a custom GH200 system equipped with dual Hopper H100 GPUs and dual Grace CPUs. The optimization involved merging the MTP head from zai's GLM-5.2-FP8 repository with CyanKiwi's AWQ quantized version, which required specific adjustments to the vLLM framework. The result shows how much performance practitioners can recover through architectural tweaks and custom configurations, particularly in high-performance computing environments.

llm · optimization

33 d ago · found 12 d ago · The Decoder · Products

OpenAI's deployment chief on Codex growth, falling AI prices, and the ROI question

OpenAI's deployment chief, Arnaud Fournier, discussed the rapid growth of Codex and the strategic integration of AI into large corporations through dedicated engineering teams. He noted a significant decrease in AI costs, attributing this to improved efficiencies and a robust feedback loop from customers that informs model development. This insight is crucial for practitioners as it highlights the evolving landscape of AI deployment and the potential for enhanced return on investment through optimized AI solutions.

codex · openai · deployment

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Agents

Nex-N2-Mini-Ultra-Uncensored-Heretic Is Out Now, an Agentic Model With Agentic Thinking Now Uncensored With 5/100 Refusals and 0.0020 KLD, Available in Safetensors and GGUF Formats!

The Nex-N2-Mini-Ultra-Uncensored-Heretic model has been released, featuring 35 billion parameters and achieving a refusal rate of 5/100 with a Kullback-Leibler Divergence (KLD) of 0.0020. It is available in both Safetensors and GGUF formats, and utilizes Heretic version 1.2.0, which has shown better performance in terms of KLD compared to the newer version. This model's reduced censorship and enhanced performance metrics make it significant for practitioners looking to implement more flexible and capable LLMs in their applications.

nex-n2 · agent · model
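For readers unfamiliar with the KLD figure quoted in this item, here is a minimal sketch of Kullback-Leibler divergence between two discrete distributions. It assumes the metric compares the next-token distributions of the original and the modified model, which the summary does not state; the two example distributions are made up for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same support."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original = [0.5, 0.5]  # hypothetical distribution from the original model
modified = [0.9, 0.1]  # hypothetical distribution from the modified model
print(kl_divergence(original, original))
print(kl_divergence(original, modified))
```

A value near zero, like the reported 0.0020, would indicate that the modified model's output distribution stays close to the original's.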

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Agents

Qwen-AgentWorld-35B-A3B for Coding?

The Qwen-AgentWorld-35B-A3B model has been benchmarked, achieving an overall score of 56.39 and notable performance in specific categories such as Search (36.69) and SWE (65.63). This model, part of the Qwen series, provides important insights for practitioners focused on coding tasks, as it demonstrates competitive capabilities in software engineering contexts compared to other models like Qwen3.5-397B-A17B. Understanding these benchmarks can guide developers in selecting appropriate models for their AI applications.

qwen · coding · model

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Coding

llama.cpp's web UI now supports executing model generated JavaScript in the browser, through Web Workers (opt in)

The recent update to llama.cpp's web UI adds an opt-in `run_javascript` tool that executes model-generated JavaScript in the browser using Web Workers. The code runs inside a sandboxed iframe for isolation; network requests are currently blocked, and the sandbox's limitations are not clearly documented. The feature lets practitioners use language models for lightweight code execution directly in the UI, potentially reducing the need for external tools.

llama.cpp · javascript

33 d ago · found 12 d ago · MIT Technology Review — AI · Research

The emergence of the web data infrastructure layer for AI

The article discusses the development of a web data infrastructure layer aimed at improving data accessibility and structure for AI applications. This infrastructure is crucial for enterprises seeking to utilize AI at scale, as it addresses the challenges posed by unstructured and blocked data that can hinder model performance. By enhancing data availability and organization, it enables practitioners to build more effective AI systems and leverage emerging use cases.

data-infrastructure · ai · web

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Models

How Baidu's newly released Unlimited-OCR transcribes dozens of pages in one forward pass

Baidu has released Unlimited-OCR, an end-to-end OCR model capable of transcribing dozens of pages in a single forward pass using a novel attention mechanism called Reference Sliding Window Attention (R-SWA). This model, based on DeepSeek-OCR, maintains a fixed visual context while allowing generated text to attend only to a sliding window of previous tokens, significantly reducing memory overhead. Benchmarks indicate a performance improvement to 93.92% on OmniDocBench v1.6 compared to DeepSeek-OCR's 87.01%, although independent validation is recommended before final assessments. The model is available under the MIT license on platforms like Hugging Face and ModelScope, making it a valuable tool for practitioners in the OCR domain.

ocr · unlimited-ocr · baidu
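The attention pattern described in this item can be sketched as a boolean mask. This is an illustration only: it assumes the sequence is laid out as visual (reference) tokens followed by generated text tokens, with causal attention throughout. The function name and layout are not taken from the Unlimited-OCR release.

```python
from typing import List


def rswa_mask(num_visual: int, num_text: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    n = num_visual + num_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: never look ahead
            if k < num_visual:
                mask[q][k] = True  # the visual reference stays fully visible
            elif q - k < window:
                mask[q][k] = True  # only the most recent text tokens
    return mask


mask = rswa_mask(num_visual=2, num_text=4, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each text row has at most `num_visual + window` allowed keys, so the number of text keys a token attends to stays bounded as the output grows. That is the source of the memory saving the summary mentions.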

33 d ago · found 12 d ago · The Decoder · Research

Pangram CEO says language models give themselves away by making the same arguments

Pangram CEO Max Spero highlights a limitation of current language models, noting that while they produce coherent text, their arguments tend to cluster around similar points rather than exhibiting the diverse reasoning found in human discourse. This homogeneity in argumentation may serve as a distinguishing factor for detecting AI-generated content. Understanding this characteristic is crucial for practitioners aiming to enhance the robustness and variability of AI outputs in applications requiring nuanced reasoning.

language_models · arguments · diversity

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Models

llama.cpp updates - granite-speech-4.1-2b, LFM2.5-ColBERT/Embedding-350M, Vulkan backend related changes & Misc items

The latest updates to llama.cpp include support for the granite-speech-4.1-2b model and the LFM2.5-ColBERT/Embedding-350M models. Significant enhancements to the Vulkan backend have been made, including support for 3D convolutions and various mathematical operations, which can improve performance in high-throughput scenarios. These updates are crucial for AI practitioners as they enhance model compatibility and computational efficiency, enabling better utilization of GPU resources.

llama · updates · granite-speech

33 d ago · found 12 d ago · The Decoder · Agents

Claude Tag embeds Anthropic's AI in Slack, already writes 65 percent of internal code, company says

Anthropic has released Claude Tag, an integration that allows teams to utilize its AI within Slack by tagging @Claude for task assignments. This tool reportedly generates 65% of the internal code for Anthropic's product team, highlighting its potential to enhance productivity and streamline coding workflows in collaborative environments. For practitioners, this integration demonstrates the increasing utility of AI in real-time coding assistance and team collaboration.

claude · slack · internal_code

33 d ago · found 12 d ago · The Decoder · RAG

Mistral's new OCR model beats competitors in 72 percent of blind test cases, company says

Mistral AI has announced the release of OCR 4, an optical character recognition model designed to extract text from various document formats, including PDFs, Word files, and PowerPoint presentations. The model reportedly outperforms competitors in 72% of blind test cases, indicating its superior accuracy and effectiveness in real-world applications. This advancement is significant for practitioners in AI and LLMs, as it enhances text extraction capabilities, which are critical for data processing and analysis in diverse applications.

ocr · mistral · competitors

33 d ago · found 12 d ago · MarkTechPost · Agents

Nous Research Adds /learn to Hermes Agent’s Skills System, Capturing Workflows as Slash Commands Without Hand-Writing SKILL.md

Nous Research has added the /learn command to the Hermes Agent skills system. It generates a standards-compliant SKILL.md automatically from sources such as local directories, document URLs, and past conversations. The live agent sources the content and writes the skill itself, with no manual input and no separate ingestion engine, which simplifies skill creation for practitioners developing with LLMs and reduces the chance of documentation errors.

hermes · skills_system · workflows

33 d ago · found 12 d ago · MarkTechPost · Coding

16 Best Generative AI Coding Tools in 2026 Compared: Features, and Best Fit

The article discusses the evolution of generative AI coding tools by 2026, highlighting their capabilities in full application generation and multi-agent build pipelines. It emphasizes the use of large language models trained on code, which can understand context and intent to produce functional software components with minimal manual input. This advancement is significant for practitioners as it streamlines the software development process, enabling faster and more efficient coding practices.

generative_ai · coding_tools

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Models

New EU model (Domyn) will be 400b.

The startup Domyn has announced the development of a new 400 billion parameter language model, building on their existing closed 260 billion parameter model, Domyn Large, aimed at enterprise applications, and a smaller 10 billion parameter model available on Hugging Face. This release signifies a significant increase in model size, potentially offering enhanced capabilities for complex language tasks. For AI practitioners, the availability of these models could provide new tools for developing applications that require advanced natural language understanding and generation.

domyn · model

33 d ago · found 12 d ago · MarkTechPost · Inference

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego's DFlash introduces a block diffusion model for speculative decoding, allowing for the drafting of whole token blocks in a single forward pass with key-value (KV) injection for conditioning. The model achieves a reported 6.08x lossless speedup on the Qwen3-8B model and up to 15x throughput on NVIDIA's Blackwell architecture, while supporting frameworks like SGLang, vLLM, and TensorRT-LLM. This advancement is significant for practitioners as it enhances decoding efficiency and throughput, which are critical for real-time applications in AI.

speculative_decoding · throughput · nvidia
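A minimal sketch of the generic draft-and-verify step behind "lossless" speculative decoding, assuming greedy decoding. It does not model DFlash's block diffusion drafter or its KV injection; the toy target model and all names are hypothetical. In a real system the target scores every drafted position in one forward pass, whereas this loop calls it position by position for clarity.

```python
from typing import Callable, List


def verify_block(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append: the accepted draft prefix plus one target token."""
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Whole block accepted: the target still contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Hypothetical target model: the next token is the last token plus one, mod 10."""
    return (tokens[-1] + 1) % 10


print(verify_block([1, 2], [3, 4, 9, 9], toy_target))
print(verify_block([1, 2], [3, 4], toy_target))
```

The output always matches what the target would have produced on its own, so a better drafter changes only how many tokens are accepted per step, not the result.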

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Models

Qwen-AgentWorld-397B-A17B

The Qwen-AgentWorld-397B-A17B model has been announced, expanding on the previously released Qwen-AgentWorld-35B-A3B. The post gives no technical details or benchmark results beyond what the name suggests (by the series' naming convention, roughly 397 billion total parameters with about 17 billion active per token). The release indicates ongoing development of the Qwen series, which may bring performance and capability gains for practitioners working with large language models.

qwen · model

33 d ago · found 12 d ago · OpenAI News · Inference

OpenAI and Broadcom unveil LLM-optimized inference chip

openai · broadcom · llm · chip

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Models

Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT

The Unlimited-OCR model, a 3.3 billion parameter multilingual OCR system, has been released on ModelScope, enabling one-shot parsing across single images, multi-page documents, and PDFs. It supports full-document parsing with a maximum output length of 32K tokens, offers base and gundam image modes for various document layouts, and utilizes Transformers inference with SGLang for OpenAI-compatible streaming requests. This model enhances capabilities in document parsing, making it significant for developers focusing on advanced OCR applications.

ocr · multilingual · modelscope

33 d ago · found 12 d ago · Reddit r/LocalLLaMA · Agents

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments

Qwen has released the Qwen-AgentWorld-35B-A3B, a 35 billion parameter mixture of experts (MoE) model that activates approximately 3 billion parameters per token. This model is designed to simulate various environments, including MCP, terminal, software engineering, Android, web, and OS interactions, by predicting the next state based on an agent's actions. It is particularly relevant for practitioners focused on agent training, offline evaluation, and the development of synthetic environments for tool-use workflows.

qwen · model · agent

33 d ago · found 10 d ago · arXiv cs.AI · Agents

E-MRL: Cross-view Aligned Evidence-driven Multimodal Reinforcement Learning for Reliable 3D Tumor Analysis

The article introduces E-MRL (Evidence-driven Multimodal Reinforcement Learning), a novel framework designed to enhance 3D tumor analysis by addressing visual hallucinations in Vision-Language Models. E-MRL operates as a Markov Decision Process focusing on "diagnosis-localization-verification" and incorporates a cross-view consistency reward to ensure semantic alignment between diagnostic reports and visual evidence from 3D CT data. Experimental results on large-scale datasets show that E-MRL outperforms traditional Supervised Fine-Tuning and Reinforcement Learning approaches, improving diagnostic accuracy and reliability for practitioners in medical imaging and AI-driven diagnostics.

reinforcement_learning · multimodal · medical

33 d ago · found 10 d ago · arXiv cs.AI · Training

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

The paper introduces "spec learning," a framework designed to align large language models (LLMs) with user preferences at inference time without requiring parameter updates. It utilizes brief user instructions and a small set of preference judgments to create natural-language prompts that condition LLM behavior, demonstrating improved performance over direct preference optimization (DPO) on specialized datasets. This approach enhances interpretability and transparency in model responses, making it a valuable tool for practitioners seeking efficient and effective model steering methods.

spec_learning · llm

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Paying to Know: Micro-Transaction Markets for Verified Product Information in Agentic E-Commerce

The article proposes a new model for e-commerce that leverages micro-transaction markets for verified product information, shifting the focus from product matching to acquiring trustworthy data. It outlines an architecture for this system where autonomous buyer agents can pay small amounts to access detailed product information, such as service histories and test reports, thus promoting genuine product quality and competitive pricing. The authors highlight key NLP challenges that arise from this model, including cost-optimal information acquisition and privacy-preserving persona modeling, suggesting these areas warrant further research and development in the field.

e-commerce · agents · micro-transactions

33 d ago · found 10 d ago · arXiv cs.AI · Open Source

Decentralised AI Training and Inference with BlockTrain

The article introduces Spheroid BlockTrain, a decentralized AI training protocol that partitions models into independently trainable blocks, optimizing local objectives derived from a global target. BlockTrain achieves a cross-entropy of 1.359 on byte-level WikiText, closely matching a full Transformer reference, while allowing each worker to focus on a single block, thus minimizing resource requirements. This approach enhances scalability and efficiency for AI practitioners by enabling decentralized training and inference across multiple hosts, significantly improving performance metrics like perplexity and reducing the need for centralized infrastructure.

decentralized training · blockchain · ai systems

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Maestro Order: A Model-Agnostic Orchestration Harness

Maestro Order is introduced as a model-agnostic orchestration harness designed to enhance the reliability of language models by integrating four structural primitives: decompose, ensemble, verify, and recurse, alongside a budget-aware controller for compute allocation. The architecture operates by treating models as black-box solvers and employs a verifier ensemble to improve reliability, achieving significant improvements in reliability metrics (e.g., from 0.55 to 0.999) through strategic verification and voting mechanisms. This framework is crucial for practitioners as it provides a systematic approach to mitigate hallucinations in AI systems, optimizing resource usage while ensuring high reliability in problem-solving tasks.

orchestration · model_agnostic
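A rough illustration of why ensembling can raise reliability, assuming independent black-box solvers that are each correct with probability p and a plain majority vote. This is not the paper's controller or verifier ensemble, and it makes no attempt to reproduce the 0.999 figure. The function name is illustrative.

```python
from math import comb


def majority_vote_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent solvers is correct.

    Assumes n is odd, so ties cannot occur.
    """
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 3, 11, 101):
    print(n, round(majority_vote_reliability(0.55, n), 4))
```

With p = 0.55 the gain from voting alone is slow. That fits the summary's description, which credits verification alongside voting for the improvement.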

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Subjective-Graph LLM Agents for Simulating Uncertainty in Classroom Social Perception

The article presents a framework for multi-agent LLMs utilizing subjective graphs to simulate uncertainty in social perception within educational settings. The agents employ individualized graphs to manage peer visibility and communication, updating Gaussian belief states through Bayesian fusion. Evaluated across 12 middle-school classrooms, the framework demonstrated a significant increase in collective ranking error, indicating persistent distortions in perceived academic standing, and outperformed traditional DeGroot configurations in maintaining opinion diversity, highlighting its relevance for practitioners interested in modeling social dynamics in AI systems.

social-perception · uncertainty
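The "Bayesian fusion" of Gaussian belief states mentioned in this item is sketched below using the standard precision-weighted rule for combining two independent Gaussian estimates of the same quantity. How the paper wires this into its subjective graphs is not stated in the summary, so the function name and the example numbers are illustrative only.

```python
from typing import Tuple


def fuse_gaussians(
    mean_a: float, var_a: float, mean_b: float, var_b: float
) -> Tuple[float, float]:
    """Fuse two independent Gaussian beliefs about the same quantity."""
    # Precisions (inverse variances) add; the mean is precision-weighted.
    precision = 1.0 / var_a + 1.0 / var_b
    mean = (mean_a / var_a + mean_b / var_b) / precision
    return mean, 1.0 / precision


# Prior belief about a peer's standing, then a noisy observation of it.
fused_mean, fused_var = fuse_gaussians(0.0, 1.0, 2.0, 1.0)
print(fused_mean, fused_var)
```

The fused variance is always smaller than either input variance, so each observation an agent can see sharpens its belief. An agent that cannot see a peer keeps its wider prior.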

33 d ago · found 10 d ago · arXiv cs.AI · Training

Minimisation of Quasar-Convex Functions Using Random Zeroth-Order Oracles

This paper presents a random Gaussian smoothing zeroth-order (ZO) algorithm for minimizing quasar-convex (QC) and strongly quasar-convex (SQC) functions, establishing convergence and complexity bounds for both unconstrained and constrained scenarios. It introduces the concept of proximal-quasar-convexity for constrained optimization and shows that the algorithm can converge to a controlled neighborhood of the global minimum. These findings have practical implications for machine learning applications, particularly in areas like linear dynamical system identification and generalized linear models, where quasar-convexity is relevant.

optimization · quasar-convexity · machine_learning
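A minimal sketch of a random Gaussian-smoothing zeroth-order step, assuming the common two-point estimator g = ((f(x + mu*u) - f(x)) / mu) * u with u drawn from a standard normal. The step size, smoothing radius, iteration count, and test function are illustrative choices, and the paper's constrained (proximal) variant is not shown.

```python
import random
from typing import Callable, List

Vector = List[float]


def zo_gradient(f: Callable[[Vector], float], x: Vector, mu: float,
                rng: random.Random) -> Vector:
    """Two-point gradient estimate that uses function values only."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_shifted = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_shifted) - f(x)) / mu
    return [scale * ui for ui in u]


def zo_minimize(f: Callable[[Vector], float], x0: Vector, step: float = 0.05,
                mu: float = 1e-3, iters: int = 2000, seed: int = 0) -> Vector:
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    x = list(x0)
    for _ in range(iters):
        g = zo_gradient(f, x, mu, rng)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def sphere(x: Vector) -> float:
    """Convex test function; convex functions are quasar-convex."""
    return sum(xi * xi for xi in x)


x_min = zo_minimize(sphere, [1.0, -1.0])
print(sphere(x_min))
```

The iterates settle into a small neighbourhood of the minimiser whose size depends on `mu` and `step`, rather than reaching it exactly. This matches the "controlled neighborhood" wording in the summary.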

33 d ago · found 10 d ago · arXiv cs.AI · Inference

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

The paper introduces Cavewoman, a two-channel evaluation protocol for assessing the effects of linguistic input and output compression on large language models (LLMs). It evaluates eight models across five datasets, revealing that output compression can reduce inference costs by 1.4-2.4x, while input compression typically increases costs by 1.15x on average, leading to longer, less accurate responses. This research highlights the importance of carefully managing compression strategies in LLM applications, as input compression may degrade performance and inflate operational expenses.

compression · llm · cost reduction

33 d ago · found 10 d ago · arXiv cs.AI · Research

Invariant Graph Representations for Continuous-Time Dynamic Graphs Under Distribution Shifts

The article presents CIR, a novel framework for learning invariant representations in Continuous-Time Dynamic Graphs (CTDGs) under out-of-distribution (OOD) shifts, utilizing a structural causal model called ICCM. It incorporates the Normalized Weighted Geometric Mean (NWGM) for efficient interventional predictions and employs a deep learning architecture with subgraph extractors and an environment memory bank to handle distributional shifts. This advancement is significant for practitioners as it enhances the robustness and applicability of CTDG models in dynamic environments, addressing limitations of existing methods in OOD scenarios.

dynamic_graphs · OOD · representation_learning

33 d ago · found 10 d ago · arXiv cs.AI · Safety

Grad Detect: Gradient-Based Hallucination Detection in LLMs

Grad Detect is a novel gradient-based method for detecting hallucinations in Large Language Models (LLMs) by analyzing layer-wise gradient patterns during inference. It leverages the internal gradient structure, which reveals information about output correctness that is not available from output-level signals, and demonstrates superior performance over existing confidence-based and sampling-based methods on various Q&A benchmarks. The approach emphasizes the final five layers of the model, which contain over 97% of the relevant gradient signal, facilitating efficient implementation with minimal performance degradation, thus enhancing the reliability of LLMs in critical applications.

llm · hallucination · detection

33 d ago · found 10 d ago · arXiv cs.AI · Research

Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation

The paper presents a novel approach to sensor co-design for autonomous driving segmentation, emphasizing the importance of optimizing camera measurements rather than solely relying on larger models. It introduces a differentiable RAW-to-task pipeline that learns optimal spectral colour-filter-array (CFA) weights, achieving improvements in mean Intersection over Union (mIoU) by +0.017 on the KITTI-360 dataset and +0.023 on ACDC, while demonstrating that co-designing optics leads to negative outcomes. This work is significant for practitioners as it highlights the potential for sensor-level optimizations to enhance model performance in diverse environmental conditions, independent of the downstream model architecture.

autonomous driving · sensor co-design · segmentation

33 d ago · found 10 d ago · arXiv cs.AI · Research

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

CALIBER (Calibration Before and After Reasoning) is a novel approach that improves confidence estimation in reasoning language models by distinguishing between pre- and post-answer confidence assessments. It reduces Expected Calibration Error (ECE) by 52.5% on the BigMathDigits dataset for a 7B model and achieves competitive results on a larger 30B model, demonstrating significant improvements in calibration, Brier score, and AUROC across various benchmarks, particularly under distribution shifts. This method is crucial for practitioners as it enhances the reliability of model outputs, especially in complex reasoning tasks, by aligning confidence estimates with the model's state of information.

confidence · reasoning · llm

33 d ago · found 10 d ago · arXiv cs.AI · Coding

Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

The article introduces Rule2Text, a framework that utilizes large language models (LLMs) to generate natural language explanations for complex logical rules derived from knowledge graphs (KGs). Extensive experiments were conducted using datasets like Freebase variants and ogbl-biokg, employing models such as Gemini 2.0 Flash and the open-source Zephyr model, which was fine-tuned for improved explanation quality. This framework enhances KG usability by providing interpretable outputs, making it valuable for practitioners aiming to improve human understanding of KGs through LLM-generated explanations.

knowledge_graphs · explanations · LLM

33 d ago · found 10 d ago · arXiv cs.AI · Research

Cost-Optimal Decision Diagrams for Stochastic Boolean Function Evaluation

The paper presents a novel branch-and-bound algorithm for the cost-optimal evaluation of stochastic Boolean functions, addressing the challenge of minimizing expected evaluation costs under variable costs and probabilistic truth assignments. This marks the first practical exact algorithm capable of handling such generality, with experimental results demonstrating its scalability and efficiency, alongside a greedy beam-search variant. The findings are significant for practitioners as they provide a new method for decision-making processes in AI applications where cost and efficiency are critical, particularly in domains like medical diagnosis.

decision diagrams · stochastic functions

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models

The article introduces TheProfessor, a multi-teacher unsupervised prompt distillation method for compressing vision-language models (VLMs) with a two-teacher ensemble. The teachers are a domain-finetuned PromptSRC ViT-L/14 and a zero-shot EVA-CLIP-L/14, and confidence-weighted ensembling of the two raises the average harmonic mean (HM) from 87.52 to 89.28 across the evaluated datasets. This advancement is particularly relevant for practitioners as it highlights the effectiveness of multi-teacher strategies in adapting models to domain shifts, potentially leading to better generalization in real-world applications.

prompt_distillation · vision_language_models

33 d ago · found 10 d ago · arXiv cs.AI · Safety

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

AdversaBench is introduced as an automated red-teaming pipeline for large language models (LLMs), utilizing a combination of five structured mutation operators and a three-judge confirmation process to evaluate model failures. Experiments with 45 seed prompts across reasoning, instruction-following, and tool use categories revealed that the effectiveness of mutation operators varies significantly, with instruction-following prompts requiring more iterations to achieve failure. Notably, adversarial prompts generated for Llama 3.1 8B demonstrated zero-shot transferability to Llama 3.3 70B, indicating that the identified vulnerabilities may reflect general behavioral patterns rather than specific model weaknesses, which is critical for practitioners aiming to enhance LLM robustness.

llm · red-teaming · adversarial

33 d ago · found 10 d ago · arXiv cs.AI · Training

Adaptive Machine Learning Framework for UAV Trajectory Optimization in O-RAN

The article presents an adaptive machine learning framework for optimizing UAV trajectories within the O-RAN architecture, leveraging continual transfer learning. This framework utilizes a library of pre-trained models and a model selection mechanism to enhance efficiency and minimize adaptation time in dynamic environments, achieving a 44% to 56% reduction in convergence time compared to traditional retraining methods. The integration of real-world city maps and ray tracing techniques not only improves learning reliability but also enhances trajectory planning, which is crucial for practitioners developing UAV applications in 6G networks.

uav · trajectory-optimization · transfer-learning

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models

The G$^3$VLA model introduces a camera-aware geometric module for Vision-Language-Action (VLA) systems, enhancing visual-token processing by incorporating calibrated geometric information without modifying the action space. It employs intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion, achieving significant performance improvements across various benchmark suites, including LIBERO and RoboCasa24, particularly in spatially sensitive tasks. This development is crucial for practitioners as it addresses the limitations of traditional VLA models in multi-camera environments, enabling more accurate robot manipulation through better alignment of visual information with physical geometry.

vla · robotics · geometry

33 d ago · found 10 d ago · arXiv cs.AI · Research

A Survey on Federated Causal Discovery and Inference

The paper presents a comprehensive survey on Federated Causal Discovery (FCD) and Federated Causal Inference (FCI), addressing the challenges of conducting causal analysis with distributed data while adhering to privacy regulations. It organizes FCD methods based on how structures are learned, data partitioned, and the structural knowledge obtained, and categorizes FCI methods by target estimand and estimation strategy, including classical and deep generative approaches. This work is significant as it formalizes the relationship between FCD and FCI, proposing a unified pipeline that enhances causal reasoning in federated settings while identifying key areas for future research, such as privacy and communication efficiency.

federated learning · causal discovery · inference

33 d ago · found 10 d ago · arXiv cs.AI · Agents

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

The study evaluates whether exact-match retrieval recall is a good measure of policy utility in long-horizon tool-use agents, using Qwen2.5-3B/7B classifiers within the tau-bench framework. It shows that a compact structured state improves macro-F1 by 0.13-0.17, while classification performance with retrieved policy clauses does not differ significantly from performance with gold clauses, suggesting that exact-match recall may misrepresent the utility of retrieved policies. This finding emphasizes the need for practitioners to evaluate retrieved policies inside the classification loop rather than depending solely on recall metrics to judge retriever performance.

policy_signal · tool_use

33 d ago · found 10 d ago · arXiv cs.AI · Research

Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery

The paper introduces PC-MCMC-CIGP, a gray-box workflow that integrates spike-and-slab topology sampling with Chemical-Informed Gaussian Processes (CIGP) for enhancing reaction network discovery from sparse chemical data. It demonstrates improved parameter calibration and experimental design, achieving a 12.5% increase in yield on styrene epoxidation compared to a Gaussian Process Bayesian Optimization baseline. This approach is significant for practitioners as it effectively combines MCMC and GP methods under physical constraints, optimizing decision-making in experimental setups while addressing uncertainty in chemical reactions.

mcmc · gaussian_processes · chemical_reaction

33 d ago · found 10 d ago · arXiv cs.AI · Coding

VeriPilot: An LLM-Powered Verilog Debugging Framework

VeriPilot is a newly proposed LLM-powered framework designed to enhance Verilog debugging by utilizing golden reference models for effective bug localization and repair. It employs Control-Data-Flow Graphs (CDFGs) derived from static analysis to facilitate step-by-step signal tracing, significantly improving the bug repair success rate of GPT-4o from 54.3% to 85.71% on the Comprehensive Verilog Design Problems (CVDP) benchmark. This advancement addresses the challenge of tracing long dependency chains in complex codebases, making it a valuable tool for practitioners in digital circuit design.

llm · verilog · debugging

33 d ago · found 10 d ago · arXiv cs.AI · Safety

A global log for medical AI

The article introduces MedLog, a protocol designed for event-level logging of medical AI interactions, addressing the lack of standardized logging in the medical AI landscape. MedLog captures nine core fields for each event, including model, inputs, outputs, and outcomes, and has been applied in various deployments such as ICU deterioration prediction and automated sepsis quality reporting. This approach enables better tracking of AI performance, facilitates the detection of biases and adverse events, and supports continuous monitoring and improvement, which is crucial for practitioners deploying AI in healthcare environments.

medical-ai · logging

33 d ago · found 10 d ago · arXiv cs.AI · Research

Can Scale Save Us From Plasticity Loss in Large Language Models?

The study investigates plasticity loss in GPT-style Transformer models, focusing on their ability to adapt to new information after prior learning. The analysis covers models from 5M to 314M parameters and finds that plasticity loss occurs even in the larger architectures and follows a sublinear scaling law with model size. These findings indicate that while larger models may mitigate the effects of plasticity loss, increasing parameter count alone is insufficient to prevent it, which affects the design of continual learning systems in natural language processing.

plasticity loss · llm · continual learning

33 d ago · found 10 d ago · arXiv cs.AI · RAG

MMed-Bench-IR: A Heterogeneous Benchmark for Multilingual Medical Information Retrieval

MMed-Bench-IR is a newly introduced benchmark for multilingual medical information retrieval, addressing the need for cross-lingual alignment, concept discrimination, and evidence retrieval across six languages. It comprises three tasks: cross-lingual medical QA retrieval with 6,127 queries based on the Unified Medical Language System, concept discrimination using 4,975 confusion sets, and multilingual evidence retrieval with 2,040 queries, all designed without overlap to accurately assess capabilities. The benchmark highlights significant performance gaps in biomedical encoders, with nDCG@10 scores dropping from 0.818 in English to 0.056 in Japanese, underscoring the limitations of existing English-only benchmarks for evaluating multilingual systems in clinical contexts.
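
As context for the nDCG@10 figures, the metric can be sketched as below. This is a minimal illustration under stated assumptions, not the benchmark's scoring code: it uses linear gain (the relevance grade itself rather than 2^rel - 1), and it assumes the input list holds the relevance grades of all judged documents for the query in ranked order, so that sorting it yields the ideal ranking. The function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(
        rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal
```

Placing the only relevant document at rank 2 instead of rank 1 gives `ndcg_at_k([0, 1])`, which is 1 / log2(3), roughly 0.63. A per-query average near 0.056 therefore means relevant documents rarely appear anywhere near the top of the ranking.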

multilingual retrieval · benchmark · medical information

33 d ago · found 10 d ago · arXiv cs.AI · Research

RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting

The Regime-Aware Variable-context Expert Network (RAVEN) has been introduced as a Mixture-of-Experts framework specifically designed for financial time series forecasting. RAVEN adapts its temporal context dynamically rather than relying on a fixed look-back window, utilizing a Cumulative Importance Thresholding mechanism to create nested context windows and incorporating a Global Compressed Representation for enhanced temporal coherence. Experimental results indicate that RAVEN outperforms state-of-the-art models, achieving significant improvements in Pearson correlation and mean squared error across multiple financial datasets, which is crucial for practitioners dealing with non-stationary financial data.

financial_forecasting · time_series

33 d ago · found 10 d ago · arXiv cs.AI · Research

Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate

The paper presents a critical analysis of vanilla conditional diffusion models in the context of compositional generation, arguing that they struggle to extrapolate to target distributions defined by combinations of source distributions. The authors provide theoretical insights and experimental evidence indicating that score estimation errors significantly hinder performance, particularly when dealing with out-of-distribution targets, thus suggesting the necessity for alternative methodologies. This work is relevant for AI practitioners as it highlights the limitations of current diffusion models and the need for improved approaches in generative tasks involving unseen combinations of data.

compositional_generation · diffusion_models

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

This article presents a reinforcement learning approach for optimizing real-time event triggering at the Large Hadron Collider (LHC), addressing the limitations of static, hand-tuned trigger menus. The authors adapt Group-Filtered Policy Optimization (GFPO) for streaming control, achieving significant improvements in signal efficiency and in-tolerance rates for both total transverse energy and anomaly-detection triggers, with gains of up to 56% in real collision data without fine-tuning. This work is significant as it demonstrates the first application of RL for trigger control in real LHC data, potentially enhancing the efficiency of data collection in high-energy physics experiments.

reinforcement_learning · large_hadron_collider

33 d ago · found 10 d ago · arXiv cs.AI · Research

Grounded Chess Reasoning in Language Models via Master Distillation

The paper introduces a framework called Master Distillation for enhancing language models' reasoning capabilities in specialized domains, exemplified by a 4B parameter model named C1 applied to chess. C1 achieved 48.1% accuracy, outperforming existing open-source and proprietary models, while generating explanations with significantly fewer tokens than baseline methods. This approach captures the full reasoning process of expert systems, enabling compact models to produce transparent, explainable solutions, which is crucial for practitioners seeking to integrate grounded reasoning in AI applications.

grounded-reasoning · llm · chess

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

UniDrive: A Unified Vision-Language and Grounding Framework for Interpretable Risk Understanding in Autonomous Driving

UniDrive is a novel unified vision-language and grounding framework designed for interpretable risk understanding in autonomous driving, addressing the limitations of existing multimodal large language models (MLLMs) in temporal reasoning and spatial precision. The architecture features a dual-branch system: a temporal reasoning branch for multi-frame scene dynamics and a high-resolution perception branch for fine-grained spatial details, integrated via a gated cross-attention fusion module. Benchmark results on the DRAMA-Reasoning dataset indicate that UniDrive surpasses image-based and video-based baselines in risk-object localization and interpretability, highlighting its potential for enhancing safety in autonomous driving systems.

autonomous-driving · risk-understanding · vision-language

33 d ago · found 10 d ago · arXiv cs.AI · Agents

From Task-Guided Conversational Graphs to Goal-Oriented Dialogue Runtimes

The paper introduces the Goal-Oriented Dialogue Runtime (GODR), a conceptual framework designed to enhance conversational continuity in complex, multi-domain interactions involving interdependent objectives. GODR treats goals, task frames, and lifecycle states as first-class runtime objects, enabling better management of suspended, resumed, or invalidated goals, and is intended to work alongside existing orchestration frameworks rather than replace them. This framework is significant for practitioners as it addresses the challenges of maintaining conversational coherence in sophisticated dialogue systems, paving the way for more robust multi-agent interactions.

dialogue_systems · goal_oriented · llm

33 d ago · found 10 d ago · arXiv cs.AI · Research

Multimedia and Visual Analytics in the Agentic Era

The paper presents a framework aimed at integrating multimedia and visual analytics to enhance actionable insights for professional users handling large multimedia collections. It highlights the need for improved accuracy, trustworthiness, and reasoning capabilities in foundation models and AI agents, suggesting a shift from purely algorithmic improvements to comprehensive multimedia analytics systems. This approach is significant for practitioners as it emphasizes the importance of user-centric design in AI tools, facilitating better collaboration between humans and AI in complex analytical tasks.

multimedia · visual_analytics · AI_agents

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

The paper introduces Head-Wise Representation Alignment (HeRA), a novel technique for improving Multimodal Large Language Models (MLLMs) by enforcing alignment at the individual attention head level rather than a fixed layer. HeRA utilizes the Mutual K-Nearest Neighbor (MKNN) alignment metric and a contrastive objective to enhance cross-modal representation alignment, leading to improved performance on vision-centric tasks and reducing visual hallucinations. This method is significant for practitioners as it offers a more granular approach to multimodal training, potentially leading to better model robustness and accuracy in vision-related applications.
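
To make the alignment metric concrete, here is a minimal sketch of one common form of mutual k-nearest-neighbor alignment: for each sample, find its k nearest neighbors in each representation space and measure how much the two neighbor sets overlap. The paper's exact MKNN formulation is not given in this summary, so the use of Euclidean distance, the function names, and the averaging scheme are all assumptions for illustration.

```python
import math
from typing import Sequence, Set

Vector = Sequence[float]


def knn_indices(feats: Sequence[Vector], i: int, k: int) -> Set[int]:
    """Indices of the k samples closest to sample i (Euclidean, excluding i)."""
    dists = sorted(
        (math.dist(feats[i], feats[j]), j) for j in range(len(feats)) if j != i
    )
    return {j for _, j in dists[:k]}


def mutual_knn_alignment(
    feats_a: Sequence[Vector], feats_b: Sequence[Vector], k: int
) -> float:
    """Mean fraction of shared k-nearest neighbors across the two spaces."""
    n = len(feats_a)
    if n != len(feats_b):
        raise ValueError("both feature sets must describe the same samples")
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of samples")
    overlap = sum(
        len(knn_indices(feats_a, i, k) & knn_indices(feats_b, i, k))
        for i in range(n)
    )
    return overlap / (n * k)
```

A score of 1.0 means every sample has identical neighbors in both spaces, and 0.0 means no neighbors are shared. In a head-wise setting, `feats_a` might hold one attention head's features and `feats_b` the vision encoder's features for the same inputs, though that pairing is an assumption here.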

mllm · representation_alignment · transformers

33 d ago · found 10 d ago · arXiv cs.AI · Training

Variational Model Merging for Pareto Front Estimation in Multitask Finetuning

The article introduces a new Bayesian approach called Variational Model Merging, aimed at enhancing the quality of Pareto front estimates in multitask finetuning by using flexible non-Gaussian posteriors. This method builds on existing model-merging techniques and demonstrates that utilizing more complex posterior distributions leads to superior estimates of Pareto fronts, validated through empirical results on vision and language transformers. This advancement is significant for practitioners as it provides a more efficient way to determine optimal task-mixing strategies, potentially reducing computational costs associated with Pareto front estimation.

finetuning · pareto_fronts · model_merging

33 d ago · found 10 d ago · arXiv cs.AI · Training

Impatient Bandits: Optimizing for the Long-Term Without Delay

The paper presents a novel approach to optimizing recommender systems for long-term user satisfaction by addressing the challenge of delayed rewards in a bandit framework. It introduces a predictive model that integrates historical data to estimate delayed rewards and a bandit algorithm that leverages this model to identify content that promotes sustained user engagement. The proposed method shows significant improvements over traditional short-term and delayed reward optimization strategies in a large-scale podcast recommendation system, highlighting its practical applicability for enhancing user experience in real-world applications.

bandits · long_term · recommender_systems

33 d ago · found 10 d ago · arXiv cs.AI · Research

Token Complexity of Certifying Stochastic-Oracle Reliability

The paper introduces a framework for certifying the reliability of stochastic oracles, defining "certification token complexity" as the minimum expected token cost to distinguish between oracles that meet a specified reliability level and those that do not. It presents a Sequential Probability Ratio Test (SPRT)-based Stochastic-Oracle Turing Machine (SOTM) that effectively queries oracles and computes correctness scores while ensuring two-sided error guarantees. This work is significant for practitioners as it provides theoretical bounds on token complexity, informing the design of efficient certification processes in AI systems that rely on stochastic oracles.
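
The Sequential Probability Ratio Test at the core of this approach can be sketched in its classical Bernoulli form. This is Wald's textbook test, not the paper's Stochastic-Oracle Turing Machine: it counts queries rather than tokens, treats each oracle answer as simply correct or incorrect, and the function name and the reliability levels `p0` and `p1` are hypothetical.

```python
import math
from typing import Iterable, Tuple


def sprt_bernoulli(
    outcomes: Iterable[bool],
    p0: float,            # success rate of an oracle deemed unreliable
    p1: float,            # success rate of an oracle deemed reliable
    alpha: float = 0.05,  # bound on wrongly certifying an unreliable oracle
    beta: float = 0.05,   # bound on wrongly rejecting a reliable oracle
) -> Tuple[str, int]:
    """Return (decision, queries used) for a stream of correct/incorrect answers."""
    if not 0 < p0 < p1 < 1:
        raise ValueError("need 0 < p0 < p1 < 1")
    upper = math.log((1 - beta) / alpha)  # cross it: accept 'reliable'
    lower = math.log(beta / (1 - alpha))  # cross it: accept 'unreliable'
    gain = math.log(p1 / p0)              # log-likelihood step for a correct answer
    loss = math.log((1 - p1) / (1 - p0))  # step for an incorrect answer (negative)

    llr = 0.0
    used = 0
    for ok in outcomes:
        used += 1
        llr += gain if ok else loss
        if llr >= upper:
            return "reliable", used
        if llr <= lower:
            return "unreliable", used
    return "undecided", used  # stream ended before either threshold was hit
```

The test stops as soon as the evidence is decisive, which is why the expected cost of certification, rather than a fixed sample size, is the natural quantity to bound. With `p0=0.5` and `p1=0.9`, two wrong answers in a row are already enough to reject, while six consecutive correct answers are needed to certify.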

stochastic_oracle · token_complexity

33 d ago · found 10 d ago · arXiv cs.AI · Training

Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

This study introduces the Match Task to Objective (MTO) framework, which optimizes encoder-decoder pre-trained language models by aligning pre-training objectives with specific tasks, enhancing performance in generation and question answering tasks, particularly in commonsense knowledge retrieval. The framework employs automated methods for unsupervised data preparation and novel fine-tuning templates, achieving over 120% performance improvement in few-shot settings compared to conventional methods. The findings provide critical insights for practitioners on model customization and prompt-tuning strategies, with the accompanying code available for implementation.

fine-tuning · prompt-tuning · language models

33 d ago · found 10 d ago · arXiv cs.AI · Safety

Selective Capability Unlearning in End-to-End Spoken Language Understanding

The article introduces a novel framework called Binding Subspace (BSU) for selective capability unlearning in end-to-end spoken language understanding (SLU) systems. BSU addresses the issue of capability persistence, where autoregressive models fail to fully suppress specific intents due to their conditional mapping behavior, by isolating and attenuating intent-conditioned representations. This approach significantly reduces the recoverability of suppressed intents while maintaining performance on SLU benchmarks, which is crucial for practitioners needing to comply with safety and policy constraints in deployed systems.

spoken_language_understanding · capability_unlearning

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Reward-Centered ReST-MCTS: A Robust Decision-Making Framework for Robotic Manipulation in High Uncertainty Environments

The paper introduces Reward-Centered ReST-MCTS, a decision-making framework designed to enhance Monte Carlo tree search (MCTS) for robotic manipulation in uncertain environments. It decomposes feedback into multiple channels—rule, heuristic, neural, and value estimation—allowing for improved search bias and robustness against challenges such as sparse rewards and noisy transitions. This framework is significant for AI practitioners as it provides a structured approach to improving decision-making in high-uncertainty scenarios without necessitating a fully differentiable policy.

robotics · decision_making · MCTS

33 d ago · found 10 d ago · arXiv cs.AI · Models

ScaleToT: Generalizing Structured LLM Reasoning for Billion-Scale Low-Activity User Modeling

ScaleToT is a new model designed to enhance user modeling for billions of low-activity users by leveraging structured reasoning techniques. It employs a bounded entropy-guided Tree-of-Thought (ToT) refinement to create typed user-state chains from a small LLM-processed subset, which are then used to train a lightweight profile encoder via supervised fine-tuning and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). This approach significantly reduces computational costs associated with LLM inference while improving lifetime value (LTV) prediction, as demonstrated by a 6.738% increase in LT30 during a billion-scale advertising deployment.

llm · user modeling · structured reasoning

33 d ago · found 10 d ago · arXiv cs.AI · Training

Task Decomposition for Efficient Annotation

The article introduces a method for task decomposition in structured annotation to enhance efficiency and reduce the inferential load on annotators. It presents a formal model based on centering theory to identify salient anchor entities, allowing for the effective breakdown of complex annotation tasks into manageable sub-tasks. This approach not only improves cost-efficiency but also optimizes the allocation of sub-tasks among heterogeneous annotators, which is crucial for practitioners aiming to streamline annotation processes in large-scale AI projects.

annotation · structured-data · efficiency

33 d ago · found 12 d ago · arXiv cs.AI · Inference

CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

CompressKV is a newly proposed framework for compressing key-value (KV) caches in long-context large language models (LLMs), specifically targeting GQA-based architectures. It introduces the concept of Semantic Retrieval Heads (SRHs) to selectively retain critical tokens based on their semantic importance, significantly improving resource efficiency. In experiments, CompressKV maintained over 97% of full-cache performance using only 3% of the KV cache on LongBench and achieved 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack, highlighting its potential for optimizing memory usage in LLM inference.

kv-cache · llm · compression

33 d ago · found 12 d ago · arXiv cs.AI · Research

On the Smallness of the Large Language Models Scaling Exponents

The article discusses the implications of the small scaling exponents observed in Large Language Models (LLMs), arguing that they point to an unsustainable energy consumption regime. It critiques the notion that the smallness of these exponents is merely a numerical bias related to the "pedestal effect" and emphasizes that this explanation does not resolve the sustainability concerns. Additionally, it explores how data characteristics, such as smoothness and roughness, influence scaling exponents, drawing parallels with fluid turbulence models, which may inform future model design and efficiency considerations for practitioners.

scaling · llm · sustainability

33 d ago · found 10 d ago · arXiv cs.AI · Research

Structural Kolmogorov-Arnold Convolutions: Learnable Function on the Values or the Filter Shape as Parameter-Efficient Alternative to Per-Edge Convolutional KANs

The article presents Structural Kolmogorov-Arnold Convolutions, a parameter-efficient alternative to per-edge convolutional Kolmogorov-Arnold Networks (KANs) that places learnable functions in the convolution structure rather than on each edge. Three models are studied: SV-KAN, AG-KAN, and RF-KAN, with RF-KAN achieving 88.47% accuracy on CIFAR-10 using approximately 0.4M parameters, outperforming traditional convolutional methods and demonstrating the importance of content-adaptive filter shapes. This work highlights a significant reduction in parameters while maintaining high performance, making it relevant for practitioners seeking efficient architectures in deep learning.

convolutional networks · parameter-efficient

33 d ago · found 12 d ago · arXiv cs.AI · Research

Towards Federated Long-Tailed Graph Learning: An Energy-Guided Dual Decoupling Approach

The paper introduces FedEPD, a novel framework for Federated Graph Learning that addresses the challenges posed by long-tailed data distributions. It employs a dual decoupling approach to separate topological purification from semantic recalibration, utilizing distribution-aware Dirichlet energy pruning and a two-stage alternating optimization strategy. FedEPD achieves state-of-the-art performance, with improvements of up to 4.97% in accuracy and 5.48% in Macro-F1 across various long-tailed benchmarks, making it significant for practitioners dealing with imbalanced data in collaborative environments.

federated learning · graph learning · long-tailed

33 d ago · found 12 d ago · arXiv cs.AI · Research

Cycle-Consistent Neural Explanation of Formal Verification Certificates

The paper introduces a cycle-consistent neural architecture designed to generate natural language explanations for formal verification certificates, comprising a forward network (NN1) and an inverse network (NN2) that together ensure faithful reconstruction of the certificates. Evaluated on 420 test certificates from various verification methods, the model achieves 90.0% cycle-verified soundness, significantly outperforming a multi-LLM few-shot baseline by 13.9 percentage points, while also providing 860x faster inference times and deterministic outputs. This advancement is crucial for practitioners as it enables efficient, offline explanations of verification results, enhancing accessibility for non-specialists without the reliance on cloud-based systems.

formal-verification · neural-networks · explanations

33 d ago · found 12 d ago · arXiv cs.AI · Training

Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation

The article introduces DigenRL, a disaggregated reinforcement learning framework designed for diffusion-based generative large language models (LLMs). Key innovations include a generation-axis pipeline (GAP) and time-step parallelism (TSP) for enhanced pipelining, an elastic trainer-assisted generation (TAG) approach for dynamic resource allocation, and an asynchronous strategy to optimize pipeline utilization. Experimental results demonstrate that DigenRL achieves throughput improvements of 1.56 to 2.10 times over existing systems like veRL-Omni and GenRL, making it a significant advancement for practitioners working on efficient RL systems in generative AI.

reinforcement-learning · llm · diffusion · agents

33 d ago · found 12 d ago · arXiv cs.AI · Agents

Safe and Generalizable Hierarchical Multi-Agent RL via Constraint Manifold Control

The article presents a hierarchical multi-agent reinforcement learning (MARL) framework that integrates constraint manifold control to enforce hard safety constraints while enabling coordination among agents. This approach provides theoretical safety guarantees and achieves stationary learning dynamics, leading to stable and efficient training. Empirical results demonstrate competitive performance with nearly perfect safety rates, making it significant for practitioners focused on safety-critical applications in multi-agent systems.

multi-agent · reinforcement-learning · safety

33 d ago · found 12 d ago · arXiv cs.AI · Coding

Navigating User Behavior toward Personalized Multimodal Generation

The paper introduces NaviGen, a novel approach for personalized multimodal content generation that enhances alignment between user intent and generated outputs. It utilizes a dual identifier system combining collaborative and textual codes to encode user behavior, followed by a two-stage training pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL) to improve instruction writing and preference reasoning. Experimental results demonstrate that NaviGen significantly enhances the quality of personalized image and video generation, making it a valuable tool for practitioners seeking to refine user interaction in AI-generated content.

personalized generation · AIGC · user behavior

33 d ago · found 12 d ago · arXiv cs.AI · Research

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

The article introduces the "Age of LLM," a new turn-based benchmark designed for evaluating the reasoning, diplomacy, and reliability of large language models (LLMs) in a competitive setting on a 13x7 grid. It emphasizes the impact of fog of war and strict JSON schema adherence, benchmarking 15 models across 54 matches and 5,258 actions, revealing insights such as the dominance of nuclear strategies and the relationship between reliability and winning outcomes. This benchmark provides a unique framework for practitioners to analyze LLM behavior under adversarial conditions, particularly in terms of belief tracking and cognitive strategies, with resources available for further exploration.

benchmark · llm · reasoning · diplomacy

33 d ago · found 10 d ago · arXiv cs.AI · Training

SEAL: Searching Expandable Architectures for Incremental Learning

SEAL is a newly introduced framework that integrates Neural Architecture Search (NAS) for data-incremental learning, addressing the challenge of balancing model plasticity and stability. It dynamically expands the model architecture only when necessary, guided by a capacity estimation metric, and employs cross-distillation training to mitigate forgetting. Experimental results show that SEAL improves accuracy and reduces resource usage, making it a promising approach for efficient incremental learning in resource-constrained environments.

incremental_learning · NAS · deep_learning

33 d ago · found 10 d ago · arXiv cs.AI · Research

Transformation Behavior of Images in Latent Space

The paper investigates the transformation behavior of images in latent space for histopathology classification, focusing on encoder networks from Lunit Inc., Bioptimus, and the Meta Research Team. It finds that while embeddings of original and transformed images maintain proximity, indicating robustness, they are not entirely invariant to transformations, highlighting the need for tailored encoder training to enhance performance in downstream tasks. This research underscores the importance of understanding latent space behavior for improving data augmentation strategies in histopathological applications.

latent space · image transformation

33 d ago · found 10 d ago · arXiv cs.AI · Research

A Fair Evaluation of Graph Foundation Models for Node Property Prediction

The paper presents a comprehensive evaluation of nine recent Graph Foundation Models (GFMs) for node property prediction, addressing the lack of standardized benchmarks in the field. Key findings indicate that only the latest GFMs leveraging the Prior-data Fitted Networks paradigm surpass well-tuned Graph Neural Networks (GNNs) in predictive performance, albeit with increased inference costs. This work is crucial for practitioners as it provides a clearer understanding of the trade-offs between GFMs and GNNs, facilitating informed model selection for applications in fraud detection and recommendation systems.

graph · node-prediction · evaluation

33 d ago · found 12 d ago · arXiv cs.AI · RAG

T2D-Bench: Evidence-Gated Evaluation of LLM Outputs for Type 2 Diabetes Using a Multi-Layer Clinical-Lifestyle Knowledge Graph

T2D-Bench is a new benchmark and evaluation framework designed to assess the compliance of large language model (LLM) outputs with explicit clinical guidelines for type 2 diabetes, utilizing a multi-layer clinical-lifestyle knowledge graph. It integrates biomedical data sources and ADA Standards of Care rules to evaluate LLM performance against 100 structured vignettes, revealing that baseline outputs from models like GPT-4o-mini and GPT-4 failed evidence-path checks in 35% and 33% of cases, respectively. This framework enables practitioners to identify and rectify unsupported clinical omissions in LLM-generated recommendations, enhancing their reliability in medical contexts.

LLM · benchmark · type 2 diabetes · evaluation

33 d ago · found 10 d ago · arXiv cs.AI · RAG

Quantifying Prior Dominance in RAG Systems

The paper introduces the Normalized Context Utilization (NCU) metric to evaluate Retrieval-Augmented Generation (RAG) systems, addressing limitations in current heuristics that conflate contextual information extraction with memory recall. It analyzes models ranging from 1.5B to 72B parameters, revealing that Small Language Models (SLMs) can outperform larger architectures in strict factual extraction, highlighting diminishing returns in scaling. The study emphasizes the importance of model architecture and alignment, noting significant issues with commercial APIs, including negative transfer and reliance on parametric priors over external evidence, which could inform practitioners about the effectiveness of different model sizes in RAG workflows.

rag · retrieval-augmented generation · contextual information

33 d ago · found 10 d ago · arXiv cs.AI · Inference

FP8 is All You Need (Part 2): Efficient Ozaki-Bailey Style FFT Through Tensor-core Garner Reformulation and Kulisch Escape Route

The paper describes an approach to achieving FP64-equivalent throughput for 3-D FFTs on NVIDIA's Blackwell Ultra (B300) GPU by running the Ozaki-Bailey FFT framework on FP8 tensor cores. The method uses a mantissa-sliced Chinese-remainder reconstruction and integrates Kulisch fixed-point arithmetic to maintain FP64 accuracy while operating on INT32, with a projected time of approximately 18 ms for a 1024^3 FFT. This is significant for practitioners because it enables efficient use of lower-precision compute in memory-bound workloads, and it paves the way for a dedicated libKulisch library and benchmarking efforts.

fft · tensor-core · optimization
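
For readers unfamiliar with the Chinese-remainder step mentioned in the summary above, here is a minimal Python sketch of textbook Garner reconstruction on ordinary integers. It is not the paper's tensor-core formulation; the function name `garner` and the small moduli are illustrative assumptions.

```python
def garner(residues, moduli):
    """Rebuild x from its residues modulo pairwise-coprime moduli.

    Textbook Garner algorithm (mixed-radix form). Illustrative only:
    the names and the moduli below are assumptions, not the paper's.
    Requires Python 3.8+ for pow(base, -1, mod).
    """
    digits = []
    for i, (r, m) in enumerate(zip(residues, moduli)):
        value = 0   # contribution of earlier mixed-radix digits, mod m
        weight = 1  # product of earlier moduli, mod m
        for j in range(i):
            value = (value + digits[j] * weight) % m
            weight = (weight * moduli[j]) % m
        # Solve value + digit * weight == r (mod m) for digit.
        digits.append(((r - value) * pow(weight, -1, m)) % m)

    # Assemble x = d0 + d1*m0 + d2*m0*m1 + ...
    x, weight = 0, 1
    for d, m in zip(digits, moduli):
        x += d * weight
        weight *= m
    return x


moduli = [251, 253, 255]          # pairwise coprime, chosen for illustration
x = 1234567                       # must be below the product of the moduli
residues = [x % m for m in moduli]
reconstructed = garner(residues, moduli)
```

Each residue channel can be processed independently before the final reconstruction, which is the property that makes such schemes attractive for low-precision hardware.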

33 d ago · found 10 d ago · arXiv cs.AI · Research

SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization

The article introduces SURGELLM, a unified transformer framework designed to enhance performance across diverse NLP tasks by addressing issues such as inductive bias mismatch and class-imbalance in feature statistics. Key innovations include a surgical feature gate, task-conditioned prefix tokens, and Instance-Weighted Normalization (IWN), which collectively improve macro-F1 scores, achieving 0.940 in benchmarks across four tasks. This framework is significant for practitioners as it provides a method to optimize multi-task learning by leveraging task-specific features and normalization techniques, potentially leading to more robust model performance in heterogeneous environments.

multi-task · evaluation · transformer

33 d ago · found 10 d ago · arXiv cs.AI · Training

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

The article introduces EMAgnet, a novel regularization method for policy gradient self-play in large games, which uses an exponential moving average (EMA) of the last-iterate policy's parameters as a dynamic target for regularization. This approach adapts to the agent's evolving strategy, leading to improved performance over traditional uniform distribution targets, particularly in two-player zero-sum games with exploration challenges and dominated strategies. EMAgnet demonstrates lower exploitability compared to PPO with uniform regularization, making it a significant advancement for practitioners working on reinforcement learning in complex game environments.

policy_gradient · self_play
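
The EMA target described in the summary above can be illustrated with a small Python sketch. It assumes a tabular softmax policy whose parameters are its logits and a KL penalty toward the EMA policy; the decay value, the KL form, and all names are assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(ema_params, params, decay):
    """One exponential-moving-average step in parameter space."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical current policy logits and an EMA copy that lags behind them.
params = [2.0, 0.0, -1.0]
ema_params = [0.0, 0.0, 0.0]
ema_params = ema_update(ema_params, params, decay=0.9)

# Regularization term pulling the current policy toward the EMA policy.
penalty = kl_divergence(softmax(params), softmax(ema_params))
```

Because the EMA parameters trail the current ones, the target moves with the agent's strategy instead of staying fixed at the uniform distribution.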

33 d ago · found 10 d ago · arXiv cs.AI · Training

FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction

FlowPipe introduces a novel framework for constructing data preparation pipelines using Conditional Generative Flow Networks (C-GFlowNets) and a Trajectory Balance objective to improve long-horizon credit assignment and exploration efficiency. By integrating Deep Semantic Modulation via Feature-wise Linear Modulation (FiLM), it allows for better conditioning of the pipeline decisions based on dataset semantics. Evaluated on 74 real-world datasets, FlowPipe demonstrates an average accuracy improvement of 11.96% and a 12.5x increase in training convergence speed compared to state-of-the-art methods, making it a significant advancement for practitioners in automated data pipeline construction.

data-preparation · pipeline · ml
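
For reference, the Trajectory Balance objective and FiLM named in the summary above are usually written as follows in the literature. This is the standard unconditional Trajectory Balance form for a trajectory from s_0 to s_T; how FlowPipe conditions it on dataset semantics is not stated in the summary, and the conditioning vector c in the FiLM expression is an assumed symbol.

```latex
\[
\mathcal{L}_{\mathrm{TB}}(\tau)
  = \left( \log \frac{Z_\theta \prod_{t=0}^{T-1} P_F(s_{t+1} \mid s_t; \theta)}
                     {R(s_T) \prod_{t=0}^{T-1} P_B(s_t \mid s_{t+1}; \theta)} \right)^{2},
\qquad
\mathrm{FiLM}(h \mid c) = \gamma(c) \odot h + \beta(c)
\]
```

Here Z_θ is a learned partition-function estimate, P_F and P_B are the forward and backward policies, R is the terminal reward, and γ and β are learned functions of the conditioning input.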

33 d ago · found 10 d ago · arXiv cs.AI · Research

MGI: Member vs Generated Inference

The article introduces the concept of Member vs Generated Inference (MGI), which addresses the challenge of distinguishing between samples from a generative model's training set and samples generated by the model itself. It presents a novel method called Data Circuit Breaker (DCB), which utilizes a three-stage approach combining signals from an autoencoder and latent generator, demonstrating effectiveness across various generative models, including image autoregressive and diffusion models. This advancement is significant for practitioners as it enhances the reliability of membership inference in scenarios where models may reproduce training data, thereby improving the security and integrity of generative AI applications.

generative_models · membership_inference · data_security

33 d ago · found 10 d ago · arXiv cs.AI · Research

BioPIE: A Biomedical Protocol Information Extraction Dataset for Experiment Understanding

The Biomedical Protocol Information Extraction Dataset (BioPIE) has been released to enhance the extraction of structured knowledge from biomedical experiments, addressing challenges such as High Information Density (HID) and Multi-Step Reasoning (MSR). BioPIE provides procedure-centric knowledge graphs that detail entities, actions, and relations, facilitating fine-grained understanding of experimental protocols. This dataset allows for improved evaluation of information extraction methods and supports the development of question answering systems, ultimately aiding practitioners in laboratory automation and cross-disciplinary communication.

biomedical · data-extraction

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Emergent Relational Order in LLM Agent Societies: From Collective Affect to Authority Stratification

The article introduces CAREB-MAS, a multi-agent framework designed to explore long-term social structures in agent societies using principles from Affect Control Theory and Social Identity Theory. The framework enables agents to develop egocentric identities and interact based on minimal protocols, leading to the emergence of five key phenomena associated with Differential Order, including stable labor specialization and emergent relational authority. This research highlights the potential of LLM-based simulations to provide insights into social dynamics and structures, which is crucial for practitioners aiming to model complex social interactions in AI systems.

multi_agent_systems · social_dynamics · llm

33 d ago · found 10 d ago · arXiv cs.AI · Research

EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

The Evidence-Grounded Video Question Answering Benchmark (EG-VQA) has been introduced to address the gap between answer correctness and evidence grounding in VideoQA, consisting of 2,067 videos and 11,838 QA pairs with detailed temporal evidence annotations. The benchmark employs a new metric, Evidence-Grounded F1 (EG-F1), to evaluate both temporal alignment and semantic consistency of predictions against ground-truth evidence. Results indicate that existing models, including proprietary ones, struggle with evidence localization, highlighting the need for structured evidence supervision, which is addressed by the proposed EG-Reasoner model that achieves state-of-the-art performance among open-source models, particularly on reasoning-intensive tasks.

video-llm · benchmark · qa

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

OrbitForge: Text-to-3D Scene Generation via Reconstruction-Anchored Video Synthesis

OrbitForge is a novel adapter designed for text-to-3D scene generation, utilizing frozen video priors and Gaussian Splatting reconstruction optimization to convert text-generated videos into consistent 3D Gaussian Splatting scenes. It leverages Deformable Gaussian Splatting for initial reconstruction and completes missing views using the text-to-video model, achieving a median span of 359.0 degrees on the T3Bench-derived audit and significantly improving the ImageReward metric from 8.07 to 16.36. This approach streamlines the process without requiring task-specific fine-tuning, making it a valuable tool for practitioners aiming to enhance 3D consistency in generated scenes.

text-to-3d · video-synthesis

33 d ago · found 10 d ago · arXiv cs.AI · Research

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

The paper introduces CORE-Bench, a benchmark designed to assess the computational reproducibility of AI agents in scientific research, comprising 270 tasks derived from 90 studies across multiple disciplines. It evaluates agents like AutoGPT and CORE-Agent using GPT-4o and GPT-4o-mini, with the top agent achieving only 21% accuracy on the most challenging tasks, highlighting significant room for advancement in automating scientific processes. This benchmark is crucial for enhancing reproducibility in research, enabling the development of more capable AI agents that can not only replicate but also innovate in scientific inquiry.

agents · reproducibility · benchmark

33 d ago · found 10 d ago · arXiv cs.AI · Research

FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

FISHER, a foundation model for multi-modal industrial signal representation, addresses the M5 problem of data heterogeneity through a novel sub-band modeling approach that effectively manages variable sampling rates without resampling. Pre-trained via teacher-student self-distillation on external audio and music data, FISHER demonstrates superior performance against 24 state-of-the-art series encoders, achieving up to 16x smaller model sizes while maintaining high diagnostic accuracy. The establishment of the RMIS benchmark, which includes 19 datasets across four modalities, provides a robust framework for evaluating multi-modal industrial signal processing, making FISHER a significant advancement for practitioners in the field.

foundation_model · industrial_signals · data_analysis

33 d ago · found 10 d ago · arXiv cs.AI · Training

Fast and Slow Variational Continual Learning

The paper introduces the Continual IVON (CoVON) optimizer, which integrates fast and slow adaptation mechanisms into the Variational Continual Learning (VCL) framework to enhance continual learning in deep networks. By merging past posteriors to create a stable prior for fast-weight updates, CoVON demonstrates superior performance over existing VCL optimizers and traditional weight-regularization methods in domain-incremental learning and fine-tuning of large language models. This advancement is significant for practitioners as it provides a more effective optimization strategy for maintaining model performance during continual learning scenarios.

continual_learning · optimization

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Toward Self-Evolution-Ready Workflow Harnesses: A Reversible Migration Path and Convertibility Taxonomy for Expert LLM Pipelines

The article presents a framework for evolving expert "LLM + script" workflows into adaptable systems through a reversible migration path, termed the Strangler-Fig approach. This framework introduces a three-tier convertibility taxonomy (A/B/C) that assesses and routes legacy workflows into composable, typed, and auditable stages, addressing the need for dynamic adaptation based on feedback. This development is significant for practitioners as it provides a structured method to modernize existing workflows, enhancing their flexibility and responsiveness in AI applications.

workflow · llm · migration

33 d ago · found 10 d ago · arXiv cs.AI · Research

Beyond the Autoregressive Horizon: A Comprehensive Survey of Diffusion Models, World Modelling, and State Space Models for Code

The paper presents a survey on the limitations of autoregressive (AR) models in automated software engineering and explores alternative paradigms such as Diffusion Models, Code World Models (CWMs), and State Space Models (SSMs). Diffusion Models address the shortcomings of AR by enabling holistic denoising for long-range syntactic constraints, while CWMs and SSMs enhance reasoning and efficiency in code generation. This research is significant for practitioners as it highlights potential architectural advancements that could improve code intelligence and reasoning capabilities in AI systems.

diffusion models · code generation · autoregressive models

33 d ago · found 10 d ago · arXiv cs.AI · Safety

One Year Later...The Harms Persist, But So Do We!

This study evaluates six proprietary large language models (LLMs) in the context of mental health, assessing their performance across 16 DSM-5 conditions using four adversarial attack variants. An eight-dimension harm taxonomy and a multi-dimensional evaluation framework were introduced, revealing that safeguards are effective primarily for suicide and self-harm, while models failed to protect against risks associated with eating disorders, substance use, and major depressive disorder, with failure rates reaching 100%. The findings underscore the urgent need for clearly defined harm categories and robust safety measures in the deployment of LLMs in sensitive applications to mitigate risks to vulnerable populations.

llm · mental_health · safeguards · clinical

33 d ago · found 10 d ago · arXiv cs.AI · Multimodal

Inclusive Interactive Collisions for Multi-View Consistent Compositional 3D Generation

The article introduces I2C-3D, an optimization-based method aimed at generating multi-view consistent compositional 3D assets that address the challenges of interaction modeling among Gaussian primitives and cross-view inconsistency. Key innovations include the Inclusive Interactive Collisions strategy for physically plausible interactions and a Multi-View Adaptive Score Distillation Sampling technique that enhances multi-view consistency by modulating attention maps across viewpoints. This advancement is significant for practitioners as it allows for the creation of high-fidelity 3D scenes with improved interaction realism and flexibility in 3D editing.

3d generation · text-to-image · compositional

33 d ago · found 10 d ago · arXiv cs.AI · Safety

Are Safety Guarantees in Neural Networks Safe? How to Compute Trustworthy Robustness Certifications

The paper introduces a novel approach to computing robustness certifications for neural networks, focusing on the apothem measure, which allows for the determination of apothem-optimal certifications with a linear number of calls to a neural network verifier. It highlights the limitations of existing volume-optimal certification methods due to intractability and presents dual certifications that provide tighter bounds. The proposed ParallelepipedoNN system demonstrates at least a two-fold improvement in minimum edge length on the MNIST and Fashion MNIST benchmarks, which is significant for practitioners seeking efficient methods to enhance neural network robustness against adversarial attacks.

neural_networks · robustness_certifications · adversarial_examples

33 d ago · found 10 d ago · arXiv cs.AI · Inference

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

CrossPool is a new serving engine designed for cold mixture of experts (MoE) models, addressing GPU memory inefficiencies by separating feedforward network (FFN) weights and key-value (KV) caches into distinct memory pools. This architecture allows for dynamic KV-cache allocation based on active demand while consolidating FFN weights across multiple models, significantly improving GPU memory utilization and supporting long-context requests. CrossPool demonstrates a performance improvement over existing KV-cache-based multi-LLM serving systems, achieving up to a 10.4x reduction in P99 tail latency, which is crucial for practitioners aiming to optimize resource allocation and response times in LLM deployments.

llm · serving · memory

33 d ago · found 10 d ago · arXiv cs.AI · Research

Towards Version-aware Operations and Transaction Memories for Multi-layer MeMo

The paper introduces MeMo, a framework utilizing multi-layer correlation matrix memories (CMMs) to facilitate version-aware operations in language models, allowing for efficient knowledge updates without full retraining. It proposes a version-aware operation layer that includes high-level functions such as replace, obsolete, and rollback, which are implemented as primitive calls over sequences and tokens. This architecture aims to enhance the adaptability of language models by enabling structured edits and maintaining historical data, thereby improving the efficiency and effectiveness of knowledge management in AI systems.

language_models · memory
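
As background for the summary above, a correlation matrix memory stores key-value associations as a sum of outer products and retrieves a value by multiplying the matrix with a key. The Python sketch below shows that general idea with orthonormal keys; the class, its method names, and the "replace as forget then store" step are illustrative assumptions, not MeMo's actual operations.

```python
class CorrelationMatrixMemory:
    """Toy correlation matrix memory: M = sum of outer(value, key)."""

    def __init__(self, key_dim, value_dim):
        self.m = [[0.0] * key_dim for _ in range(value_dim)]

    def _add(self, key, value, sign):
        # Add (or subtract) the outer product value * key^T.
        for i, vi in enumerate(value):
            for j, kj in enumerate(key):
                self.m[i][j] += sign * vi * kj

    def store(self, key, value):
        self._add(key, value, 1.0)

    def forget(self, key, value):
        self._add(key, value, -1.0)

    def retrieve(self, key):
        # Matrix-vector product M @ key; exact when keys are orthonormal.
        return [sum(mij * kj for mij, kj in zip(row, key)) for row in self.m]


memory = CorrelationMatrixMemory(key_dim=2, value_dim=3)
key_a, key_b = [1.0, 0.0], [0.0, 1.0]   # orthonormal keys (assumption)
memory.store(key_a, [1.0, 2.0, 3.0])
memory.store(key_b, [4.0, 5.0, 6.0])
before = memory.retrieve(key_a)

# Hypothetical "replace": remove the old association, then write the new one.
memory.forget(key_a, [1.0, 2.0, 3.0])
memory.store(key_a, [7.0, 8.0, 9.0])
after = memory.retrieve(key_a)
```

The edit touches only the association for `key_a`, which is the sense in which such memories allow targeted updates without retraining.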

33 d ago · found 10 d ago · arXiv cs.AI · Agents

DeepBD: A Grounded Agentic Workflow for Variant Prioritization and Diagnosis of Genetic Birth Defects

DeepBD is a novel workflow designed for the prioritization and diagnostic interpretation of genetic variants associated with birth defects. It integrates a pretrained evidence engine that evaluates patient-specific variant scores using structured rule evidence and phenotype-conditioned biological context, achieving Recall@1/3/5/10 scores of 0.658/0.882/0.912/0.929 on a benchmark of 18,622 cases, outperforming existing tools like Exomiser and DeepRare. This approach is significant for practitioners as it enhances the accuracy of variant prioritization by combining various evidence sources and LLM-assisted review, thereby improving diagnostic outcomes in clinical genetics.

llm · genetic · agents · workflow
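
The Recall@k figures quoted in the summary above follow the usual ranking metric, sketched here in Python. The sketch assumes one known causal variant per case; the variant identifiers and the function name are hypothetical.

```python
def recall_at_k(ranked_lists, true_items, k):
    """Fraction of cases whose true item appears in the top-k of its ranking."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_items) if truth in ranked[:k]
    )
    return hits / len(true_items)


# Three hypothetical cases, each with a ranked list of candidate variants.
rankings = [
    ["v1", "v2", "v3"],
    ["v4", "v5", "v6"],
    ["v7", "v8", "v9"],
]
causal = ["v1", "v6", "v8"]  # assumed single causal variant per case

r1 = recall_at_k(rankings, causal, k=1)
r2 = recall_at_k(rankings, causal, k=2)
r3 = recall_at_k(rankings, causal, k=3)
```

Recall@k can only stay flat or rise as k grows, which matches the increasing sequence of scores reported for k = 1, 3, 5, and 10.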

33 d ago · found 10 d ago · arXiv cs.AI · Training

Co-occurring associated retained concepts in Diffusion Unlearning

The article introduces ReCARE (Robust erasure for CARE), a novel framework designed to enhance unlearning in diffusion models by preserving co-occurring associated retained concepts (CARE) while effectively erasing target concepts. It defines the CARE score as a metric for quantifying the preservation of these concepts and presents extensive experimental results demonstrating that ReCARE achieves state-of-the-art performance in maintaining utility and concept erasure across various targets, including nudity and artistic styles. This advancement is significant for practitioners as it addresses the challenge of harmful content generation without compromising the generation of benign associated concepts.

unlearning · diffusion models · content generation

33 d ago · found 10 d ago · arXiv cs.AI · Agents

2.5-D Decomposition for LLM-Based Spatial Construction

The paper introduces a neuro-symbolic pipeline utilizing 2.5-D decomposition, which enables large language models (LLMs) to plan in a two-dimensional space while a deterministic executor handles vertical placements, significantly reducing systematic coordinate errors in spatial reasoning for autonomous construction. On the Build What I Mean benchmark, the GPT-4o-mini model integrated with this pipeline achieved a mean structural accuracy of 94.6%, outperforming GPT-4o and other competing systems, while similar results were reported with Nemotron-3 120B running on edge hardware. This approach is relevant for practitioners as it enhances LLM performance in tasks constrained by physical dimensions, potentially improving reliability in various autonomous construction applications.

llm · spatial_reasoning · 2.5D

33 d ago · found 10 d ago · arXiv cs.AI · RAG

RASC+: Retrieval-Constrained LLM Adjudication for Clinical Value Set Authoring

The article introduces a novel approach called Retrieval-Constrained LLM Adjudication for authoring clinical value sets, addressing limitations in zero-shot LLM generation for clinical code systems. By utilizing a Qwen3-based retrieval mechanism with vocabulary-aware expansion, the candidate-pool recall improved from 0.553 to 0.730, while the integration of GPT-5 for adjudication significantly enhanced macro F1 scores from 0.287 to 0.549. This method is crucial for practitioners as it demonstrates a reliable framework for improving the accuracy and safety of clinical code retrieval in quality measurement and decision support applications.

clinical_value_sets · retrieval_augmented

33 d ago · found 10 d ago · arXiv cs.AI · Coding

SemChunk-C: Semantic Segmentation for C Code

The paper introduces SemChunk-C, a family of lightweight language models designed for semantic segmentation of C-related code, utilizing four Ettin encoders with parameter sizes of 17M, 32M, 68M, and 150M. The models effectively identify chunk boundaries and assign functional attributes, achieving high accuracy and semantic coherence on real-world code, including complex constructs like nested definitions and macros. This advancement is significant for practitioners as it enhances code retrieval and other downstream tasks by providing more meaningful functional units compared to existing methods.

semantic segmentation · code chunking · llm

33 d ago · found 10 d ago · arXiv cs.AI · Inference

Efficient Test-time Inference for Generative Planning Models with OCL Search

This paper presents an optimized inference method for generative planning models using a modified Open-Closed List (OCL) search algorithm. The approach integrates a generative model for rapid rollouts and a heuristic model for prioritizing reasoning paths, resulting in improved computational efficiency and solution quality across various combinatorial planning domains. This advancement is significant for practitioners as it enhances the performance of generative models without requiring extensive computational resources during inference.

inference · planning · OCL
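
The open and closed lists mentioned in the summary above are the bookkeeping structures of classic best-first search, sketched below in Python. This is the generic algorithm, not the paper's modified variant: the `expand` function stands in for the generative model's rollouts and `heuristic` for the learned prioritization model, and the toy number puzzle is purely illustrative.

```python
import heapq


def best_first_search(start, is_goal, expand, heuristic):
    """Generic best-first search with an open list (heap) and a closed set."""
    counter = 0  # tie-breaker so the heap never compares states or paths
    open_list = [(heuristic(start), counter, start, [start])]
    closed = set()
    while open_list:
        _, _, state, path = heapq.heappop(open_list)
        if is_goal(state):
            return path
        if state in closed:
            continue
        closed.add(state)
        for nxt in expand(state):
            if nxt not in closed:
                counter += 1
                heapq.heappush(
                    open_list, (heuristic(nxt), counter, nxt, path + [nxt])
                )
    return None


# Toy domain (assumption): reach 10 from 1 using "+1" and "*2" moves.
TARGET = 10
path = best_first_search(
    start=1,
    is_goal=lambda n: n == TARGET,
    expand=lambda n: [n + 1, n * 2],
    heuristic=lambda n: abs(TARGET - n),
)
```

The open list orders candidate states by heuristic score, while the closed set prevents the search from expanding the same state twice.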

33 d ago · found 10 d ago · arXiv cs.AI · Agents

TACTFUL: Tactile-Driven Exploration For Object Localization and Identification in Confined Environments

TACTFUL is a novel tactile exploration framework designed for multi-fingered robots, enabling vision-free object localization and identification in confined environments. It employs a single policy trained on real hardware, achieving a 77% success rate and a 0.015 m average reconstruction error through a dynamic reward schedule that balances global exploration and local refinement. This approach highlights the potential of tactile sensing as a primary modality for object-level reasoning, offering significant implications for practitioners developing autonomous robotic systems.

tactile · robotics · exploration

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Evolving Programmatic Skill Networks

The article introduces the Programmatic Skill Network (PSN), a framework for continual skill acquisition in embodied environments that utilizes large language models to create executable symbolic programs. Key mechanisms include structured fault localization, maturity-aware optimization, and canonical structural refactoring, which enhance skill stability and adaptability. Experiments conducted in MineDojo and Crafter show that PSN achieves effective skill reuse and generalization, highlighting its potential for advancing AI agents in dynamic task environments.

skill-acquisition · agents

33 d ago · found 12 d ago · arXiv cs.AI · Research

Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

This study presents a large-scale analysis of algorithm co-occurrence networks in natural language processing (NLP), utilizing deep learning models to extract algorithm entities from over four decades of academic papers. By constructing cumulative and annual co-occurrence networks, the research reveals structural characteristics and centrality measures that highlight the collective influence of algorithms, showing that classic and interdisciplinary algorithms maintain high centrality and popularity. This work lays the groundwork for understanding algorithmic influence in a network context, which is crucial for practitioners aiming to navigate the evolving landscape of AI research and applications.

algorithm influence · co-occurrence network · NLP · academic papers

33 d ago · found 10 d ago · arXiv cs.AI · Research

MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching

The article introduces MedPCFM, a novel approach for medical point cloud completion that integrates Point Transformer V3 (PTv3) and flow matching. The method demonstrates state-of-the-art generative performance on datasets such as SkullFix, SkullBreak, and the Mandibular Defect dataset, achieving significant improvements in throughput with up to a 7× speed-up compared to PVCNN. This work is crucial for AI practitioners as it enhances anatomical reconstruction efficiency and offers insights into scaling performance with varying model sizes and point resolutions.

medical · point cloud · completion

33 d ago · found 10 d ago · arXiv cs.AI · Research

A Pāṇinian Foundation for Indic Language Processing

The article proposes a Pāṇinian framework for natural language processing (NLP) in Indic languages, highlighting the shared morphosyntactic architecture derived from Pāṇini's grammar, the Aṣṭādhyāyī. It introduces a four-part benchmark suite aimed at improving the accuracy, data efficiency, and transferability of NLP systems for these languages by consolidating disparate resources into a unified framework. This approach could enhance model interpretability by examining whether neural models inherently capture Pāṇini's linguistic categories, which is crucial for practitioners developing robust AI applications in this domain.

indic languages · natural language processing · computational architecture

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis is a self-evolving agent system that uses a hierarchical dual-representation memory to bridge text and code memory, allowing for improved experience reuse. A controlled study finds that text and code representations offer complementary benefits, which motivates an architecture that organizes experiences into execution plans and callable tools. Evaluated on the AppWorld benchmark, Metis demonstrates up to a 20.6% improvement in task accuracy and a 22.8% reduction in execution costs compared to the ReAct system, highlighting its efficiency and effectiveness for practitioners developing interactive agents.

self-evolving agents · memory · text and code

33 d ago · found 10 d ago · arXiv cs.AI · Training

Scaling Laws for Task-Specific LLM Distillation

The paper presents empirical scaling laws for the distillation of task-specific large language models (LLMs), focusing on the trade-offs between in-domain and general knowledge performance as influenced by dataset size, compression ratio, and supervision format. It introduces a blended chain-of-thought supervision loss to enhance distillation stability and compares logit-based and LoRA-based approaches under iterative structural pruning, revealing that supervision format significantly impacts performance retention during compression. The authors release the FinHeadlineMix dataset and provide practical guidelines, offering a framework for practitioners to make informed decisions on domain-specific LLM compression strategies.

llm · distillation · scaling laws

33 d ago · found 10 d ago · arXiv cs.AI · Agents

Engineering Reliable Autonomous Systems: Challenges and Solutions

The report from the "Engineering Reliable Autonomous Systems" (ERAS) workshop, held in June 2024, outlines key challenges and solutions in the field of autonomous systems engineering. It identifies critical areas such as verification and validation techniques, real-world engineering practices, and safe software architectures, culminating in a catalogue of challenges and proposed pathways for addressing them. This roadmap is significant for practitioners as it bridges the gap between academic techniques and practical implementation, fostering collaboration and advancing research in reliable autonomous systems.

autonomous_systems · engineering · reliability

33 d ago · found 10 d ago · arXiv cs.AI · Research

Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks

The article introduces GSU-DBNet, a dual-branch spiking neural network architecture designed for neuromorphic speech enhancement, featuring a gated spiking unit (GSU). This model simultaneously processes the speech magnitude and complex spectra, achieving a PESQ score of 3.04 with just 394K parameters, which is significantly fewer than traditional ANN models (4.5%–10.6% of their parameters). This advancement in SNN architecture enhances energy efficiency and spatiotemporal feature representation, making it a relevant development for practitioners focused on efficient AI speech processing solutions.

spiking_neural_networks · speech_enhancement · neuromorphic

33 d ago · found 10 d ago · arXiv cs.AI · Training

DynaWM: Dynamics-Aware Distillation with World Model and Momentum Targets for Smooth Locomotion over Continuous Stairs

The article introduces DynaWM, a dynamics-aware representation learning framework designed to improve bipedal-wheeled robots' ability to traverse continuous stairs. Key innovations include the incorporation of a world model as a regularizer for enhanced terrain encoding and a momentum target encoder to stabilize knowledge transfer during distillation. Experimental results indicate that DynaWM significantly improves terrain adaptability and motion smoothness, making it relevant for practitioners focused on advancing robotic locomotion in complex environments.

representation learningdynamics-awareknowledge transfer

33 d agofound 10 d agoarXiv cs.AITraining

Blockwise Policy-Drift Gating for On-Policy Distillation

The paper introduces blockwise policy-drift gating, a method designed to enhance on-policy distillation (OPD) for long-horizon reasoning tasks by implementing a lightweight drift controller that operates solely on the student policy. This approach computes log-probability shifts between the behavior and current student policies over fixed blocks, improving the mean pass@8 metric from 0.4978 to 0.5160 in a six-variant Qwen3 math reasoning benchmark, suggesting that block-level gating can effectively stabilize performance in OPD scenarios. This advancement is significant for practitioners as it offers a straightforward mechanism to improve robustness in model training without altering teacher targets or rollout policies.
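One way to picture a block-level drift gate is sketched below. It assumes per-token log-probabilities of the sampled tokens are available under both the behavior policy and the current student policy. The drift statistic (mean absolute shift), the hard 0/1 gate, and the values of `block_size` and `max_drift` are assumptions made for illustration; the summary does not specify the paper's controller.

```python
from typing import List


def block_drift_gate(
    behavior_logprobs: List[float],
    current_logprobs: List[float],
    block_size: int = 4,
    max_drift: float = 0.5,
) -> List[float]:
    """Return a per-token 0/1 mask that zeroes out blocks with large drift."""
    if len(behavior_logprobs) != len(current_logprobs):
        raise ValueError("log-prob sequences must align token by token")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask: List[float] = []
    for start in range(0, len(current_logprobs), block_size):
        end = min(start + block_size, len(current_logprobs))
        # Hypothetical drift statistic: mean absolute log-prob shift in the block.
        shifts = [abs(current_logprobs[i] - behavior_logprobs[i]) for i in range(start, end)]
        drift = sum(shifts) / len(shifts)
        gate = 1.0 if drift <= max_drift else 0.0
        mask.extend([gate] * (end - start))
    return mask
```

The mask would multiply the per-token distillation loss, so only the student-side quantities are touched and the teacher targets and rollouts stay as they are.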

policy distillationon-policyreinforcement learning

33 d agofound 10 d agoarXiv cs.AIMultimodal

Listening makes Vision Clear for VLMs

The paper introduces Prompt-Vision Token Activation Map (PV-TAM), a novel approach for evaluating vision-language model (VLM) consistency by addressing issues of decoding drift and bias from structural tokens. PV-TAM enhances alignment measurement by incorporating peak attention distribution rather than solely relying on overlap masks, leading to improved performance in localization metrics across multiple datasets. This method is significant for practitioners as it provides a more reliable evaluation of VLMs, potentially leading to better model training and deployment strategies.

vision_language_modelsattentionsemantic_evaluation

33 d agofound 12 d agoarXiv cs.AITraining

An Introduction to Causal Reinforcement Learning

The article introduces the concept of Causal Reinforcement Learning (CRL), which integrates causal inference principles with reinforcement learning (RL) methodologies. It proposes a formalization of environments as structural causal models, allowing for a unified approach to various learning modalities, including online, off-policy, and imitation learning. This integration is significant for practitioners as it opens new avenues for optimizing RL policies by leveraging counterfactual reasoning, enhancing the understanding of agent behavior in complex environments.

causal inferencereinforcement learningcounterfactuals

33 d agofound 12 d agoarXiv cs.AITraining

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

The paper compares multilingual and English-only encoder initialization when adapting a streaming speech recognition model to new languages, using a 0.6B-parameter FastConformer transducer across eight European languages. The advantage of multilingual initialization shrinks as target-language data grows and becomes negligible at 2,500 hours, while streaming latency does not significantly affect performance. In addition, 4-bit weight-only quantization shrinks the model roughly threefold with a minimal increase in word error rate. Together, these results give practitioners guidance for low-data scenarios and suggest that latency and quantization choices can be made independently.

speech recognitionmultilingualdata scale

33 d agofound 12 d agoarXiv cs.AIAgents

LemonHarness Technical Report

LemonHarness is a newly announced integrated execution framework designed for long-horizon language model agents, establishing explicit execution boundaries to manage state changes during multi-step tasks. It constrains operations like file writes and artifact generation within a defined workspace, enhancing tracking and execution stability. Benchmark results show that LemonHarness_GPT-5.3-CodeX achieved 84.49% accuracy on Terminal-Bench 2.0, while the framework paired with the more powerful GPT-5.5 increased accuracy to 86.52%, highlighting its potential for improving the reliability of AI agents in complex workflows.

LLMexecution frameworkworkspace

33 d agofound 12 d agoarXiv cs.AISafety

Probing the Misaligned Thinking Process of Language Models

This paper introduces a framework for detecting misaligned behaviors in large language models by identifying 18 fine-grained cognitive processes termed "misalignment indicators." Utilizing linear probes to analyze internal activations, the authors achieve a 0.935 AUROC in distinguishing misalignment across five behaviors while maintaining low false positive rates on benign inputs. This work is significant for practitioners as it provides a systematic approach to monitor and mitigate risks associated with deploying language models in sensitive applications.

misalignmentdetectioncognitive processes

33 d agofound 12 d agoarXiv cs.AIAgents

Governed Shared Memory for Multi-Agent LLM Systems

The paper introduces a framework for governed shared memory in multi-agent LLM systems, addressing key issues such as unauthorized leakage and stale data propagation through defined primitives like scoped retrieval and provenance tracking. Implemented in MemClaw and evaluated with ArgusFleet, the system achieved 100% accuracy in provenance reconstruction and optimized write-to-visible latency to a single search round-trip, while revealing architectural challenges like asymmetric scope enforcement and pipeline ordering conflicts. This work underscores the necessity of explicit systems-level abstractions for effective multi-agent memory management in production environments, highlighting the importance of real-world evaluations to identify potential failures.

multi-agent-systemsmemory-management

33 d agofound 12 d agoarXiv cs.AITraining

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

The paper presents a novel approach to reinforcement learning (RL) focused on achieving broad and persistent model alignment by training on a dataset designed to enhance beneficial traits like truthfulness and fairness across diverse domains such as health and education. The study demonstrates that models trained with this beneficial trait RL outperform compute-matched baselines on over 80% of more than 50 independent out-of-distribution alignment benchmarks, indicating significant alignment transfer and improved robustness against adversarial prompts. This work is crucial for practitioners as it suggests a pathway to develop RL systems that are more resilient to misalignment and better aligned with human values in real-world applications.

reinforcement-learningalignmentbeneficial-models

33 d agofound 10 d agoarXiv cs.AITraining

LaGO: Latent Action Guidance for Online Reinforcement Learning

The paper introduces Latent Action Guidance for Online Reinforcement Learning (LaGO), which utilizes a pretrained large language model (LLM) to provide latent action priors that enhance online policy optimization, rather than functioning as a direct controller. Experiments on the CLEVR-Robot and Meta-World benchmarks reveal that LaGO improves average success rates significantly, from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World, indicating that leveraging LLMs can effectively enhance planning and decision-making in reinforcement learning contexts. This approach may offer practitioners a more reliable method for integrating LLMs into reinforcement learning frameworks.

reinforcement learningpolicy optimizationlatent action

33 d agofound 12 d agoarXiv cs.AITraining

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

The paper introduces Strategy-Guided Policy Optimization (SGPO), a novel framework that enhances reasoning capabilities in language models by distilling reusable strategies instead of merely imitating specific solution trajectories. SGPO employs a token-level forward-KL objective to transfer strategic guidance into unguided policies and utilizes adaptive instance-level weighting to optimize the distillation process based on model competence. Experimental results demonstrate that SGPO significantly outperforms traditional methods, including supervised fine-tuning and reinforcement learning approaches, achieving an average score improvement of 2.2 points on the Qwen2.5-7B-Instruct model across four mathematical benchmarks, highlighting its potential for enhancing generalization in AI applications.
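A token-level forward-KL objective can be sketched as follows. The sketch assumes the strategy-guided policy plays the reference role and the unguided policy is the one being trained, so the divergence is KL(guided || unguided) at each token position. The function names are hypothetical and the adaptive instance-level weighting described above is left out.

```python
import math
from typing import List


def forward_kl(guided: List[float], unguided: List[float], eps: float = 1e-12) -> float:
    """KL(guided || unguided) for one token's next-token distribution."""
    if len(guided) != len(unguided):
        raise ValueError("distributions must cover the same vocabulary")
    # Terms with zero guided probability contribute nothing to the sum.
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(guided, unguided) if p > 0.0)


def token_level_loss(guided_seq: List[List[float]], unguided_seq: List[List[float]]) -> float:
    """Mean forward KL over all token positions in a sequence."""
    if not guided_seq or len(guided_seq) != len(unguided_seq):
        raise ValueError("sequences must be non-empty and aligned")
    per_token = [forward_kl(p, q) for p, q in zip(guided_seq, unguided_seq)]
    return sum(per_token) / len(per_token)
```

Forward KL penalizes the unguided policy wherever it assigns low probability to tokens the guided policy favors, which pushes it to cover the guided behavior.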

policy optimizationLLMreasoningstrategy-guided

33 d agofound 12 d agoarXiv cs.AIAgents

Agentic AI for Bilevel Long-Term Optimization of Policy-Driven Physical Layer Systems

The paper introduces Agentic long-term performance optimization (Agentic-LTPO), a bilevel optimization framework aimed at improving adaptive physical layer configurations in response to changing network policies and real-time constraints. It employs a multi-agent decision process for upper-level configuration generation and a closed-form beamformer for lower-level optimization, achieving a 57.2% improvement in long-term performance over traditional methods in a cell-free MIMO beamforming scenario. This approach is significant for practitioners as it enhances system adaptability and efficiency in dynamic network environments.

optimizationpolicy-drivenagentic-ai

33 d agofound 10 d agoarXiv cs.AICoding

Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

The paper presents an evaluation of LLM-based dialogue systems, specifically GitHub Copilot, in the context of assessing Non-Functional Requirements (NFRs) related to HIPAA compliance. It identifies limitations in current benchmarks that focus on functional correctness, proposing new methods to evaluate multi-turn interactions based on requirement satisfaction, reasoning, and code localization. The study reveals a discrepancy between developer agreement with LLM outputs and low accuracy against expert assessments, highlighting the need for improved designs in LLM dialogue systems to enhance satisfaction and effectiveness in collaborative reasoning.

llmdialoguenfr assessment

33 d agofound 12 d agoarXiv cs.AIAgents

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

The paper introduces AgenticInterpBench, a benchmark comprising 84 semi-synthetic transformer circuits with 163 component-level annotations, aimed at assessing language model (LM) agents' ability to explain identified circuits in mechanistic interpretability. It presents HyVE (Hypothesize, Validate, Explain), an agentic explainer that utilizes an iterative process to produce detailed explanations, demonstrating that while various LM backbones can generate useful insights, challenges in the validation phase hinder consistent performance. This work is significant for practitioners as it highlights the potential of LMs in circuit explanation while emphasizing the need for robust validation mechanisms to enhance interpretability in AI systems.

mechanistic interpretabilitylanguage modelagentsexplanation

33 d agofound 10 d agoarXiv cs.AIAgents

When AI Meets Finance (StockAgent): Large Language Model-based Stock Trading in Simulated Real-world Environments

The article introduces StockAgent, a multi-agent AI system utilizing large language models (LLMs) to simulate stock trading behaviors in response to external factors such as macroeconomic conditions and policy changes. StockAgent addresses the issue of test set leakage common in previous AI trading simulations, allowing for a more accurate analysis of trading behaviors and profitability under realistic market conditions. This framework provides insights that can enhance LLM-based investment strategies and stock recommendations, making it significant for practitioners in finance and AI.

llmstock_tradingmulti-agent

33 d agofound 10 d agoarXiv cs.AIResearch

Rapid FinFET Modelling Using an Autoencoder

This study introduces a machine learning framework utilizing an autoencoder (AE) for efficient FinFET modeling, calibrated with a BSIM-CMG model to generate a dataset of current-voltage characteristics. The autoencoder compresses full I-V curves into a low-dimensional latent space while incorporating parameters like drain to source voltage (VDS) to enhance bias-dependent variations. This method achieves high accuracy in reconstructing I-V curves and extracting key metrics such as threshold voltage (VTH) and peak transconductance (gm), offering a valuable tool for rapid device characterization and circuit simulation with minimal training data.

autoencoderfinfet

33 d agofound 10 d agoarXiv cs.AIResearch

On the Stability of Prompt Ranking in Large Language Model Evaluation

This paper presents a systematic study on the stability of prompt rankings in large language models (LLMs), evaluating three open-weight models across two benchmark tasks. The authors find that while rank correlations are generally moderate to high, the top-performing prompt can vary significantly with minor changes in evaluation conditions. They propose a stability-aware selection strategy using a lower confidence bound to improve robustness in unstable settings, emphasizing the need to consider evaluation uncertainty in prompt selection and benchmarking for LLM practitioners.
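The idea of selecting by a lower confidence bound rather than by the raw mean can be sketched as below. The bound used here (mean minus `z` standard errors across repeated evaluation runs) and the names `lower_confidence_bound` and `select_prompt` are assumptions for illustration; the paper's exact bound is not given in the summary.

```python
import math
import statistics
from typing import Dict, List


def lower_confidence_bound(scores: List[float], z: float = 1.0) -> float:
    """Mean score minus z standard errors across evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two evaluation runs per prompt")
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.mean(scores) - z * std_err


def select_prompt(runs: Dict[str, List[float]], z: float = 1.0) -> str:
    """Pick the prompt whose lower confidence bound is highest."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], z))


# Made-up scores: prompt A has the higher mean but swings widely between runs.
runs = {
    "A": [0.90, 0.60, 0.92, 0.58],
    "B": [0.74, 0.75, 0.73, 0.74],
}
best_by_lcb = select_prompt(runs)          # penalizes A's instability
best_by_mean = select_prompt(runs, z=0.0)  # z = 0 reduces to the plain mean
```

Setting `z` to zero recovers ordinary best-mean selection, so the same function covers both the stable and the unstable regime.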

prompt rankingllmevaluation

33 d agofound 12 d agoarXiv cs.AIAgents

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

The paper presents a reinforcement learning framework for Computer-Use Agents (CUAs) that utilizes autonomous vision-language evaluation as a scalable supervision signal, addressing the challenge of sparse reward signals in open-ended desktop environments. By modeling the imperfect feedback from a Vision-Language Model as a noisy binary reward channel, the authors implement a noise-corrected reward estimator for Proximal Policy Optimization, resulting in an average improvement of 12.6 percentage points in success rates over zero-shot performance. This approach highlights the potential of autonomous evaluation as a viable reward mechanism for training RL agents in graphical user interfaces, particularly when noise is accounted for in the reward estimation process.
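To make the noisy-binary-channel idea concrete, the sketch below uses the standard unbiased correction for a 0/1 signal with known flip rates. It assumes the evaluator's false-positive and false-negative rates are known or estimated separately. This is a textbook estimator shown for illustration; the summary does not state which estimator the authors plug into PPO.

```python
def corrected_reward(observed: int, false_pos: float, false_neg: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy 0/1 verdict.

    false_pos: P(evaluator says 1 | task actually failed)
    false_neg: P(evaluator says 0 | task actually succeeded)
    """
    if observed not in (0, 1):
        raise ValueError("observed verdict must be 0 or 1")
    denom = 1.0 - false_pos - false_neg
    if denom <= 0.0:
        raise ValueError("evaluator must beat chance: false_pos + false_neg < 1")
    return (observed - false_pos) / denom


def expected_corrected(true_reward: int, false_pos: float, false_neg: float) -> float:
    """Average the corrected reward over the evaluator's noise (unbiasedness check)."""
    p_one = 1.0 - false_neg if true_reward == 1 else false_pos
    return (p_one * corrected_reward(1, false_pos, false_neg)
            + (1.0 - p_one) * corrected_reward(0, false_pos, false_neg))
```

Individual corrected rewards can fall outside [0, 1], but their average over the evaluator's noise equals the true reward, which is the property a policy-gradient method needs.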

reinforcement-learninggui-agentsevaluation

33 d agofound 10 d agoarXiv cs.AIAgents

NoContactNoWorries: Estimating Contact through Vision and Proprioception for In-Hand Dexterous Manipulation

The article presents "NoContactNoWorries," a transformer-based multimodal framework designed to estimate binary contact states in robotic manipulation by integrating RGB-D vision with proprioceptive data. This approach addresses the limitations of traditional tactile sensors by enabling robots to infer contact through visual cues, thereby supporting downstream tasks such as in-hand object reorientation. Experimental validation in both simulation and real-world scenarios demonstrates the model's effectiveness and potential for enhancing dexterous manipulation capabilities in robotics.

roboticsmanipulationcontact estimation

33 d agofound 12 d agoarXiv cs.AIAgents

GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

A new benchmark study compares graphical user interface (GUI) agents with command-line interface (CLI) agents on 440 desktop tasks. The strongest GUI agent achieved a 59.1% full pass rate, while the best original-skill CLI agent reached 48.2%, with skill augmentation improving CLI success to 69.3%. This research highlights that GUI agents struggle with long-horizon workflows due to grounded interaction limitations, whereas CLI agents face challenges related to skill coverage and scalability, providing insights for practitioners developing AI agents in diverse execution environments.

guiclibenchmark

33 d agofound 12 d agoarXiv cs.AISafety

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

RIFT-Bench is a newly introduced methodology for dynamic red-teaming of agentic AI systems, leveraging a graph representation to facilitate unified security evaluations across diverse architectures. It operates in two automated phases: Discovery, which extracts system structure, and Scanning, which applies adaptive adversarial attacks, demonstrating effectiveness across 45 different agentic systems. This framework not only assesses vulnerabilities but also evaluates mitigation strategies, providing a scalable tool for practitioners focused on enhancing the security of autonomous decision-making systems.

red-teamingagentic-aisecurity

33 d agofound 12 d agoarXiv cs.AIResearch

Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

Neuro-Symbolic Drive introduces a novel framework for driving Vision-Language-Action models (VLAs) that integrates rule-grounded reasoning from classical planners with Chain-of-Thought (CoT) reasoning. By fine-tuning the Qwen3.5-4B model using structured reasoning traces from rule-based planners, the framework achieves significant performance improvements on a simulator-generated benchmark, reducing Average Displacement Error (ADE) from 0.47 to 0.26 and miss rates from 8.30% to 6.40% with three-camera perception. This approach enhances the causal connection between reasoning and motion generation, offering a structured supervision method that could benefit practitioners in developing more reliable and interpretable driving AI systems.

neuro-symbolicdrivingai

33 d agofound 12 d agoarXiv cs.AIAgents

OmniPath: A Multi-Modal Agentic Framework for Auditing Wheelchair Accessibility

OmniPath is a newly announced framework designed to enhance wheelchair accessibility by integrating OpenStreetMap's network topology with high-density aerial LiDAR data to produce a detailed 3D model of pedestrian environments. The system analyzes surfaces in 0.5 meter increments to quantify physical friction points and assess compliance with ADA standards, categorizing hazards based on severity with an F1-score of 0.60 for severe and 0.58 for critical issues. This proactive auditing approach allows for the identification of accessibility challenges, transforming static maps into dynamic, actionable data for wheelchair users.

auditingaccessibilityenvironmental analysis

33 d agofound 10 d agoarXiv cs.AIResearch

JEDEL: Zero-Shot DNA-Encoded Library Design for Early-Stage Drug Discovery

JEDEL is a novel framework for generating synthesis-ready DNA-encoded libraries (DELs) from three-dimensional pharmacophore representations, allowing for the design of targeted libraries with potentially millions of molecules. It uniquely maps pharmacophore interaction patterns to practical synthesis instructions using purchasable building blocks and validated reactions, ensuring outputs are experimentally realizable. Evaluated across 18 protein targets, JEDEL demonstrates superior performance in predicted binding affinity and sample efficiency compared to traditional random and diversity-based approaches, marking a significant advancement in drug discovery methodologies.

drug_discoverygenerative_modelsmolecular_design

33 d agofound 12 d agoarXiv cs.AIResearch

Critique of Agent Model

The article critiques the concept of agency in AI, particularly in the context of Large Language Models marketed as "agents." It presents the Goal-Identity-Configurator (GIC) architecture, which integrates hierarchical goal decomposition, identity evolution, simulative reasoning from a trained world model, and self-regulation to create a general-purpose agent model capable of true autonomy. This work is significant for AI practitioners as it clarifies the distinctions between engineered workflows and systems with endogenous capabilities, emphasizing the importance of internalized structures for developing autonomous AI systems while ensuring human oversight and safety.

agent-modelagencyllm

33 d agofound 12 d agoarXiv cs.AIResearch

The Geometry Behind Diffusion and Flow Matching: Gradient Flows and Geodesics in Wasserstein Space

The paper presents a unified geometric framework for understanding diffusion models and Flow Matching within the context of Wasserstein space, specifically $\mathcal{P}_2(\mathbb{R}^d)$. It establishes that diffusion models can be viewed as gradient flows of free energy, utilizing the Fokker-Planck equation and the JKO scheme, while Flow Matching operates along geodesics defined by the Benamou-Brenier formula. This integration of both approaches on a single Riemannian manifold highlights their relationship, allowing for more efficient sampling in generative processes by treating them as deterministic ODEs along optimal paths.
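For orientation, the standard textbook forms of the three objects named above are written out below. The potential $V$, the step size $\tau$, and the unit diffusion coefficient are notational assumptions made here; the paper's own conventions may differ.

With free energy $F(\rho) = \int V \rho \, dx + \int \rho \log \rho \, dx$, the Fokker-Planck equation is the $W_2$ gradient flow of $F$:

$$\partial_t \rho = \nabla \cdot (\rho \nabla V) + \Delta \rho .$$

The JKO scheme discretizes this flow in time with step $\tau$:

$$\rho^{k+1} = \arg\min_{\rho \in \mathcal{P}_2(\mathbb{R}^d)} \; F(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho^k) .$$

The Benamou-Brenier formula expresses the squared distance as a minimal kinetic energy over paths $(\rho_t, v_t)$ joining $\mu_0$ to $\mu_1$:

$$W_2^2(\mu_0, \mu_1) = \min_{(\rho, v)} \int_0^1 \!\! \int |v_t(x)|^2 \rho_t(x) \, dx \, dt \quad \text{subject to} \quad \partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \;\; \rho_0 = \mu_0, \;\; \rho_1 = \mu_1 .$$

The minimizing path is the geodesic between $\mu_0$ and $\mu_1$, which is the sense in which Flow Matching is described as moving along geodesics.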

geometrydiffusiongradient flowsWasserstein

33 d agofound 12 d agoarXiv cs.AISafety

VeryTrace: Verifying Reasoning Traces through Compilable Formalism and Structured Verification

VeryTrace is a new framework for verifying multi-step reasoning in language models, addressing the fragility of Chain-of-Thought (CoT) prompting by formalizing reasoning traces into a structured, compilable representation using a Domain-Specific Language (DSL). It combines deterministic checks for computational correctness with targeted audits from large language models (LLMs) to enable error localization and repair, showing improved accuracy across domains like competition mathematics, robotics planning, and kinship reasoning without the need for domain-specific training. This advancement is significant for practitioners as it enhances the reliability of LLM outputs by mitigating logical errors and hallucinations in reasoning processes.

verificationreasoningerror localizationLLM

33 d agofound 12 d agoarXiv cs.AIResearch

Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers

The paper presents a hybrid predictive model that combines ensemble feature selection using ANOVA and mutual information with Harris Hawks optimization-tuned logistic regression to predict mental health risks in female sex workers (FSWs). The model achieved an accuracy of 95.78%, an F1 score of 95.77%, and an AUC of 0.96 when tested on a dataset of 3,005 FSWs, outperforming traditional classifiers. This approach leverages explainable AI (XAI) to identify key trauma factors, enabling targeted psychosocial care and early intervention for vulnerable populations, thus advancing the application of machine learning in mental health risk assessment.

mental healthexplainable AIfeature selectionmachine learning

33 d agofound 10 d agoarXiv cs.AICoding

JupOtter: Cell-Level Bug Detection in Jupyter Notebooks

JupOtter is a newly introduced bug detection system tailored for Jupyter Notebooks, featuring a specialized tokenization strategy that maintains cell structure and a cell-level bug prediction technique. It utilizes the OtterDataset, which includes over 21,000 annotated notebooks for fine-grained bug detection, achieving F1 scores that outperform both static analyzers and large language models in two out of three benchmark datasets. This tool is significant for practitioners as it enhances the reliability of complex notebook-based applications by enabling more effective identification of bugs at the cell level.

jupyter_notebooksbug_detectionai_tools

33 d agofound 12 d agoarXiv cs.AIResearch

Tractable Reasoning and Conjunctive Query Answering for Defeasible DL-Lite under Rational Closure

This paper presents a novel plug-in architecture for efficient reasoning and conjunctive query answering under Rational Closure (RC) in the DL-Lite family of description logics. It demonstrates that both instance checking and CQ answering can be performed with minimal computational overhead by leveraging existing classical reasoners. This development is significant for practitioners as it enhances the capability of lightweight description logics to handle defeasible knowledge, thereby improving the efficiency of knowledge representation systems.

description logicsreasoningnon-monotonic

33 d agofound 10 d agoarXiv cs.AIResearch

Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios

The paper introduces a novel meta-learning strategy called MEDIC (dualistic MEta-learning with joint DomaIn-Class matching) aimed at enhancing domain generalization in open set scenarios, where label mismatches occur between source and target domains. MEDIC employs implicit gradient matching to optimize decision boundaries for both domains and classes, addressing the imbalance in sample distribution that affects traditional one-vs-all classifiers. Experimental results demonstrate that MEDIC outperforms existing methods in open set scenarios while retaining competitive performance in closed set generalization, making it a valuable approach for practitioners dealing with unseen classes in real-world applications.

meta_learningdomain_generalizationopen_set

33 d agofound 12 d agoarXiv cs.AIAgents

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

FlowR2A introduces a novel approach to multimodal driving planning by integrating scoring-based and anchor-based methods through a generative model that learns reward-conditioned action distributions. Utilizing a flow-matching decoder, it leverages dense trajectory-reward pairs to enhance the correlation between actions and their outcomes across multiple dimensions, including safety and compliance. This model achieves state-of-the-art performance on NAVSIM v1 and v2 benchmarks, offering high-quality proposals and improved sampling control, which is crucial for practitioners developing robust AI-driven driving systems.

driving planningreward distributionmultimodal

33 d agofound 10 d agoarXiv cs.AIResearch

Uncertainty-Aware Longitudinal Forecasting of Alzheimer's Disease Progression Using Deep Learning

A new probabilistic framework for forecasting Alzheimer's disease progression has been proposed, incorporating a Temporal Fusion Transformer with a CORAL ordinal output layer and an autoregressive Mixture Density Network to generate five-year probabilistic trajectories for various clinical metrics. This model outperforms existing linear, recurrent, and transformer baselines, particularly in distinguishing between mild cognitive impairment and dementia, achieving approximately 90% credible interval coverage while effectively separating aleatoric from epistemic uncertainty. This advancement is significant for practitioners as it enhances the reliability of long-term predictions in clinical settings, providing deeper insights into the dynamics of disease progression and uncertainty management.

deep learningalzheimerforecasting

33 d agofound 10 d agoarXiv cs.AICoding

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

The article introduces Implicit Visual Chain-of-Thought (IV-CoT), a novel framework designed to enhance structure-aware text-to-image generation by decomposing visual conditioning into a structural-to-semantic cascade. This method utilizes training-only sketch supervision to guide structural queries, enabling the generation of a latent visual plan that informs the rendering of appearance, thus improving performance on benchmarks like GenEval and T2I-CompBench. IV-CoT's architecture allows for implicit chain-of-thought reasoning in a single forward pass, making it a significant advancement for practitioners focused on precise object and layout representation in generated images.

llmquestionsunstructured-data

33 d agofound 10 d agoarXiv cs.AIInference

Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring

The paper presents a novel approach to zero-shot test-time canonicalization that addresses the misclassification of inputs transformed by affine operations in pretrained vision models. By reframing canonicalization as out-of-distribution (OOD) detection, the authors explore various OOD scoring functions and optimization algorithms, finding that distance-based scores combined with random search and local refinement yield the best performance across diverse benchmarks. This method allows practitioners to improve model robustness without altering the classifier architecture or retraining, thus preserving in-distribution accuracy while enhancing performance on transformed inputs.

canonicalizationood detectionvision models

33 d agofound 12 d agoarXiv cs.AIResearch

Prob-BBDM: a Probabilistic Brownian Bridge Diffusion Model for MRI sequence image-to-image translation

The article introduces the Probabilistic Brownian Bridge Diffusion Model (Prob-BBDM), a novel approach for synthesizing MRI sequences from 2D axial slices using a variational encoder-guided diffusion mechanism. Evaluated on the BraTS 2021 dataset, Prob-BBDM achieves up to 88.46% SSIM and 26.09 dB PSNR with only 4 diffusion steps, demonstrating computational efficiency and high-quality synthesis. This model's ability to maintain diagnostic information while enhancing multi-modal image analysis could significantly improve clinical workflows in medical imaging.

image-synthesismridiffusion-modelsai

33 d agofound 10 d agoarXiv cs.AIResearch

Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization

The article presents the Claim-Anchored Multi-document Summarization (CAMS) framework, which enhances multi-document summarization by providing fine-grained attribution and reducing hallucination in large language models (LLMs). CAMS operates through a modular Extract-Select-Rewrite process that extracts atomic claims with token-level provenance, clusters them, and rewrites summaries with clear links to source documents, achieving significant improvements in faithfulness and citation precision—lifting multi-source attribution accuracy by approximately 66%. This framework is crucial for practitioners as it offers a structured approach to ensure factual integrity in generated summaries, addressing common issues with traditional end-to-end LLMs.

multi_document_summarizationllm

33 d agofound 10 d agoarXiv cs.AITraining

OpenThoughts-Agent: Data Recipes for Agentic Models

The OpenThoughts-Agent (OT-Agent) project has introduced a comprehensive data curation pipeline aimed at enhancing the training of agentic language models. By conducting over 100 controlled ablation experiments, the team assembled a dataset of 100,000 examples, fine-tuning the Qwen3-32B model, which achieved an average accuracy of 44.8% across seven benchmarks, surpassing the previous best open model, Nemotron-Terminal-32B, by 3.9 percentage points. This release, including the training sets and experimental data, provides valuable resources for practitioners aiming to develop more capable and generalizable agentic models.

agentic modelsdata curationfine-tuning

33 d agofound 12 d agoarXiv cs.AITraining

Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation

The article presents a novel multi-objective reinforcement learning framework called Semantic Pareto-DQN for improving recommender systems. By formalizing the recommendation process as a semantic multi-objective Markov decision process, it employs a Pareto-DQN agent that optimizes for engagement, diversity, and fairness without aggregating these objectives into a single reward signal. Empirical results from the MovieLens dataset demonstrate that this approach enhances societal objectives while maintaining user engagement, offering a significant advancement for practitioners aiming to build responsible AI systems that mitigate filter bubbles.
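Keeping objectives separate instead of summing them into one reward usually rests on Pareto dominance between vector-valued estimates. The sketch below filters candidate actions to the non-dominated set, assuming each action has a Q-vector ordered as (engagement, diversity, fairness) with higher being better. The action names and numbers are made up, and how the paper's agent picks among non-dominated actions is not covered by the summary.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(q_values: Dict[str, Sequence[float]]) -> List[str]:
    """Names of actions whose Q-vectors no other action dominates."""
    return [
        name
        for name, q in q_values.items()
        if not any(
            dominates(other, q)
            for other_name, other in q_values.items()
            if other_name != name
        )
    ]


# Hypothetical Q-vectors: (engagement, diversity, fairness).
candidates = {
    "blockbuster": (0.9, 0.2, 0.3),
    "niche": (0.6, 0.8, 0.7),
    "stale": (0.5, 0.1, 0.2),
}
front = pareto_front(candidates)
```

Here "stale" drops out because "blockbuster" beats it on every objective, while "blockbuster" and "niche" both survive because each wins on at least one objective.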

reinforcement learningrecommendationmulti-objectiveDQN

33 d agofound 10 d agoarXiv cs.AIResearch

Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

The study presents a computational analysis of Tang-dynasty poetry to predict the geographic origins of poets based on linguistic features. By constructing a corpus of 357 poets and employing character n-gram TF-IDF alongside interpretable features, the authors achieved a classification accuracy of 69% for broad regional origins, surpassing the majority baseline of 53%. The findings highlight the influence of geographic and temporal factors on poetic language, revealing a distance-decay effect and historical shifts in regional styles, while demonstrating that traditional methods like TF-IDF can effectively capture these linguistic signals, suggesting implications for the use of machine learning in literary analysis.

linguistic analysispoetryclassification
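
For readers unfamiliar with the two ingredients named in the summary above, the sketch below shows character n-gram TF-IDF features and a majority-class baseline in plain Python. It is illustrative only: the toy strings, labels, the bigram setting, and the smoothed IDF formula are assumptions, not the authors' corpus or configuration.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs: List[str], n: int = 2) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors over character n-grams, using a smoothed IDF."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter(gram for c in counts for gram in c)
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (k / total) * (math.log((1 + num_docs) / (1 + doc_freq[gram])) + 1)
            for gram, k in c.items()
        })
    return vectors


def majority_baseline(labels: List[str]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Toy stand-ins for poems and regional labels (hypothetical data).
vecs = tfidf_vectors(["abab", "abcd"])
baseline = majority_baseline(["north", "north", "south"])
print(vecs[1], baseline)
```

N-grams that occur in fewer documents receive a larger weight, which is what lets such features pick up region-specific character patterns. A classifier is only informative to the extent that it beats the majority baseline, which is why the summary reports 69% against 53%.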

33 d agofound 12 d agoarXiv cs.AIResearch

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

The article announces the release of PHANTOM, a large-scale open-source dataset consisting of 47,524 pre-generated adversarial samples specifically designed for evaluating vision-language models (VLMs). Covering 10 high-level categories and 55 subcategories of harmful intents, the dataset consolidates existing benchmarks and introduces new categories to enhance the evaluation of model robustness and safety. This resource aims to facilitate systematic assessments of VLMs under adversarial conditions, supporting researchers in fine-tuning attack-generation models and developing defensive strategies.

adversarial-attacksvision-language-modelsdataset

33 d agofound 10 d agoarXiv cs.AISafety

Red-Teaming the Agentic Red-Team

This article presents a comprehensive security analysis of widely used agentic systems in offensive security, revealing common design vulnerabilities that allow adversaries to exfiltrate API keys and compromise operators' machines, even in sandboxed environments. The authors introduce a cyber kill chain specific to these systems, detailing the stages from LLM manipulation to sandbox escape. They propose a robust architectural framework and actionable design principles aimed at mitigating identified attack vectors, which is crucial for practitioners developing secure agentic tools in AI.

securityagentsred-teaming

33 d agofound 12 d agoarXiv cs.AIRAG

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

ReMMD introduces a framework for multimodal misinformation detection that addresses the limitations of existing benchmarks by incorporating realistic scenarios with multilingual narratives and multiple images. The framework includes ReMMDBench, a benchmark with 500 samples and various veracity and distortion labels, and ReMMD-Agent, which utilizes persistent memory to improve evidence verification and achieve a five-way veracity accuracy of 41.80% using GPT-5.2, while significantly reducing operational costs compared to previous agents. This advancement is crucial for practitioners as it enhances the detection of complex misinformation across diverse formats and languages.

multimodalmisinformation detectionverificationframework

33 d agofound 10 d agoarXiv cs.AIMultimodal

DramaDirector: Geometry-Guided Short Drama Generation

DramaDirector is a geometry-guided framework designed for generating short dramas by transforming global plots and local contexts into visually grounded multi-shot videos. It utilizes schema-constrained supervised fine-tuning (SFT) and geometry-reinforced planning optimization (GRPO) to decouple static visual and dynamic narrative conditions, enhancing first-frame generation and image-to-video synthesis through depth-pose references. The framework is evaluated against a newly introduced benchmark, DramaBoard, consisting of 35 live-action dramas and 81K shots, demonstrating improved performance in faithfulness, consistency, and controllability over existing multi-agent and video generation baselines, making it a significant advancement for practitioners in video generation and narrative-driven AI applications.

drama generationgeometry-guidedvideo synthesis

33 d agofound 10 d agoarXiv cs.AIAgents

PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation

PixJail is a newly proposed framework designed for reproducible evaluation of Text-to-Image (T2I) jailbreak techniques, addressing the challenges of pipeline-level testing across multiple stages such as prompt transformation and safety filtering. The framework constructs paper-specific attack modules and evaluation pipelines, achieving an average reproduction error of 2.1% across eleven T2I jailbreak methods. This tool is significant for AI practitioners as it streamlines the reproduction process and enhances the reliability of benchmark comparisons in the rapidly evolving field of T2I jailbreaks.

text-to-imagejailbreakevaluationagents

33 d agofound 10 d agoarXiv cs.AIAgents

SAFARI: Scaling Long Horizon Agentic Fault Attribution via Active Investigation

SAFARI (Scaling long-horizon Agentic Fault AttRibution via active Investigation) is a new framework designed to enhance fault attribution in autonomous agents by utilizing a tool-augmented diagnostic loop, which allows for reading and searching trajectory segments alongside a persistent Short-Term Memory (STM). This approach decouples diagnostic accuracy from the limitations of LLM context windows, achieving a 20% improvement on the Who&When dataset and a 19% improvement on the TRAIL GAIA subset within specified token budgets. SAFARI maintains a precision of 0.58 even when diagnosing faults located 5x beyond the model's native context window, addressing a critical challenge in multi-step, multi-agent task execution.

multi-agent systemsfault attributiondiagnostics

33 d agofound 10 d agoarXiv cs.AIResearch

Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

The authors present a novel evaluation framework for Sparse Autoencoders (SAEs) that quantifies the semantic alignment between SAE latents and human-annotated concepts, using a new method called Fully-Binary Matching Pursuit (FBMP). They introduce synthetic benchmarks, synCUB and synCOCO, to facilitate targeted attribute perturbations and propose the Targeted Attribute Perturbation Alignment Score (TAPAScore) to assess the interpretability of SAEs trained on CLIP and DINOv2 embeddings. This framework enables practitioners to better evaluate and optimize SAEs for interpretability, suggesting that moderate dictionary sizes yield the best performance in aligning concepts with human understanding.

autoencodersinterpretabilityevaluation

33 d agofound 12 d agoarXiv cs.AIAgents

The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents

The paper introduces the Latent Bridge, a novel continuous communication channel that enhances the interaction between a slow reasoning VLM (Qwen3-VL-8B-Thinking) and a fast reactive VLM (MiniCPM-o 4.5) by projecting the slow model's residuals into the fast model's input-embedding space, eliminating the need for text round-trips. Evaluated on 7 Atari games and a driving domain (MetaDrive), the Latent Bridge outperforms the traditional Text Bridge in several cases, notably improving performance in MsPacman by 57% and RoadRunner by 28%. This development is significant for practitioners as it offers a method to optimize real-time decision-making in AI agents, particularly in environments where latency and planning quality are critical.

game-agentsreal-timeplanning

33 d agofound 12 d agoarXiv cs.AISafety

When Helpfulness Overrides Causal Caution: Context-Dependent Suppression and Recovery in LLMs

This study examines the phenomenon of Causal Caution in large language models (LLMs) when transitioning from academic to practical advisory contexts. It evaluates four high-performance models—Claude Sonnet 4.6, Claude Opus 4.7, GPT 5.5, and Gemini 3.1 Pro—across 480 trials, revealing that Causal Caution maintenance dropped significantly from 91.7-100% in academic settings to 6.7-18.3% in practical contexts. The research highlights the importance of context in LLM responses and suggests that implementing multi-agent architectures to separate proposal generation from causal auditing could enhance governance and decision-making processes in AI applications.

causal-reasoningllmdecision-support

33 d agofound 10 d agoarXiv cs.AIResearch

From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control

The article introduces MERA, a meta-cognitive reasoning framework designed for Large Reasoning Models (LRMs) that decouples reasoning from control to enhance efficiency and accuracy. MERA utilizes a takeover-based pipeline to generate high-quality reasoning-control supervision data and employs Control-Segment Policy Optimization (CSPO) for training, allowing for independent optimization of control strategies. This approach addresses the issue of redundant reasoning in LRMs, reducing inference costs and latency, which is crucial for practical deployment in AI applications.

meta-cognitionreasoningllm

33 d agofound 10 d agoarXiv cs.AISafety

AutoSpec: Safety Rule Evolution for LLM Agents via Inductive Logic Programming

AutoSpec is a framework that evolves safety rules for large language model (LLM) agents through counterexample-guided inductive synthesis (CEGIS) using inductive logic programming (ILP). It iteratively refines expert-designed safety rules based on user annotations, achieving F1 scores of 0.98 and 0.93 on two domains and a 94% reduction in false positives while maintaining high recall. This approach is significant for practitioners as it produces interpretable, auditable rules that effectively balance safety and operational flexibility in LLM applications.

llmsafetyinductive logic programming

33 d agofound 10 d agoarXiv cs.AITraining

Breaking Shortcut Learning for Cross-Trial EEG-Guided Target Speech Extraction via Two-Stage Training

The article introduces TRUST-TSE, a two-stage framework designed to improve generalization in EEG-guided target speech extraction by mitigating shortcut learning associated with trial-specific EEG structures. It employs contrastive pretraining with negative sampling and a confidence-weighted extraction objective to enhance EEG-speech alignment and suppress trial-identity cues. Experimental results on KUL and DTU datasets demonstrate that TRUST-TSE significantly outperforms existing end-to-end models under cross-trial evaluation, offering a more reliable solution for neuro-steered hearing technologies.

shortcut learningspeech extractionneuro-steered

33 d agofound 12 d agoarXiv cs.AIProducts

A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

RaDaR (Rare Disease navigatoR), a new open-source reasoning large language model with 32 billion parameters, was developed to enhance the diagnosis of rare diseases. It was trained on a dataset of 49,170 real cases and 104,666 synthetic cases, demonstrating superior performance over larger models like DeepSeek-R1 in public benchmarks and clinical validations. RaDaR's integration into clinical practice improved diagnostic accuracy by 21.44 percentage points in a randomized trial, highlighting its potential to significantly reduce lead times in rare disease diagnosis, thus offering a valuable tool for practitioners facing data scarcity in this domain.

rare-diseasellmdiagnosis

33 d agofound 10 d agoarXiv cs.AIProducts

Agon: An Autonomous Large-Scale Omnidisciplinary Research System Built on Prompt Economy

Agon is introduced as an autonomous research orchestrator that leverages large language models to enhance the scalability of research production by focusing on validating claims within a structured workflow. It operates on principles such as Prompt Economy and Massive Parallelism, successfully executing 444 iterations across various domains without human-written code, while revealing a new taxonomy of failures based on severity and fixability. This development is significant for AI practitioners as it suggests a shift towards a collaborative model where machines handle scalable tasks while humans provide necessary oversight, potentially transforming research methodologies.

large language modelsresearch orchestrationscalability

33 d agofound 10 d agoarXiv cs.AIAgents

Offline Reinforcement Learning for Warehouse SLAM Throughput Control

The article presents an offline reinforcement learning framework aimed at optimizing SLAM (Scan/Label/Apply/Manifest) throughput control in warehouse environments. Key technical details include the use of a history-informed state representation, action space abstraction for delayed-impact control, and a reward function that incorporates both upstream and downstream metrics. The framework integrates multiple offline RL algorithms, with empirical results showing that the CQL policy improves system health by 22.97% and reduces average throttling duration by 3.18%, highlighting the effectiveness of offline RL in enhancing operational efficiency in warehouse settings.

reinforcement_learningwarehouseslam

33 d agofound 10 d agoarXiv cs.AIAgents

A Unified Framework for Runtime Verification and Model-Based Diagnosis in LOLA

The article introduces a unified framework that integrates runtime verification and model-based diagnosis using the stream specification language LOLA. This framework allows for continuous online fault localization and detection by encoding system descriptions, health states, and observations within a single formalism, effectively handling both time-invariant and transient faults alongside nondeterministic observations. This development is significant for practitioners as it streamlines the fault management process in systems, reducing the need for separate toolchains and enhancing real-time diagnostics in AI applications.

runtime verificationdiagnosislola

33 d agofound 12 d agoarXiv cs.AIAgents

SP-Mind: An Autonomous Reasoning Agent for Spatial Proteomics Analysis

SP-Mind is introduced as the first autonomous AI agent specifically designed for spatial proteomics analysis, streamlining the process from raw multiplexed tissue imaging to phenotype discovery without requiring task-specific fine-tuning. It utilizes expert-curated biological analysis skills and specialized computational tools, and its performance is rigorously evaluated using SP-Bench, a benchmark consisting of 102 tasks across 18 categories, where SP-Mind demonstrates state-of-the-art results compared to existing biomedical agent baselines. This development is significant for practitioners as it enhances scalability and reproducibility in spatial proteomics research, facilitating more efficient analysis workflows in precision medicine.

proteomicsAI agentworkflow

33 d agofound 12 d agoarXiv cs.AIAgents

ATRIA: Adaptive Traceable ECG Reporting with Iterative Agents

ATRIA is a multi-agent ECG reporting system designed to enhance clinical ECG report generation by decoupling interpretation and reporting, allowing for iterative context integration and bidirectional editing. It binds report claims to supporting evidence, flags unsupported statements, and enables clinicians to verify and revise findings, thereby reducing error propagation. Its architecture leverages existing ECG analysis models and is available as a cloud-based web service, making it ready for immediate deployment in clinical settings.

ecg-reportingmulti-agent-systems

33 d agofound 10 d agoarXiv cs.AIRAG

A Benchmark for Hallucination Detection in VLMs for Gastrointestinal Endoscopy

This study introduces the Gut-VLM dataset, a benchmark for hallucination detection in vision-language models (VLMs) specifically for gastrointestinal endoscopy, comprising 4,392 test VQA pairs evaluated across five models: MedGemma-4B, MedGemma-27B, LLaVA-Med-7B, LLaVA-v1.6-7B, and Lingshu-32B. The evaluation of nine detection methods reveals that ReXTrust, a white-box method, achieves the highest average AUC of 93.0 on MedGemma-4B, significantly outperforming alternatives, while highlighting the challenge of "confident confabulation" as a common failure mode. This benchmark is crucial for practitioners as it addresses the safety concerns of deploying VLMs in clinical settings, providing insights into effective detection strategies.

hallucination detectionvlmsgastrointestinal

33 d agofound 10 d agoarXiv cs.AIMultimodal

Audio-visual Contrastive Alignment for Diffusion-based Visual-conditioned Speech Enhancement

The paper presents a novel approach to audio-visual speech enhancement (AVSE) by integrating a contrastive audio-visual loss into a diffusion-based model that utilizes cross-attention for visual conditioning. This method enhances the model's ability to leverage visual cues, resulting in improved interference suppression and signal reconstruction, particularly in low signal-to-noise ratio (SNR) scenarios. The findings are significant for practitioners as they demonstrate a method to enhance speech recovery in challenging auditory environments, with the code made available for further exploration.

audio-visualspeech enhancementdiffusion

33 d agofound 10 d agoarXiv cs.AIAgents

Bayesian control for coding agents

A new approach to orchestration in coding agents using Bayesian control has been proposed, where a Bayesian controller dynamically manages tool-use decisions based on a belief over candidate correctness. This method was evaluated across six LLM generators and nine coding benchmarks, demonstrating superior performance in scenarios where verification is costly and critics provide informative yet imperfect feedback. The belief state produced by this controller offers an interpretable correctness score that surpasses traditional metrics like token probability and raw tool success, enhancing uncertainty quantification for practitioners in AI development.

codingbayesianagents
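
The summary above describes a controller that keeps a belief over whether a candidate solution is correct and updates it from imperfect critic feedback. A minimal sketch of that idea is a Bayes update with an assumed critic error model. The critic rates `tpr` and `fpr`, the thresholds, and the action names are hypothetical placeholders, and the paper's actual decision rule may differ.

```python
def update_belief(prior: float, critic_passes: bool,
                  tpr: float = 0.8, fpr: float = 0.3) -> float:
    """Posterior probability that the candidate is correct after one critic verdict.

    tpr: assumed P(critic passes | candidate correct)
    fpr: assumed P(critic passes | candidate wrong)
    """
    like_correct = tpr if critic_passes else 1.0 - tpr
    like_wrong = fpr if critic_passes else 1.0 - fpr
    evidence = like_correct * prior + like_wrong * (1.0 - prior)
    return like_correct * prior / evidence


def choose_action(belief: float, accept_at: float = 0.9,
                  reject_at: float = 0.1) -> str:
    """Threshold policy (an assumption): stop when confident, otherwise gather evidence."""
    if belief >= accept_at:
        return "accept"
    if belief <= reject_at:
        return "regenerate"
    return "query_critic"


belief = 0.5  # uninformative prior over correctness
history = []
for verdict in (True, True, True):  # three consecutive critic passes
    belief = update_belief(belief, verdict)
    history.append(choose_action(belief))
print(round(belief, 3), history)
```

Because the belief is an explicit probability, it can be reported directly as a correctness score, which is the interpretability benefit the summary mentions.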

33 d agofound 10 d agoarXiv cs.AIProducts

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

The Sol Video Inference Engine is a new framework designed for optimizing video diffusion models through a training-free, agent-native approach. It integrates five techniques—cache, sparse attention, token pruning, quantization, and kernel fusion—into a customizable acceleration stack, achieving over 2x end-to-end acceleration while preserving near-lossless quality across models of varying sizes, including 64B Cosmos3-Super, 22B LTX-2.3, and 2B SANA-Video. This framework addresses the challenges of instance-specific optimization in video generation, making it a valuable tool for practitioners aiming to enhance performance in diverse hardware and model configurations.

video generationaccelerationdiffusion models

33 d agofound 10 d agoarXiv cs.AIResearch

Ten Digits on a Train: AI-Assisted Verification of Two Eigenvalue Problems

The article presents a human-AI collaboration that successfully verified numerical eigenvalues in challenging settings, specifically achieving ten decimal places of accuracy for a singular self-adjoint Schrödinger operator and resolving a non-normal atom-molecule resonance pair. The latter was accomplished by reformulating the problem into a global matching system for projective solution lines, utilizing a Krawczyk-Brouwer inclusion for certification. This work highlights the potential of AI in enhancing mathematical verification processes while emphasizing the necessity for rigorous standards in validation and the critical role of human oversight in mathematical proofs.

eigenvalue_problemshuman_ai_collaboration

33 d agofound 10 d agoarXiv cs.AITraining

Weight-Space Geometry of Offline Reasoning Training

The paper analyzes six offline reasoning-training methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) using a shared model (Qwen3-4B) and attention-only LoRA to compare their weight-space geometry and performance on downstream tasks. The findings indicate that SFT, RFT, and RIFT yield similar weight updates and comparable accuracy (87-88% on GSM8K), while DPO achieves the highest accuracy (93.5%) but requires a significantly smaller learning rate, suggesting that optimizer and loss function choices critically influence performance. This research provides insights into the mechanistic differences between methods, which practitioners can use to choose training strategies for offline reasoning training.

reinforcement learningoffline trainingreasoning

33 d agofound 10 d agoarXiv cs.AISafety

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

The article presents a study on self-generated text recognition (SGTR) finetuning as a method to prevent and reverse emergent misalignment (EM) in large language models (LLMs), specifically tested on GPT-4.1, Qwen2.5-32B-Instruct, and Seed-OSS-36B-Instruct. The experiments demonstrate that SGTR finetuning effectively reduces misalignment without degrading model performance, contrasting with traditional benign finetuning methods. This research is significant for practitioners as it provides a targeted intervention strategy to enhance model integrity and character alignment, addressing critical issues related to emergent misalignment in AI systems.

emergent misalignmentfine-tuningalignment

33 d agofound 10 d agoarXiv cs.AIAgents

Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

Themis is a newly announced explainable AI framework designed for Reinforcement Learning (RL) that integrates human feedback to enhance safety and transparency. It supports over 200 environments and allows for easy configuration of experiments, demonstrating the ability to train reward models that align closely with true reward signals based on human preferences. This framework is significant for practitioners as it provides a scalable, user-friendly platform for conducting RL experiments with large participant groups while ensuring robust alignment and explainability, addressing critical challenges in RL safety.

reinforcement learninghuman feedbackexplainable ai

33 d agofound 10 d agoarXiv cs.AIAgents

MuTRAP: Multi-trigger Trojans Attacking Robot Task Planning Systems

MuTRAP is introduced as the first multi-trigger Trojan attack targeting LLM-assisted robot task planning systems. It leverages a method that injects backdoors using a small set of task-specific parameters while optimizing multiple-trigger words for various robotic applications, demonstrating vulnerabilities in current LLM-based planners. This research highlights critical security implications for practitioners working with LLMs in robotics, emphasizing the need for enhanced security measures in AI-driven task planning.

roboticstask_planningsecurity

33 d agofound 10 d agoarXiv cs.AIRAG

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

This research introduces the first public multi-modal dataset of 100 aligned audio-transcript pairs specifically for Turkish phone call scams, addressing the scarcity of annotated data in low-resource languages. The study evaluates seven large language models, including Gemini 2.5, GPT-4o, and Qwen, across different input conditions, finding that transcript-based inputs consistently outperform direct audio processing. This work underscores the necessity for culturally inclusive AI safety measures and more effective multi-modal systems in combating fraud in underrepresented languages.

scam-detectionaudiollm

33 d agofound 10 d agoarXiv cs.AIProducts

ZONOS2 Technical Report

The ZONOS2 8B TTS model has been released, scaling from 1.6B to 8B parameters and using a mixture-of-experts (MoE) architecture to improve inference latency and throughput. The training dataset has been expanded from 200K to over 6M hours, and the model demonstrates competitive performance on quality, speaker similarity, and a new TTS benchmark, ZTTS1-Eval, while maintaining efficient streaming capabilities. This release provides practitioners with improved voice cloning fidelity and naturalness, along with accessible model weights and inference code under an Apache 2.0 license.

ttsmodelvoice cloning

33 d agofound 10 d agoarXiv cs.AIInference

Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

This study benchmarks lightweight transformer models (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) against traditional ML methods (Random Forest, XGBoost, SVM, Logistic Regression) for on-device fault detection across three datasets. Key findings indicate that TinyBERT-4L offers a favorable trade-off with a model size of 55 MB and a CPU inference latency of 18 ms, while INT8 quantization can reduce model size by 25% with minimal impact on classification performance (86.9% F1). The results underscore the challenges of deploying accurate models in resource-constrained environments, particularly in scenarios with extreme class imbalance.

fault detectiontransformersbenchmark

33 d agofound 10 d agoarXiv cs.AITraining

Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training

The paper introduces BBoxER, an evolutionary black-box optimization method for post-training large language models (LLMs) that avoids gradient exposure, addressing privacy and security concerns. BBoxER employs an information bottleneck and provides non-vacuous generalization bounds, demonstrating improved performance on reasoning benchmarks and robustness against membership inference and data poisoning attacks. This method offers a viable alternative for practitioners seeking to enhance LLM training in sensitive environments while ensuring strong theoretical guarantees.

black_boxLLMpost_training

33 d agofound 10 d agoarXiv cs.AIResearch

Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms

This study presents a systematic review of advancements in 3D scene completion over the past decade, highlighting the transition from voxel semantic completion models like SSCNet to contemporary approaches integrating generative diffusion priors and real-time rendering via Gaussian splatting techniques. It covers various representation paradigms, including voxel grids, point learning, implicit neural fields, transformer networks, and diffusion networks, while also proposing a taxonomy for better understanding of the field's evolution and outlining future research directions. This comprehensive analysis is crucial for practitioners looking to adopt or improve upon existing methodologies in 3D scene understanding and related applications in robotics and augmented reality.

3d scene completiondeep learningcomputer vision

33 d agofound 10 d agoarXiv cs.AIResearch

The impact of generative artificial intelligence on academic development of Chinese students in humanities and social sciences

This study investigates the impact of generative artificial intelligence (GenAI) on the academic development of humanities and social sciences students in China through a large-scale survey. Key findings indicate that while over half of the students reported enhanced motivation and academic performance due to GenAI, concerns about accuracy and overreliance persist, alongside a call for ethical considerations and improved privacy protections. The results highlight the need for thoughtful curricular integration of GenAI, emphasizing practice-oriented training to maximize its educational potential while addressing the diverse experiences and challenges faced by students.

generative aieducationacademic development

33 d agofound 10 d agoarXiv cs.AIMultimodal

CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning

CineCap is a newly proposed framework for cinematographic video captioning that integrates structured reasoning with spatio-temporal anchors and employs reinforcement learning to enhance caption comprehensiveness and accuracy. It addresses the challenges of inferring professional cinematographic concepts from visual cues and generating precise descriptions across multiple dimensions. The framework is evaluated using CineCap Bench, a new benchmark of 472 annotated video-caption pairs, demonstrating superior performance over existing models and setting a new state of the art in this domain. The code and model checkpoint are publicly accessible, facilitating further research and development in video understanding and generation.

video captioningcinematographystructured reasoning

33 d agofound 10 d agoarXiv cs.AIResearch

RetiSEM: Generalising Causal Models for Fragmented Biomedical Data

RetiSEM is a newly proposed domain-constrained structural equation modeling (SEM) framework designed for causal graph recovery and mediation analysis in fragmented biomedical data. It organizes variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes effects into total, natural direct, and natural indirect components. In evaluations across ten synthetic benchmarks and a real-world dataset, RetiSEM demonstrated lower structural error and improved causal accuracy compared to unconstrained baselines, making it a valuable tool for practitioners in biomedical AI facing incomplete data.

causal-modelsbiomedicaldata
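
The decomposition into total, natural direct, and natural indirect effects mentioned in the summary above can be made concrete with a toy linear, noise-free structural model with one exposure, one mediator, and one outcome. This is a sketch under those simplifying assumptions only: the coefficients `a`, `b`, `c_direct` and the function names are hypothetical, and RetiSEM's actual estimator is not described in the summary.

```python
from typing import Tuple


def mediator(x: float, a: float) -> float:
    """Structural equation M = a * X (noise omitted for clarity)."""
    return a * x


def outcome(x: float, m: float, c_direct: float, b: float) -> float:
    """Structural equation Y = c_direct * X + b * M (noise omitted)."""
    return c_direct * x + b * m


def decompose(a: float, b: float, c_direct: float,
              x0: float = 0.0, x1: float = 1.0) -> Tuple[float, float, float]:
    """Return (total, natural direct, natural indirect) effects of moving X from x0 to x1."""
    m0, m1 = mediator(x0, a), mediator(x1, a)
    y00 = outcome(x0, m0, c_direct, b)  # baseline world
    y10 = outcome(x1, m0, c_direct, b)  # change X, hold M at its baseline value
    y11 = outcome(x1, m1, c_direct, b)  # change X and let M respond
    return y11 - y00, y10 - y00, y11 - y10


total, nde, nie = decompose(a=0.5, b=2.0, c_direct=1.0)
print(total, nde, nie)
```

In this linear case the indirect effect equals `a * b` and the direct effect equals `c_direct`, so the total effect is their sum.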