AI News Digest

The Swiss Federal Supreme Court is evaluating Heretic

The Swiss Federal Supreme Court is assessing the Heretic model for its own applications, particularly to address the problem of LLMs refusing legitimate requests. A paper titled “Measuring & Mitigating Over-Alignment for LLMs in Multilingual Criminal Law Courts” explores solutions to this problem and reports that Heretic was effective in its analysis. The evaluation matters to practitioners because it points to progress on over-refusal in legal contexts.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #llm #alignment #court

The emergence of the web data infrastructure layer for AI

The article discusses the development of a web data infrastructure layer aimed at improving data accessibility and structure for AI applications. This infrastructure is crucial for enterprises seeking to utilize AI at scale, as it addresses the challenges posed by unstructured and blocked data that can hinder model performance. By enhancing data availability and organization, it enables practitioners to build more effective AI systems and leverage emerging use cases.

MIT Technology Review - AI | 32 d ago | found 12 d ago | #data-infrastructure #ai #web

Pangram CEO says language models give themselves away by making the same arguments

Pangram CEO Max Spero highlights a limitation of current language models, noting that while they produce coherent text, their arguments tend to cluster around similar points rather than exhibiting the diverse reasoning found in human discourse. This homogeneity in argumentation may serve as a distinguishing factor for detecting AI-generated content. Understanding this characteristic is crucial for practitioners aiming to enhance the robustness and variability of AI outputs in applications requiring nuanced reasoning.

The Decoder | 33 d ago | found 12 d ago | #language_models #arguments #diversity

MedPCFM: Improving Medical Point Cloud Completion by Integrating Point Transformers and Flow Matching

The article introduces MedPCFM, a novel approach for medical point cloud completion that integrates Point Transformers (PTv3) and flow matching techniques. The method demonstrates state-of-the-art generative performance on datasets such as SkullFix, SkullBreak, and the Mandibular Defect dataset, achieving significant improvements in throughput with up to a 7× speed-up compared to PVCNN. This work is crucial for AI practitioners as it enhances anatomical reconstruction efficiency and offers insights into scaling performance with varying model sizes and point resolutions.

arXiv cs.AI | 33 d ago | found 10 d ago | #medical #point cloud #completion

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

CALIBER (Calibration Before and After Reasoning) is a novel approach that improves confidence estimation in reasoning language models by distinguishing between pre- and post-answer confidence assessments. It reduces Expected Calibration Error (ECE) by 52.5% on the BigMathDigits dataset for a 7B model and achieves competitive results on a larger 30B model, demonstrating significant improvements in calibration, Brier score, and AUROC across various benchmarks, particularly under distribution shifts. This method is crucial for practitioners as it enhances the reliability of model outputs, especially in complex reasoning tasks, by aligning confidence estimates with the model's state of information.

arXiv cs.AI | 33 d ago | found 10 d ago | #confidence #reasoning #llm
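
For readers unfamiliar with the metric, Expected Calibration Error can be sketched as below. This is a generic equal-width-bin ECE, not CALIBER's own evaluation code; the function name, the bin count of 10, and the input format are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not given in the summary
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four answers and gets three right has an ECE of about 0.15 under this definition; lower is better.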

JEDEL: Zero-Shot DNA-Encoded Library Design for Early-Stage Drug Discovery

JEDEL is a novel framework for generating synthesis-ready DNA-encoded libraries (DELs) from three-dimensional pharmacophore representations, allowing for the design of targeted libraries with potentially millions of molecules. It uniquely maps pharmacophore interaction patterns to practical synthesis instructions using purchasable building blocks and validated reactions, ensuring outputs are experimentally realizable. Evaluated across 18 protein targets, JEDEL demonstrates superior performance in predicted binding affinity and sample efficiency compared to traditional random and diversity-based approaches, marking a significant advancement in drug discovery methodologies.

arXiv cs.AI | 33 d ago | found 10 d ago | #drug_discovery #generative_models #molecular_design

Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization

The article presents a novel framework for Multi-Hop Fact Verification that utilizes Structural Causal Models (SCM) and Group Relative Policy Optimization (GRPO) to enhance reasoning accuracy in Large Language Models (LLMs). By employing directed dependency graphs to model structural relationships between evidence and claims, the framework addresses the challenges of hallucinations and logical fragmentation in existing methods. Empirical results indicate that this approach not only improves performance over strong baselines on datasets like HoVer and EX-FEVER but also provides more traceable reasoning structures, which is critical for practitioners focusing on reliable AI systems.

arXiv cs.AI | 33 d ago | found 10 d ago | #multi-hop #reasoning #verification

A Survey on Federated Causal Discovery and Inference

The paper presents a comprehensive survey on Federated Causal Discovery (FCD) and Federated Causal Inference (FCI), addressing the challenges of conducting causal analysis with distributed data while adhering to privacy regulations. It organizes FCD methods by how structures are learned, how data are partitioned, and what structural knowledge is obtained, and categorizes FCI methods by target estimand and estimation strategy, including classical and deep generative approaches. This work is significant as it formalizes the relationship between FCD and FCI, proposing a unified pipeline that enhances causal reasoning in federated settings while identifying key areas for future research, such as privacy and communication efficiency.

arXiv cs.AI | 33 d ago | found 10 d ago | #federated learning #causal discovery #inference

Prob-BBDM: a Probabilistic Brownian Bridge Diffusion Model for MRI sequence image-to-image translation

The article introduces the Probabilistic Brownian Bridge Diffusion Model (Prob-BBDM), a novel approach for synthesizing MRI sequences from 2D axial slices using a variational encoder-guided diffusion mechanism. Evaluated on the BraTS 2021 dataset, Prob-BBDM achieves up to 88.46% SSIM and 26.09 dB PSNR with only 4 diffusion steps, demonstrating computational efficiency and high-quality synthesis. This model's ability to maintain diagnostic information while enhancing multi-modal image analysis could significantly improve clinical workflows in medical imaging.

arXiv cs.AI | 33 d ago | found 12 d ago | #image-synthesis #mri #diffusion-models #ai

MGI: Member vs Generated Inference

The article introduces the concept of Member vs Generated Inference (MGI), which addresses the challenge of distinguishing between samples from a generative model's training set and samples generated by the model itself. It presents a novel method called Data Circuit Breaker (DCB), which utilizes a three-stage approach combining signals from an autoencoder and latent generator, demonstrating effectiveness across various generative models, including image autoregressive and diffusion models. This advancement is significant for practitioners as it enhances the reliability of membership inference in scenarios where models may reproduce training data, thereby improving the security and integrity of generative AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #generative_models #membership_inference #data_security

Towards Federated Long-Tailed Graph Learning: An Energy-Guided Dual Decoupling Approach

The paper introduces FedEPD, a novel framework for Federated Graph Learning that addresses the challenges posed by long-tailed data distributions. It employs a dual decoupling approach to separate topological purification from semantic recalibration, utilizing distribution-aware Dirichlet energy pruning and a two-stage alternating optimization strategy. FedEPD achieves state-of-the-art performance, with improvements of up to 4.97% in accuracy and 5.48% in Macro-F1 across various long-tailed benchmarks, making it significant for practitioners dealing with imbalanced data in collaborative environments.

arXiv cs.AI | 33 d ago | found 12 d ago | #federated learning #graph learning #long-tailed

Token Complexity of Certifying Stochastic-Oracle Reliability

The paper introduces a framework for certifying the reliability of stochastic oracles, defining "certification token complexity" as the minimum expected token cost to distinguish between oracles that meet a specified reliability level and those that do not. It presents a Sequential Probability Ratio Test (SPRT)-based Stochastic-Oracle Turing Machine (SOTM) that effectively queries oracles and computes correctness scores while ensuring two-sided error guarantees. This work is significant for practitioners as it provides theoretical bounds on token complexity, informing the design of efficient certification processes in AI systems that rely on stochastic oracles.

arXiv cs.AI | 33 d ago | found 10 d ago | #stochastic_oracle #token_complexity
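
The sequential test underlying this kind of certification can be illustrated with a textbook Wald SPRT on pass/fail oracle answers. This is a minimal sketch, not the paper's SOTM: it counts queries rather than tokens, and the function name, the two reliability levels `p0` and `p1`, and the labels returned are assumptions.

```python
import math
from typing import Iterable, Tuple


def sprt_certify(
    outcomes: Iterable[bool],
    p0: float,           # assumed reliability of an oracle that should be rejected
    p1: float,           # assumed reliability of an oracle that should be certified
    alpha: float = 0.05,  # target probability of certifying a p0 oracle
    beta: float = 0.05,   # target probability of rejecting a p1 oracle
) -> Tuple[str, int]:
    """Wald's SPRT on Bernoulli outcomes; returns (decision, queries used)."""
    if not 0.0 < p0 < p1 < 1.0:
        raise ValueError("require 0 < p0 < p1 < 1")

    upper = math.log((1.0 - beta) / alpha)   # cross this to accept "reliable"
    lower = math.log(beta / (1.0 - alpha))   # cross this to accept "unreliable"
    step_ok = math.log(p1 / p0)              # log-likelihood ratio of a correct answer
    step_bad = math.log((1.0 - p1) / (1.0 - p0))  # ... of an incorrect answer

    llr = 0.0
    used = 0
    for ok in outcomes:
        used += 1
        llr += step_ok if ok else step_bad
        if llr >= upper:
            return "reliable", used
        if llr <= lower:
            return "unreliable", used
    return "undecided", used  # ran out of samples before either boundary
```

With `p0=0.5` and `p1=0.9`, six consecutive correct answers are enough to certify, while two consecutive errors are enough to reject, which shows why the expected cost depends on how far the oracle sits from the threshold.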

On the Stability of Prompt Ranking in Large Language Model Evaluation

This paper presents a systematic study on the stability of prompt rankings in large language models (LLMs), evaluating three open-weight models across two benchmark tasks. The authors find that while rank correlations are generally moderate to high, the top-performing prompt can vary significantly with minor changes in evaluation conditions. They propose a stability-aware selection strategy using a lower confidence bound to improve robustness in unstable settings, emphasizing the need to consider evaluation uncertainty in prompt selection and benchmarking for LLM practitioners.

arXiv cs.AI | 33 d ago | found 10 d ago | #prompt ranking #llm #evaluation
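
One way to read "stability-aware selection using a lower confidence bound" is sketched below. The summary does not give the authors' exact bound, so this uses a simple normal-approximation bound (mean minus `z` standard errors); the function names and the default `z` are assumptions.

```python
import math
import statistics
from typing import Mapping, Sequence


def lower_confidence_bound(scores: Sequence[float], z: float = 1.645) -> float:
    """Mean score minus z standard errors (normal approximation, assumed form)."""
    if len(scores) < 2:
        raise ValueError("need at least two evaluation runs per prompt")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - z * stderr


def select_prompt(runs: Mapping[str, Sequence[float]], z: float = 1.645) -> str:
    """Pick the prompt whose lower confidence bound is highest."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], z))


# Hypothetical scores from four evaluation runs per prompt.
runs = {
    "A": [0.90, 0.50, 0.95, 0.45],  # higher mean, very unstable
    "B": [0.68, 0.70, 0.69, 0.71],  # slightly lower mean, stable
}
best_by_mean = max(runs, key=lambda name: statistics.fmean(runs[name]))
best_by_lcb = select_prompt(runs)
```

Here selection by mean picks prompt A, while selection by lower bound picks prompt B, because A's advantage is smaller than its run-to-run noise.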

Transformation Behavior of Images in Latent Space

The paper investigates the transformation behavior of images in latent space for histopathology classification, focusing on encoder networks from Lunit Inc., Bioptimus, and the Meta Research Team. It finds that while embeddings of original and transformed images maintain proximity, indicating robustness, they are not entirely invariant to transformations, highlighting the need for tailored encoder training to enhance performance in downstream tasks. This research underscores the importance of understanding latent space behavior for improving data augmentation strategies in histopathological applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #latent space #image transformation

The African Language Tax: Quantifying the Cost, Latency, and Context Penalty of Tokenizing African Languages in Frontier LLMs

The article quantifies the tokenization penalties faced by speakers of African languages when using frontier large language models (LLMs), revealing that these languages incur a median tokenization premium of 1.88x compared to English, with certain scripts like N'Ko experiencing up to 8.92x. The study evaluates 20 African languages across various tokenizers, finding that the best tokenizer, Gemma 4, still leaves a significant premium of 2.38x. This disparity leads to increased inference costs and reduced context windows for African language users, highlighting a critical digital divide that affects language accessibility in AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #tokenization #african languages #llm
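
A tokenization premium of this kind is a ratio of token counts on parallel text. The sketch below assumes the premium is defined as target-language tokens divided by English tokens for the same content; the helper names are hypothetical, and the UTF-8 byte tokenizer is only a stand-in for a real model tokenizer.

```python
import statistics
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List]


def utf8_byte_tokenizer(text: str) -> List[int]:
    """Stand-in tokenizer: one token per UTF-8 byte (not a real LLM tokenizer)."""
    return list(text.encode("utf-8"))


def tokenization_premium(target: str, english: str, tokenize: Tokenizer) -> float:
    """Tokens needed for the target text divided by tokens for its English parallel."""
    english_tokens = len(tokenize(english))
    if english_tokens == 0:
        raise ValueError("English reference must produce at least one token")
    return len(tokenize(target)) / english_tokens


def median_premium(pairs: Iterable[Tuple[str, str]], tokenize: Tokenizer) -> float:
    """Median premium over (target, english) parallel sentence pairs."""
    return statistics.median(
        tokenization_premium(target, english, tokenize) for target, english in pairs
    )
```

Under this definition a premium of 2x means the same content costs twice as many tokens, so it also fills the context window twice as fast.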

Rapid FinFET Modelling Using an Autoencoder

This study introduces a machine learning framework utilizing an autoencoder (AE) for efficient FinFET modeling, calibrated with a BSIM-CMG model to generate a dataset of current-voltage characteristics. The autoencoder compresses full I-V curves into a low-dimensional latent space while incorporating parameters like drain to source voltage (VDS) to enhance bias-dependent variations. This method achieves high accuracy in reconstructing I-V curves and extracting key metrics such as threshold voltage (VTH) and peak transconductance (gm), offering a valuable tool for rapid device characterization and circuit simulation with minimal training data.

arXiv cs.AI | 33 d ago | found 10 d ago | #autoencoder #finfet

Uncertainty-Aware Longitudinal Forecasting of Alzheimer's Disease Progression Using Deep Learning

A new probabilistic framework for forecasting Alzheimer's disease progression has been proposed, incorporating a Temporal Fusion Transformer with a CORAL ordinal output layer and an autoregressive Mixture Density Network to generate five-year probabilistic trajectories for various clinical metrics. This model outperforms existing linear, recurrent, and transformer baselines, particularly in distinguishing between mild cognitive impairment and dementia, achieving approximately 90% credible interval coverage while effectively separating aleatoric from epistemic uncertainty. This advancement is significant for practitioners as it enhances the reliability of long-term predictions in clinical settings, providing deeper insights into the dynamics of disease progression and uncertainty management.

arXiv cs.AI | 33 d ago | found 10 d ago | #deep learning #alzheimer #forecasting

Evaluating LLM Usage for Efficient and Explainable Numerical and Classified Implicit Sentiment Analysis of Product Desirability

This paper presents a framework leveraging large language models (LLMs) for implicit sentiment analysis of product desirability, utilizing two datasets with 106 respondent groupings. The framework achieved Pearson correlations up to 0.97 and classification accuracy of 94%, with GPT-4o-mini demonstrating comparable performance to larger models at 94% lower cost. The approach enhances interpretability through model confidence ratings and human-readable explanations, making it a valuable tool for practitioners in product evaluation and development.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #sentiment analysis #product feedback

Critique of Agent Model

The article critiques the concept of agency in AI, particularly in the context of Large Language Models marketed as "agents." It presents the Goal-Identity-Configurator (GIC) architecture, which integrates hierarchical goal decomposition, identity evolution, simulative reasoning from a trained world model, and self-regulation to create a general-purpose agent model capable of true autonomy. This work is significant for AI practitioners as it clarifies the distinctions between engineered workflows and systems with endogenous capabilities, emphasizing the importance of internalized structures for developing autonomous AI systems while ensuring human oversight and safety.

arXiv cs.AI | 33 d ago | found 12 d ago | #agent-model #agency #llm

Cycle-Consistent Neural Explanation of Formal Verification Certificates

The paper introduces a cycle-consistent neural architecture designed to generate natural language explanations for formal verification certificates, comprising a forward network (NN1) and an inverse network (NN2) that together ensure faithful reconstruction of the certificates. Evaluated on 420 test certificates from various verification methods, the model achieves 90.0% cycle-verified soundness, significantly outperforming a multi-LLM few-shot baseline by 13.9 percentage points, while also providing 860x faster inference times and deterministic outputs. This advancement is crucial for practitioners as it enables efficient, offline explanations of verification results, enhancing accessibility for non-specialists without the reliance on cloud-based systems.

arXiv cs.AI | 33 d ago | found 12 d ago | #formal-verification #neural-networks #explanations

RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting

The Regime-Aware Variable-context Expert Network (RAVEN) has been introduced as a Mixture-of-Experts framework specifically designed for financial time series forecasting. RAVEN adapts its temporal context dynamically rather than relying on a fixed look-back window, utilizing a Cumulative Importance Thresholding mechanism to create nested context windows and incorporating a Global Compressed Representation for enhanced temporal coherence. Experimental results indicate that RAVEN outperforms state-of-the-art models, achieving significant improvements in Pearson correlation and mean squared error across multiple financial datasets, which is crucial for practitioners dealing with non-stationary financial data.

arXiv cs.AI | 33 d ago | found 10 d ago | #financial_forecasting #time_series

Large-Language-Model Discovery of Quantum LDPC Codes through Structured Concept Evolution

The paper presents a novel search framework called structured concept evolution (SCE) that utilizes large language models (LLMs) to discover quantum low-density parity-check (qLDPC) codes, specifically lifted-product code families. By employing a structured algebraic mutation grammar, SCE evolves algebraic specifications and executable programs, leading to the identification of competitive code families, including those over non-abelian groups, characterized under code-capacity depolarizing noise using BP+OSD decoding. This approach highlights the potential of LLMs in tackling complex design problems in quantum error correction, offering practitioners new methodologies for code development in quantum computing.

arXiv cs.AI | 33 d ago | found 10 d ago | #quantum #llm #codes

Ensemble Feature Selection and Harris Hawks Optimization for Explainable Mental Health Risk Prediction in Female Sex Workers

The paper presents a hybrid predictive model that combines ensemble feature selection using ANOVA and mutual information with Harris Hawks optimization-tuned logistic regression to predict mental health risks in female sex workers (FSWs). The model achieved an accuracy of 95.78%, an F1 score of 95.77%, and an AUC of 0.96 when tested on a dataset of 3,005 FSWs, outperforming traditional classifiers. This approach leverages explainable AI (XAI) to identify key trauma factors, enabling targeted psychosocial care and early intervention for vulnerable populations, thus advancing the application of machine learning in mental health risk assessment.

arXiv cs.AI | 33 d ago | found 12 d ago | #mental health #explainable AI #feature selection #machine learning

Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

The article introduces RV-Bench, a new evaluation methodology designed to assess large language models' mathematical reasoning capabilities using random variable questions (RVQs). By generating questions with randomized variable combinations that are "unseen" to the models, RV-Bench aims to measure genuine reasoning skills rather than memorization. Experiments conducted on over 30 LLMs with more than 1,000 RVQs indicate a significant proficiency gap between familiar and unseen data distributions, emphasizing the need for improved generalization techniques in mathematical reasoning tasks for LLMs.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #mathematical_reasoning #benchmark

AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models

The paper introduces a framework for AI tokenomics, analyzing the generation, consumption, pricing, and optimization of tokens in foundation models. It connects token-level costs to broader workflow production functions and highlights the distinction between token expenditure and economic value, which is influenced by factors such as marginal productivity and downstream effects. This framework is crucial for practitioners as it informs resource allocation and market design in AI systems, while also identifying key research areas like hidden-token measurement and dynamic allocation.

arXiv cs.AI | 33 d ago | found 10 d ago | #ai #tokenomics #foundation models

On the Smallness of the Large Language Models Scaling Exponents

The article discusses scaling exponents in Large Language Models (LLMs) and argues that their small values point to an unsustainable energy consumption regime. It critiques the notion that the smallness of these exponents is merely a numerical bias related to the "pedestal effect" and emphasizes that this explanation does not resolve the sustainability concerns. It also explores how data characteristics, such as smoothness and roughness, influence scaling exponents, drawing parallels with fluid turbulence models that may inform future model design and efficiency considerations for practitioners.

arXiv cs.AI | 33 d ago | found 12 d ago | #scaling #llm #sustainability

Ensemble Learning for Large Language Models in Text and Code Generation: A Survey

The article presents a survey on ensemble learning techniques for Large Language Models (LLMs) in text and code generation, categorizing methods into seven types: weight merging, knowledge fusion, mixture-of-experts, reward ensemble, output ensemble, routing, and cascading. It emphasizes the potential of these ensemble approaches to improve output quality, representation diversity, and application flexibility, addressing limitations of individual LLMs such as inconsistency and bias. This work is significant for practitioners as it provides a framework for selecting and implementing ensemble strategies, potentially enhancing performance in real-world applications and paving the way for multimodal LLM advancements.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #ensemble_learning #code_generation

A Fair Evaluation of Graph Foundation Models for Node Property Prediction

The paper presents a comprehensive evaluation of nine recent Graph Foundation Models (GFMs) for node property prediction, addressing the lack of standardized benchmarks in the field. Key findings indicate that only the latest GFMs leveraging the Prior-data Fitted Networks paradigm surpass well-tuned Graph Neural Networks (GNNs) in predictive performance, albeit with increased inference costs. This work is crucial for practitioners as it provides a clearer understanding of the trade-offs between GFMs and GNNs, facilitating informed model selection for applications in fraud detection and recommendation systems.

arXiv cs.AI | 33 d ago | found 10 d ago | #graph #node-prediction #evaluation

Deep Learning Approaches for 3D Medical Scene Completion: From Geometric Modeling to Generative Paradigms

This study presents a systematic review of advancements in 3D scene completion over the past decade, highlighting the transition from voxel semantic completion models like SSCNet to contemporary approaches integrating generative diffusion priors and real-time rendering via Gaussian splatting techniques. It covers various representation paradigms, including voxel grids, point learning, implicit neural fields, transformer networks, and diffusion networks, while also proposing a taxonomy for better understanding of the field's evolution and outlining future research directions. This comprehensive analysis is crucial for practitioners looking to adopt or improve upon existing methodologies in 3D scene understanding and related applications in robotics and augmented reality.

arXiv cs.AI | 33 d ago | found 10 d ago | #3d scene completion #deep learning #computer vision

End-to-End Radar and Communication Modulation Recognition with Neuromorphic Computing

The article presents EMRFormer, a novel end-to-end spiking neural network (SNN) architecture designed for automatic modulation recognition (AMR) on neuromorphic hardware. EMRFormer combines an adaptive spike encoder, Integer Leaky Integrate-and-Fire neurons, and spike-separable Convolutional Neural Networks integrated with Spike-Driven Transformers, achieving state-of-the-art accuracy while reducing theoretical energy consumption by over 90%. The model is validated on various datasets and achieves up to a 5x reduction in power usage compared to traditional GPUs, making it a compelling solution for AMR in resource-constrained environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #neuromorphic #snn #deep learning #amr

Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War

The article introduces the "Age of LLM," a new turn-based benchmark designed for evaluating the reasoning, diplomacy, and reliability of large language models (LLMs) in a competitive setting on a 13x7 grid. It emphasizes the impact of fog of war and strict JSON schema adherence, benchmarking 15 models across 54 matches and 5,258 actions, revealing insights such as the dominance of nuclear strategies and the relationship between reliability and winning outcomes. This benchmark provides a unique framework for practitioners to analyze LLM behavior under adversarial conditions, particularly in terms of belief tracking and cognitive strategies, with resources available for further exploration.

arXiv cs.AI | 33 d ago | found 12 d ago | #benchmark #llm #reasoning #diplomacy

SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization

The article introduces SURGELLM, a unified transformer framework designed to enhance performance across diverse NLP tasks by addressing issues such as inductive bias mismatch and class-imbalance in feature statistics. Key innovations include a surgical feature gate, task-conditioned prefix tokens, and Instance-Weighted Normalization (IWN), which collectively improve macro-F1 scores, achieving 0.940 in benchmarks across four tasks. This framework is significant for practitioners as it provides a method to optimize multi-task learning by leveraging task-specific features and normalization techniques, potentially leading to more robust model performance in heterogeneous environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #multi-task #evaluation #transformer

Synergizing Physically Constrained MCMC and Chemical-Informed Gaussian Processes for Reaction Network Discovery

The paper introduces PC-MCMC-CIGP, a gray-box workflow that integrates spike-and-slab topology sampling with Chemical-Informed Gaussian Processes (CIGP) for enhancing reaction network discovery from sparse chemical data. It demonstrates improved parameter calibration and experimental design, achieving a 12.5% increase in yield on styrene epoxidation compared to a Gaussian Process Bayesian Optimization baseline. This approach is significant for practitioners as it effectively combines MCMC and GP methods under physical constraints, optimizing decision-making in experimental setups while addressing uncertainty in chemical reactions.

arXiv cs.AI | 33 d ago | found 10 d ago | #mcmc #gaussian_processes #chemical_reaction

Ten Digits on a Train: AI-Assisted Verification of Two Eigenvalue Problems

The article presents a human-AI collaboration that successfully verified numerical eigenvalues in challenging settings, specifically achieving ten decimal places of accuracy for a singular self-adjoint Schrödinger operator and resolving a non-normal atom-molecule resonance pair. The latter was accomplished by reformulating the problem into a global matching system for projective solution lines, utilizing a Krawczyk-Brouwer inclusion for certification. This work highlights the potential of AI in enhancing mathematical verification processes while emphasizing the necessity for rigorous standards in validation and the critical role of human oversight in mathematical proofs.

arXiv cs.AI | 33 d ago | found 10 d ago | #eigenvalue_problems #human_ai_collaboration

DTT-BSR+: A Generative-Regression Cascade for Music Source Restoration

DTT-BSR+ is a newly proposed two-stage cascade system for music source restoration (MSR) that separates the processes of distribution fitting and signal reconstruction. The first stage utilizes a generative DTT-BSR separator to produce clean source stems, while the second stage employs a modified Demucs network to refine these outputs using time-domain and multi-resolution spectral losses. This approach achieves superior multi-mel signal-to-noise ratio (MMSNR) compared to the previous DTT-BSR model and outperforms the state-of-the-art X-LANCE system, highlighting a significant advancement in balancing signal reconstruction accuracy with semantic consistency in MSR tasks.

arXiv cs.AI | 33 d ago | found 10 d ago | #music restoration #generative models #source separation

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

The paper introduces CORE-Bench, a benchmark designed to assess the computational reproducibility of AI agents in scientific research, comprising 270 tasks derived from 90 studies across multiple disciplines. It evaluates agents like AutoGPT and CORE-Agent using GPT-4o and GPT-4o-mini, with the top agent achieving only 21% accuracy on the most challenging tasks, highlighting significant room for advancement in automating scientific processes. This benchmark is crucial for enhancing reproducibility in research, enabling the development of more capable AI agents that can not only replicate but also innovate in scientific inquiry.

arXiv cs.AI | 33 d ago | found 10 d ago | #agents #reproducibility #benchmark

Grounded Chess Reasoning in Language Models via Master Distillation

The paper introduces a framework called Master Distillation for enhancing language models' reasoning capabilities in specialized domains, exemplified by a 4B parameter model named C1 applied to chess. C1 achieved 48.1% accuracy, outperforming existing open-source and proprietary models, while generating explanations with significantly fewer tokens than baseline methods. This approach captures the full reasoning process of expert systems, enabling compact models to produce transparent, explainable solutions, which is crucial for practitioners seeking to integrate grounded reasoning in AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #grounded-reasoning #llm #chess

Legal Reasoning Is Not Lawyering: Rethinking Legal Benchmarks for Pro Se Access to Justice

The article critiques current legal AI benchmarks, arguing they primarily assess large language models (LLMs) under idealized conditions set by legal experts, rather than the more challenging scenarios faced by pro se litigants. It highlights the need for benchmarks that evaluate model robustness against noisy, incomplete, and error-prone inputs typical of self-represented individuals, citing issues like long-context sensitivity and hallucination. The authors advocate for developing metrics that accurately reflect model performance in these real-world conditions to ensure that claims about improving access to justice through legal AI are substantiated.

arXiv cs.AI | 33 d ago | found 10 d ago | #legal ai #benchmarks #llm

From "Aha Moments" to Controllable Thinking: Toward Meta-Cognitive Reasoning in Large Reasoning Models via Decoupled Reasoning and Control

The article introduces MERA, a meta-cognitive reasoning framework designed for Large Reasoning Models (LRMs) that decouples reasoning from control to enhance efficiency and accuracy. MERA utilizes a takeover-based pipeline to generate high-quality reasoning-control supervision data and employs Control-Segment Policy Optimization (CSPO) for training, allowing for independent optimization of control strategies. This approach addresses the issue of redundant reasoning in LRMs, reducing inference costs and latency, which is crucial for practical deployment in AI applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #meta-cognition #reasoning #llm

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

The article announces the release of PHANTOM, a large-scale open-source dataset consisting of 47,524 pre-generated adversarial samples specifically designed for evaluating vision-language models (VLMs). Covering 10 high-level categories and 55 subcategories of harmful intents, the dataset consolidates existing benchmarks and introduces new categories to enhance the evaluation of model robustness and safety. This resource aims to facilitate systematic assessments of VLMs under adversarial conditions, supporting researchers in fine-tuning attack-generation models and developing defensive strategies.

arXiv cs.AI33 d agofound 12 d ago#adversarial-attacks#vision-language-models#dataset

The Geometry Behind Diffusion and Flow Matching: Gradient Flows and Geodesics in Wasserstein Space

The paper presents a unified geometric framework for understanding diffusion models and Flow Matching within the context of Wasserstein space, specifically $\mathcal{P}_2(\mathbb{R}^d)$. It establishes that diffusion models can be viewed as gradient flows of free energy, utilizing the Fokker-Planck equation and the JKO scheme, while Flow Matching operates along geodesics defined by the Benamou-Brenier formula. This integration of both approaches on a single Riemannian manifold highlights their relationship, allowing for more efficient sampling in generative processes by treating them as deterministic ODEs along optimal paths.

arXiv cs.AI33 d agofound 12 d ago#geometry#diffusion#gradient flows#Wasserstein
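
For orientation, the objects named in this summary have standard textbook forms, sketched below. They are not copied from the paper; the potential $V$, the step size $\tau$, and the unit diffusion coefficient are assumptions made for illustration.

```latex
% Free energy on P_2(R^d) and its Wasserstein gradient flow (Fokker-Planck):
F(\rho) = \int V \rho \, dx + \int \rho \log \rho \, dx,
\qquad
\partial_t \rho = \nabla \cdot (\rho \nabla V) + \Delta \rho .

% JKO scheme: implicit Euler step of that gradient flow with step size \tau:
\rho_{k+1} \in \arg\min_{\rho} \; F(\rho) + \frac{1}{2\tau} W_2^2(\rho, \rho_k) .

% Benamou-Brenier formula: W_2 as least kinetic energy along a path of densities:
W_2^2(\mu_0, \mu_1)
  = \min_{(\rho_t, v_t)} \int_0^1 \!\! \int |v_t(x)|^2 \rho_t(x) \, dx \, dt
\quad \text{s.t.} \quad
\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0, \;\; \rho_0 = \mu_0, \;\; \rho_1 = \mu_1 .
```

Minimizers of the last problem are the constant-speed geodesics along which, per the summary, Flow Matching operates.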

Tractable Reasoning and Conjunctive Query Answering for Defeasible DL-Lite under Rational Closure

This paper presents a novel plug-in architecture for efficient reasoning and conjunctive query answering under Rational Closure (RC) in the DL-Lite family of description logics. It demonstrates that both instance checking and CQ answering can be performed with minimal computational overhead by leveraging existing classical reasoners. This development is significant for practitioners as it enhances the capability of lightweight description logics to handle defeasible knowledge, thereby improving the efficiency of knowledge representation systems.

arXiv cs.AI33 d agofound 12 d ago#description logics#reasoning#non-monotonic

Exploring Academic Influence of Algorithms by Co-occurrence Network Based on Full-text of Academic Papers

This study presents a large-scale analysis of algorithm co-occurrence networks in natural language processing (NLP), utilizing deep learning models to extract algorithm entities from over four decades of academic papers. By constructing cumulative and annual co-occurrence networks, the research reveals structural characteristics and centrality measures that highlight the collective influence of algorithms, showing that classic and interdisciplinary algorithms maintain high centrality and popularity. This work lays the groundwork for understanding algorithmic influence in a network context, which is crucial for practitioners aiming to navigate the evolving landscape of AI research and applications.

arXiv cs.AI33 d agofound 12 d ago#algorithm influence#co-occurrence network#NLP#academic papers

Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

Neuro-Symbolic Drive introduces a framework for driving Vision-Language-Action models (VLAs) that integrates rule-grounded reasoning from classical planners with Chain-of-Thought (CoT) reasoning. By fine-tuning the Qwen3.5-4B model on structured reasoning traces from rule-based planners, the framework improves results on a simulator-generated benchmark, reducing Average Displacement Error (ADE) from 0.47 to 0.26 and miss rate from 8.30% to 6.40% with three-camera perception. This approach strengthens the causal connection between reasoning and motion generation, offering a structured supervision method that could help practitioners build more reliable and interpretable driving AI systems.

arXiv cs.AI33 d agofound 12 d ago#neuro-symbolic#driving#ai
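
ADE is conventionally the mean Euclidean distance between predicted and ground-truth waypoints. A minimal sketch under that assumption follows; the function name, the 2D waypoint format, and the units are illustrative and not taken from the paper.

```python
import math

def average_displacement_error(pred, truth):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # math.dist gives the Euclidean distance between two points (Python 3.8+)
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

# Hypothetical three-step trajectory: per-step errors are 0, 1 and 2
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(average_displacement_error(pred, truth))
```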

Invariant Graph Representations for Continuous-Time Dynamic Graphs Under Distribution Shifts

The article presents CIR, a novel framework for learning invariant representations in Continuous-Time Dynamic Graphs (CTDGs) under out-of-distribution (OOD) shifts, utilizing a structural causal model called ICCM. It incorporates the Normalized Weighted Geometric Mean (NWGM) for efficient interventional predictions and employs a deep learning architecture with subgraph extractors and an environment memory bank to handle distributional shifts. This advancement is significant for practitioners as it enhances the robustness and applicability of CTDG models in dynamic environments, addressing limitations of existing methods in OOD scenarios.

arXiv cs.AI33 d agofound 10 d ago#dynamic_graphs#OOD#representation_learning

Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate

The paper presents a critical analysis of vanilla conditional diffusion models in the context of compositional generation, arguing that they struggle to extrapolate to target distributions defined by combinations of source distributions. The authors provide theoretical insights and experimental evidence indicating that score estimation errors significantly hinder performance, particularly when dealing with out-of-distribution targets, thus suggesting the necessity for alternative methodologies. This work is relevant for AI practitioners as it highlights the limitations of current diffusion models and the need for improved approaches in generative tasks involving unseen combinations of data.

arXiv cs.AI33 d agofound 10 d ago#compositional_generation#diffusion_models

Cost-Optimal Decision Diagrams for Stochastic Boolean Function Evaluation

The paper presents a novel branch-and-bound algorithm for the cost-optimal evaluation of stochastic Boolean functions, addressing the challenge of minimizing expected evaluation costs under variable costs and probabilistic truth assignments. This marks the first practical exact algorithm capable of handling such generality, with experimental results demonstrating its scalability and efficiency, alongside a greedy beam-search variant. The findings are significant for practitioners as they provide a new method for decision-making processes in AI applications where cost and efficiency are critical, particularly in domains like medical diagnosis.

arXiv cs.AI33 d agofound 10 d ago#decision diagrams#stochastic functions
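
To make the objective concrete, here is a small sketch of expected evaluation cost for the simplest special case, a conjunction of independent variables tested in a fixed order. It is not the paper's branch-and-bound algorithm; the `(cost, prob_true)` tuple format and function names are assumptions for illustration.

```python
def expected_cost_of_and(tests):
    """tests: ordered list of (cost, prob_true); evaluation stops at the first False."""
    expected, prob_reached = 0.0, 1.0
    for cost, prob_true in tests:
        expected += prob_reached * cost   # pay for this test only if we get this far
        prob_reached *= prob_true         # we continue only when the test came out True
    return expected

def ratio_order_for_and(tests):
    """Classic ordering rule for this special case: ascending cost / P(False)."""
    def ratio(test):
        cost, prob_true = test
        return cost / (1.0 - prob_true) if prob_true < 1.0 else float("inf")
    return sorted(tests, key=ratio)

tests = [(10.0, 0.9), (1.0, 0.5)]
print(expected_cost_of_and(tests))
print(expected_cost_of_and(ratio_order_for_and(tests)))
```

General Boolean functions need adaptive strategies (decision diagrams) rather than a fixed order, which is the harder setting the summary describes.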

Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

The study presents a computational analysis of Tang-dynasty poetry to predict the geographic origins of poets based on linguistic features. By constructing a corpus of 357 poets and employing character n-gram TF-IDF alongside interpretable features, the authors achieved a classification accuracy of 69% for broad regional origins, surpassing the majority baseline of 53%. The findings highlight the influence of geographic and temporal factors on poetic language, revealing a distance-decay effect and historical shifts in regional styles. They also show that traditional methods like TF-IDF can capture these linguistic signals, with implications for the use of machine learning in literary analysis.

arXiv cs.AI33 d agofound 10 d ago#linguistic analysis#poetry#classification
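
As a rough illustration of the feature type and the baseline mentioned above, the sketch below builds plain character n-gram TF-IDF vectors and computes a majority-class accuracy. The unsmoothed weighting, the bigram default, and the function names are assumptions; the paper's exact configuration is not given in this summary.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=2):
    """Sparse TF-IDF dicts: relative term frequency times log(N / document frequency)."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)  # each n-gram counted once per doc
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({g: (k / total) * math.log(len(docs) / doc_freq[g])
                        for g, k in c.items()})
    return vectors

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

vectors = tfidf_vectors(["abab", "abcd"])
print(vectors[0])
print(majority_baseline_accuracy(["north", "north", "south"]))
```

An n-gram that appears in every document gets weight zero, which is why regionally distinctive character sequences dominate such vectors.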

EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence

The Evidence-Grounded Video Question Answering Benchmark (EG-VQA) addresses the gap between answer correctness and evidence grounding in VideoQA. It consists of 2,067 videos and 11,838 QA pairs with detailed temporal evidence annotations. The benchmark employs a new metric, Evidence-Grounded F1 (EG-F1), to evaluate both temporal alignment and semantic consistency of predictions against ground-truth evidence. Results indicate that existing models, including proprietary ones, struggle with evidence localization, highlighting the need for structured evidence supervision. The proposed EG-Reasoner model addresses this need and achieves state-of-the-art performance among open-source models, particularly on reasoning-intensive tasks.

arXiv cs.AI33 d agofound 10 d ago#video-llm#benchmark#qa

RetiSEM: Generalising Causal Models for Fragmented Biomedical Data

RetiSEM is a newly proposed domain-constrained structural equation modeling (SEM) framework designed for causal graph recovery and mediation analysis in fragmented biomedical data. It organizes variables into biologically informed blocks, applies forbidden-edge constraints, and decomposes effects into total, natural direct, and natural indirect components. In evaluations across ten synthetic benchmarks and a real-world dataset, RetiSEM demonstrated lower structural error and improved causal accuracy compared to unconstrained baselines, making it a valuable tool for practitioners in biomedical AI facing incomplete data.

arXiv cs.AI33 d agofound 10 d ago#causal-models#biomedical#data

Beyond the Autoregressive Horizon: A Comprehensive Survey of Diffusion Models, World Modelling, and State Space Models for Code

The paper presents a survey on the limitations of autoregressive (AR) models in automated software engineering and explores alternative paradigms such as Diffusion Models, Code World Models (CWMs), and State Space Models (SSMs). Diffusion Models address the shortcomings of AR by enabling holistic denoising for long-range syntactic constraints, while CWMs and SSMs enhance reasoning and efficiency in code generation. This research is significant for practitioners as it highlights potential architectural advancements that could improve code intelligence and reasoning capabilities in AI systems.

arXiv cs.AI33 d agofound 10 d ago#diffusion models#code generation#autoregressive models

FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

FISHER, a foundation model for multi-modal industrial signal representation, addresses the M5 problem of data heterogeneity through a novel sub-band modeling approach that effectively manages variable sampling rates without resampling. Pre-trained via teacher-student self-distillation on external audio and music data, FISHER demonstrates superior performance against 24 state-of-the-art series encoders, achieving up to 16x smaller model sizes while maintaining high diagnostic accuracy. The establishment of the RMIS benchmark, which includes 19 datasets across four modalities, provides a robust framework for evaluating multi-modal industrial signal processing, making FISHER a significant advancement for practitioners in the field.

arXiv cs.AI33 d agofound 10 d ago#foundation_model#industrial_signals#data_analysis

Infinitesimal Causality

The paper presents a categorical framework for infinitesimal causality within Frobenius Markov categories, utilizing tangent-bundle semantics to model interventions as tangent deformations in information structures. It defines categorical causal sufficiency and explores the interaction between categorical and geometric Frobenius structures, highlighting the role of interventions as tangent vectors that influence classical information flow. This work is significant for practitioners as it provides a formalism that enhances understanding of causal inference and intervention modeling, which could improve the design of algorithms in structural causal models.

arXiv cs.AI33 d agofound 10 d ago#causality#semantics#interventions

Neuromorphic Speech Enhancement with Dual-Branch Spiking Neural Networks

The article introduces GSU-DBNet, a dual-branch spiking neural network architecture designed for neuromorphic speech enhancement, featuring a gated spiking unit (GSU). This model simultaneously processes the speech magnitude and complex spectra, achieving a PESQ score of 3.04 with just 394K parameters, which is significantly fewer than traditional ANN models (4.5%–10.6% of their parameters). This advancement in SNN architecture enhances energy efficiency and spatiotemporal feature representation, making it a relevant development for practitioners focused on efficient AI speech processing solutions.

arXiv cs.AI33 d agofound 10 d ago#spiking_neural_networks#speech_enhancement#neuromorphic

Graph Alignment for Benchmarking Graph Neural Networks and Learning Positional Encodings

This article introduces a new benchmarking methodology for graph neural networks (GNNs) centered on the graph alignment problem, which seeks to maximize overlapping edges between unlabeled graphs. The authors present techniques for generating graph alignment datasets of varying difficulty, demonstrating that anisotropic models outperform isotropic ones in structure-only tasks. Additionally, the study reveals that node embeddings from self-supervised GNN pre-training can serve as effective positional encodings for transformers, achieving 98% accuracy in graph structure reconstruction, and provides an open-source Python package for dataset generation and benchmarking.

arXiv cs.AI33 d agofound 10 d ago#graph_neural_networks#benchmarking#positional_encodings
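
The alignment objective described above, maximizing overlapping edges between two unlabeled graphs, can be sketched by brute force for tiny graphs. This is only an illustration of the objective, not the authors' dataset generator or package; the function names and edge-list format are assumptions.

```python
from itertools import permutations

def overlapping_edges(edges_a, edges_b, mapping):
    """Count edges of graph A that land on an edge of graph B under a node mapping."""
    b = {frozenset(e) for e in edges_b}  # undirected edges
    return sum(frozenset((mapping[u], mapping[v])) in b for u, v in edges_a)

def best_alignment(nodes, edges_a, edges_b):
    """Exhaustive search over all node permutations; only feasible for very small graphs."""
    best_score, best_map = -1, None
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        score = overlapping_edges(edges_a, edges_b, mapping)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_score, best_map

path_a = [(0, 1), (1, 2)]   # path 0-1-2
path_b = [(0, 2), (2, 1)]   # path 0-2-1
print(best_alignment([0, 1, 2], path_a, path_b))
```

The factorial search space is what makes the problem a demanding structure-only task for learned models.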

Can Scale Save Us From Plasticity Loss in Large Language Models?

The study investigates plasticity loss in GPT-style Transformer models, focusing on their ability to adapt to new information after prior learning. Analyzed models ranged from 5M to 314M parameters, revealing that plasticity loss occurs even in larger architectures and follows a sublinear scaling law with model size. These findings indicate that while larger models may mitigate the effects of plasticity loss, simply increasing parameter count is insufficient to prevent this issue, impacting the design of continual learning systems in natural language processing.

arXiv cs.AI33 d agofound 10 d ago#plasticity loss#llm#continual learning

BioPIE: A Biomedical Protocol Information Extraction Dataset for Experiment Understanding

The Biomedical Protocol Information Extraction Dataset (BioPIE) has been released to enhance the extraction of structured knowledge from biomedical experiments, addressing challenges such as High Information Density (HID) and Multi-Step Reasoning (MSR). BioPIE provides procedure-centric knowledge graphs that detail entities, actions, and relations, facilitating fine-grained understanding of experimental protocols. This dataset allows for improved evaluation of information extraction methods and supports the development of question answering systems, ultimately aiding practitioners in laboratory automation and cross-disciplinary communication.

arXiv cs.AI33 d agofound 10 d ago#biomedical#data-extraction

Towards Version-aware Operations and Transaction Memories for Multi-layer MeMo

The paper introduces MeMo, a framework utilizing multi-layer correlation matrix memories (CMMs) to facilitate version-aware operations in language models, allowing for efficient knowledge updates without full retraining. It proposes a version-aware operation layer that includes high-level functions such as replace, obsolete, and rollback, which are implemented as primitive calls over sequences and tokens. This architecture aims to enhance the adaptability of language models by enabling structured edits and maintaining historical data, thereby improving the efficiency and effectiveness of knowledge management in AI systems.

arXiv cs.AI33 d agofound 10 d ago#language_models#memory

Multimedia and Visual Analytics in the Agentic Era

The paper presents a framework aimed at integrating multimedia and visual analytics to enhance actionable insights for professional users handling large multimedia collections. It highlights the need for improved accuracy, trustworthiness, and reasoning capabilities in foundation models and AI agents, suggesting a shift from purely algorithmic improvements to comprehensive multimedia analytics systems. This approach is significant for practitioners as it emphasizes the importance of user-centric design in AI tools, facilitating better collaboration between humans and AI in complex analytical tasks.

arXiv cs.AI33 d agofound 10 d ago#multimedia#visual_analytics#AI_agents

Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

The authors present a novel evaluation framework for Sparse Autoencoders (SAEs) that quantifies the semantic alignment between SAE latents and human-annotated concepts, using a new method called Fully-Binary Matching Pursuit (FBMP). They introduce synthetic benchmarks, synCUB and synCOCO, to facilitate targeted attribute perturbations and propose the Targeted Attribute Perturbation Alignment Score (TAPAScore) to assess the interpretability of SAEs trained on CLIP and DINOv2 embeddings. This framework enables practitioners to better evaluate and optimize SAEs for interpretability, suggesting that moderate dictionary sizes yield the best performance in aligning concepts with human understanding.

arXiv cs.AI33 d agofound 10 d ago#autoencoders#interpretability#evaluation

Beyond Bayer: Task-Optimal Sensor Co-Design for Robust Autonomous-Driving Segmentation

The paper presents an approach to sensor co-design for autonomous driving segmentation, emphasizing the importance of optimizing camera measurements rather than relying solely on larger models. It introduces a differentiable RAW-to-task pipeline that learns optimal spectral colour-filter-array (CFA) weights, improving mean Intersection over Union (mIoU) by +0.017 on the KITTI-360 dataset and +0.023 on ACDC, while reporting a negative result for co-designing the optics. This work is significant for practitioners as it highlights the potential for sensor-level optimizations to enhance model performance in diverse environmental conditions, independent of the downstream model architecture.

arXiv cs.AI33 d agofound 10 d ago#autonomous driving#sensor co-design#segmentation

The impact of generative artificial intelligence on academic development of Chinese students in humanities and social sciences

This study investigates the impact of generative artificial intelligence (GenAI) on the academic development of humanities and social sciences students in China through a large-scale survey. Key findings indicate that while over half of the students reported enhanced motivation and academic performance due to GenAI, concerns about accuracy and overreliance persist, alongside a call for ethical considerations and improved privacy protections. The results highlight the need for thoughtful curricular integration of GenAI, emphasizing practice-oriented training to maximize its educational potential while addressing the diverse experiences and challenges faced by students.

arXiv cs.AI33 d agofound 10 d ago#generative ai#education#academic development

A Pāṇinian Foundation for Indic Language Processing

The article proposes a Pāṇinian framework for natural language processing (NLP) in Indic languages, highlighting the shared morphosyntactic architecture derived from Pāṇini's grammar, the Aṣṭādhyāyī. It introduces a four-part benchmark suite aimed at improving the accuracy, data efficiency, and transferability of NLP systems for these languages by consolidating disparate resources into a unified framework. This approach could enhance model interpretability by examining whether neural models inherently capture Pāṇini's linguistic categories, which is crucial for practitioners developing robust AI applications in this domain.

arXiv cs.AI33 d agofound 10 d ago#indic languages#natural language processing#computational architecture

Faithful by Construction: Claim-Anchored Attribution for Multi-Document Summarization

The article presents the Claim-Anchored Multi-document Summarization (CAMS) framework, which enhances multi-document summarization by providing fine-grained attribution and reducing hallucination in large language models (LLMs). CAMS operates through a modular Extract-Select-Rewrite process that extracts atomic claims with token-level provenance, clusters them, and rewrites summaries with clear links to source documents. It reports significant improvements in faithfulness and citation precision, lifting multi-source attribution accuracy by approximately 66%. This framework is crucial for practitioners as it offers a structured approach to ensure factual integrity in generated summaries, addressing common issues with traditional end-to-end LLMs.

arXiv cs.AI33 d agofound 10 d ago#multi_document_summarization#llm

Structural Kolmogorov-Arnold Convolutions: Learnable Function on the Values or the Filter Shape as Parameter-Efficient Alternative to Per-Edge Convolutional KANs

The article presents Structural Kolmogorov-Arnold Networks (KANs), which introduce a parameter-efficient approach by placing learnable functions in the convolution structure rather than on each edge. Three models are studied: SV-KAN, AG-KAN, and RF-KAN, with RF-KAN achieving 88.47% accuracy on CIFAR-10 using approximately 0.4M parameters, outperforming traditional convolutional methods and demonstrating the importance of content-adaptive filter shapes. This work highlights a significant reduction in parameters while maintaining high performance, making it relevant for practitioners seeking efficient architectures in deep learning.

arXiv cs.AI33 d agofound 10 d ago#convolutional networks#parameter-efficient

Exploring Dualistic Meta-Learning to Enhance Domain Generalization in Open Set Scenarios

The paper introduces a novel meta-learning strategy called MEDIC (dualistic MEta-learning with joint DomaIn-Class matching) aimed at enhancing domain generalization in open set scenarios, where label mismatches occur between source and target domains. MEDIC employs implicit gradient matching to optimize decision boundaries for both domains and classes, addressing the imbalance in sample distribution that affects traditional one-vs-all classifiers. Experimental results demonstrate that MEDIC outperforms existing methods in open set scenarios while retaining competitive performance in closed set generalization, making it a valuable approach for practitioners dealing with unseen classes in real-world applications.

arXiv cs.AI33 d agofound 10 d ago#meta_learning#domain_generalization#open_set

OpenMythos Benchmarks

The OpenMythos benchmarks have been released, showcasing performance across SWE-bench Pro, CyberGym, and cybench. Notably, the Qwen 3.6 27B model achieved a SWE Verified score of 75; the post attributes discrepancies with previously reported scores to changes in evaluation methods. The OpenMythos model, designed for cybersecurity tasks, shows promising capabilities, and the results suggest that further training could improve its performance, making it a relevant tool for practitioners in the AI security domain.

Reddit r/LocalLLaMA33 d agofound 21 d ago#openmythos#benchmarks

How GPT-5 helped immunologist Derya Unutmaz solve a 3-year-old mystery

GPT-5 Pro provided significant insights into T cell behavior, aiding immunologist Derya Unutmaz in resolving a three-year-old research mystery. This advancement may enhance understanding in cancer and autoimmune disease research, demonstrating the potential of large language models in complex scientific inquiries.

OpenAI News33 d agofound 21 d ago#gpt-5#immunology#breakthrough

Show HN: The Cascade Graph – An interactive map of AI and energy constraints

The Cascade Graph is an interactive visualization tool designed to illustrate the relationships between artificial intelligence developments and energy constraints. It enables practitioners to explore how energy consumption impacts AI model training and deployment, providing insights into optimizing AI workflows with energy efficiency in mind. This resource is significant for AI engineers looking to balance performance with sustainability in their projects.

Hacker News33 d agofound 12 d ago#ai#energy-constraints

Baidu: One-shot Long-horizon Parsing

Baidu has introduced a novel approach for one-shot long-horizon parsing, which aims to enhance the efficiency of parsing tasks over extended sequences. The framework leverages advanced techniques to allow for parsing with minimal examples, potentially improving performance in applications requiring long-context understanding. This development is significant for AI practitioners as it offers a new method to handle complex parsing scenarios with reduced training data, thereby optimizing resource usage in model training and deployment.

Reddit r/LocalLLaMA33 d agofound 21 d ago#baidu#parsing

Unlimited OCR: One-shot long-horizon parsing

The article introduces "Unlimited OCR," a new approach for long-horizon parsing that enables one-shot recognition of text in images. It leverages advanced algorithms to improve parsing accuracy and efficiency, making it suitable for applications that require processing extensive text data in a single pass. This development is significant for practitioners as it enhances the capabilities of OCR systems, potentially reducing the need for multiple passes and improving processing times in real-world applications.

Hacker News33 d agofound 12 d ago#ocr#parsing

I love GLM 5.2's attitude! It is a nice refresher from those bootlicker doormats they are feeding us. Does that come from training datasets related to the local culture?

GLM 5.2 has been noted for its direct and concise interaction style, contrasting with other models that may exhibit more agreeable or "sugar-coated" responses. The discussion raises questions about the influence of training datasets on model behavior, suggesting that cultural differences may significantly affect how AI models like GLM 5.2 perform in terms of attitude and focus. Understanding these cultural nuances can be crucial for practitioners when selecting or fine-tuning models for specific applications.

Reddit r/LocalLLaMA34 d agofound 21 d ago#glm-5.2#culture#training

Is there any reason for a lack of love for Gemma 4 26b?

The discussion highlights a perceived lack of attention towards the Gemma 4 26b model compared to Qwen 3.6 27b and 35b, particularly regarding its performance in tasks like retrieval-augmented generation (RAG) and personal assistant applications. Users are exploring the potential of Gemma 4 26b for these use cases but express concerns over its diminished popularity and comparative lack of community discourse. This raises questions about its effectiveness and suitability for practitioners focusing on building versatile AI applications on constrained hardware like the Nvidia 3090.

Reddit r/LocalLLaMA34 d agofound 21 d ago#gemma#model#discussion

Dual-Stream EEG Decoding for 3D Visual Perception

The paper presents a dual-stream EEG decoding model designed for 3D shape perception, inspired by the biological vision pathways. It utilizes separate modules for decoding object identity and spatial orientation, employing circular regression for angle prediction and EEG-conditioned multiview diffusion for 3D reconstruction. This work is significant for practitioners as it enhances understanding of neural mechanisms in 3D perception and offers a novel approach to decoding complex visual information from EEG signals.

arXiv cs.AI34 d agofound 16 d ago#brain decoding#3D perception#EEG
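
Circular regression typically avoids the wrap-around at 0°/360° by predicting a (cos, sin) pair and measuring error on the circle. A minimal sketch of that common formulation follows; it is an assumption about the general technique, not the paper's exact loss, and the function names are illustrative.

```python
import math

def decode_angle(cos_hat, sin_hat):
    """Recover an angle in [0, 2*pi) from predicted cosine and sine components."""
    return math.atan2(sin_hat, cos_hat) % (2 * math.pi)

def circular_error(pred, target):
    """Smallest absolute angular difference in radians, respecting wrap-around."""
    diff = (pred - target + math.pi) % (2 * math.pi) - math.pi
    return abs(diff)

# 350 degrees vs 10 degrees are 20 degrees apart on the circle, not 340
print(math.degrees(circular_error(math.radians(350), math.radians(10))))
print(math.degrees(decode_angle(0.0, -1.0)))
```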

Quality and Agreement in Multilabel Emotion Annotation: A Case Study and Evaluation Framework

The paper presents a framework for multilabel emotion annotation that addresses the subjectivity of emotion labels by employing soft vote-share labels instead of traditional hard labels. It evaluates the impact of various aggregation methods on agreement estimates and emotion classifiers, demonstrating that soft supervision can capture annotator variance and improve model predictions, especially in scenarios with inherent ambiguity. This work is significant for practitioners in NLP as it offers insights into designing and evaluating emotion datasets, ultimately enhancing the robustness of emotion classification systems.

arXiv cs.CL34 d agofound 13 d ago#emotion#annotation#nlp
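
The difference between soft vote-share labels and hard labels can be shown in a few lines. The annotation format (one set of emotions per annotator), the 0.5 threshold, and the function names are assumptions for illustration; the paper compares several aggregation methods not reproduced here.

```python
def vote_share_labels(annotations, emotions):
    """Fraction of annotators who selected each emotion for one item."""
    n = len(annotations)
    return {e: sum(e in a for a in annotations) / n for e in emotions}

def hard_labels(soft, threshold=0.5):
    """Collapse vote shares to binary labels; disagreement information is lost."""
    return {e: int(share >= threshold) for e, share in soft.items()}

# Four hypothetical annotators labeling one text
annotations = [{"joy"}, {"joy", "surprise"}, {"surprise"}, {"joy"}]
soft = vote_share_labels(annotations, ["joy", "surprise", "anger"])
print(soft)
print(hard_labels(soft))
```

Here "joy" (0.75) and "surprise" (0.5) both become 1 after thresholding, which hides the fact that annotators were evenly split on one of them.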

An Efficient and Effective Architecture for Large-Scale Traffic Prediction via Geometry-Adaptive Square Partitioning

The paper introduces SqLinear, a novel architecture for large-scale traffic prediction that employs a geometry-adaptive algorithm called Square Partition, which creates balanced, non-overlapping spatial regions for sensor data. This approach addresses the limitations of existing partitioning methods by ensuring high-quality data segmentation, and it incorporates a Hierarchical Linear Interaction (HLI) module that replaces traditional attention mechanisms with a linear interaction scheme, resulting in linear computational complexity. Extensive evaluations demonstrate that SqLinear achieves a 2.30% reduction in mean absolute error (MAE) on average and significantly improves training efficiency, making it a valuable advancement for practitioners working with large-scale spatiotemporal models in traffic systems.

arXiv cs.AI34 d agofound 15 d ago#traffic prediction#machine learning

Leveraging LaBSE with Progressive Curriculum Learning for Multicultural Polarization

The article introduces a novel architecture for detecting online polarization in multilingual and multicultural contexts, utilizing LaBSE embeddings to improve cross-lingual learning, resulting in a macro F1 score increase of up to 0.2 in low-resource languages. It also presents an ablation study on various encoder models from the Qwen model family within a retrieval-based prompting framework. This work is significant for practitioners as it addresses the challenge of data scarcity in low-resource languages, enhancing the capabilities of AI systems in understanding and mitigating online polarization.

arXiv cs.CL34 d agofound 13 d ago#polarization#multilingual#laBSE

Study on Quantitative Dynamic Epistemic Logic for Belief Revision

The paper presents a new framework for belief revision called $P*$, which extends Dynamic Epistemic Logic (DEL) with a quantitative approach to epistemic states. It formalizes the AGM postulates within this framework and introduces revision operators with greater expressive power than previous models, specifically addressing shortcomings in existing formalizations. This work is significant for practitioners as it provides a more nuanced understanding of belief dynamics, potentially enhancing decision-making algorithms in AI systems that rely on belief revision mechanisms.

arXiv cs.AI34 d agofound 20 d ago#belief revision#epistemic logic#dynamic logic

Neural Concept Verifier: Scaling Prover-Verifier Games via Concept Encodings

The Neural Concept Verifier (NCV) framework combines Prover-Verifier Games (PVGs) with expressive concept encodings to enhance verifiability in high-dimensional nonlinear classification tasks. By utilizing minimally supervised concept discovery models, NCV extracts interpretable concept encodings from complex inputs and employs a nonlinear predictor for decision-making, outperforming traditional concept-based models and pixel-based PVG classifiers in evaluations. This approach addresses the challenge of verifiability in AI systems, particularly in mitigating shortcut behavior, making it significant for practitioners focused on building interpretable and reliable AI models.

arXiv cs.AI34 d agofound 14 d ago#verifiability#concept encodings#classification

LISE: Listenable Interpretable Speaker Embeddings

The article introduces Listenable Interpretable Speaker Embeddings (LISE), a novel framework that decomposes pretrained speaker embeddings into a small set of interpretable components without requiring additional annotations. LISE maintains competitive automatic speaker verification (ASV) performance, showing negligible degradation in equal error rate (EER) on x-vector and ECAPA-TDNN models. This approach enhances the interpretability of speaker embeddings, evidenced by a listening experiment where participants achieved 83.9% accuracy in distinguishing speakers, which is significant for practitioners seeking to enhance transparency in ASV systems.

arXiv cs.CL34 d agofound 13 d ago#speaker embeddings#interpretability
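
Equal error rate is the operating point where the false-acceptance and false-rejection rates coincide. A simple threshold-sweep sketch is shown below; the score lists are made up, and production toolkits usually interpolate along the ROC curve rather than taking the closest threshold as done here.

```python
def equal_error_rate(genuine, impostor):
    """Approximate EER: sweep thresholds and take the point where FAR and FRR are closest."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < threshold for s in genuine) / len(genuine)     # genuine trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical verification scores (higher means "same speaker")
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine, impostor))
```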

In LLM Reasoning, there is Irrationality on top of Value Misalignment

The paper presents a formalization of "rational value risk," highlighting a gap between the reasoning strategies of aligned LLMs and their optimal rational counterparts. It evaluates various models, including Llama-3.1, GPT-5.2, and GPT-5.5, across multiple benchmarks such as GSM8K and MATH, revealing that while value alignment can mitigate this risk, it cannot fully eradicate it, and that reasoning length impacts rationality with diminishing returns. This work is significant for practitioners as it underscores the complexities of aligning LLMs with target value functions and the importance of inference strategies in maximizing utility.

arXiv cs.AI · 34 d ago · found 20 d ago · #llm #value alignment #rationality

DEMM-Bench: A Cross-Regime Benchmark for Agent-Runtime Governance-Evidence Sufficiency

DEMM-Bench is a newly introduced benchmark aimed at assessing the sufficiency of governance evidence in agent-runtime systems, based on the Decision Evidence Maturity Model (DEMM). It evaluates records across eight evidence regimes to determine their ability to reconstruct decision-level properties, revealing that existing baselines often overstate their sufficiency. This benchmark, which includes a dataset of 64 cases and various evaluation tools, is significant for practitioners as it provides a standardized method for assessing the maturity of decision-making evidence, enhancing the reliability of governance in AI systems.

arXiv cs.AI · 34 d ago · found 20 d ago · #agent governance #benchmark #evidence

CADRE: Stable, Parameter Efficient Adaptation of Medical Vision Language Models with Bounded Forgetting and Prior Drift

The article presents CADRE, a framework for the continual adaptation of medical vision-language models (VLMs) such as BiomedCLIP, aimed at mitigating catastrophic forgetting and prior drift when models are updated for new imaging modalities. CADRE uses a frozen-backbone architecture that combines low-rank adaptation (LoRA) with a similarity-aware elastic weight consolidation mechanism. It reduces forgetting approximately sevenfold and shows superior accuracy and backward transfer across diverse modalities (histopathology, ultrasound, chest radiography) while training only about 0.23% of the parameters. This work is significant for practitioners as it addresses critical safety and reliability concerns in deploying adaptive models in clinical settings, emphasizing stability over traditional accuracy benchmarks.

arXiv cs.AI · 34 d ago · found 20 d ago · #medical #adaptation
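
The two ingredients named in the CADRE item, LoRA on a frozen backbone and an elastic weight consolidation (EWC) penalty, can be sketched generically as below. This is standard LoRA and plain EWC under assumed toy shapes; the paper's similarity-aware weighting is not reproduced, and all function names are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def ewc_penalty(params, anchor, fisher, lam=1.0):
    """Plain EWC: penalise drift from the previous task's parameters,
    weighted by a per-parameter importance estimate (diagonal Fisher)."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


def trainable_fraction(d_in, d_out, rank):
    """Share of parameters that are trainable when one d_out x d_in layer
    is frozen and a rank-`rank` adapter is attached to it."""
    lora = rank * (d_in + d_out)
    return lora / (d_in * d_out + lora)


# Toy example (hypothetical numbers): identity backbone plus a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # rank x d_in
B = [[1.0], [2.0]]      # d_out x rank
y = lora_forward([1.0, 2.0], W, A, B)
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [2.0, 5.0], lam=1.0)
```

In a continual-learning loop the penalty would be added to the task loss so that adapter weights important to earlier modalities move less; how CADRE sets the importance weights is not specified in this summary.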

From Recognition to Understanding: Unlocking Cognitive Time Series Reasoning with LLMs

The article introduces TSCognition, a multimodal benchmark designed for multi-dimensional time series reasoning, comprising approximately 41K QA samples across five cognitive reasoning tasks. It also presents TSAlign, a framework that encodes time series into compact representations and aligns them with LLM embeddings using gated residual injection and multivariate fusion. Experimental results indicate that TSAlign significantly outperforms existing baselines on TSCognition and TimerBed while reducing computational costs, highlighting its potential for enhancing semantic understanding in time series analysis with LLMs.

arXiv cs.CL · 34 d ago · found 13 d ago · #time-series #reasoning #llm

Learning Splitting Heuristics for Parallel String Solvers

The paper presents a data-driven approach for automatically generating splitting heuristics in parallel string solvers, specifically implemented in Z3seq and Z3str4. By framing the selection of splitting atoms as a learning task, the method utilizes features from input formulas and dynamic solver execution data, resulting in improved performance over manually designed heuristics in terms of both the number of solved formulas and average solving time. This advancement is significant for practitioners as it enhances the efficiency of string constraint solving in multi-core environments, addressing the challenges posed by complex and undecidable constraints in real-world applications.

arXiv cs.AI · 34 d ago · found 20 d ago · #string solvers #parallel solving #learning

Unsupervised Disentanglement Without Compromises: How Functional Orthogonality Enforces Identifiability

This paper presents a novel approach to unsupervised disentangled representation learning by introducing a functional orthogonality constraint on the Jacobian of generative mappings. It demonstrates that this constraint ensures identifiability of nonlinear generative models without the need for statistical independence or causal assumptions, validated through experiments with orthogonality-regularized normalizing flows. This work challenges existing limitations in unsupervised disentanglement and offers a new theoretical foundation that could enhance model performance in applications requiring reliable factor recovery.

arXiv cs.AI · 34 d ago · found 15 d ago · #disentanglement #representation #learning

CourseBlueprint: A Structured Pipeline for Adaptive Pedagogical Video Generation Grounded in Course Corpora

CourseBlueprint introduces a structured pipeline for generating adaptive pedagogical videos using a biomedical imaging course corpus, leveraging a single forward pass to create a teaching blueprint based on learner personas. The system employs typed intermediate representations, including a prerequisite concept graph and an engagement generator, to enhance instructional effectiveness, with evaluation metrics indicating significant improvements in engagement and instructional quality when using explicit pedagogical contracts. This framework provides a reusable benchmark corpus and evaluation harness, highlighting the importance of structured instructional design in generative video systems for educational contexts.

arXiv cs.AI · 34 d ago · found 20 d ago · #generative #video #education

Meta-learning ecological priors from large language models explains human learning and decision making

The paper introduces Ecologically Rational Meta-learned Inference (ERMI), a new class of learning algorithms that utilizes large language models to generate ecologically valid cognitive tasks and employs meta-learning to optimize rational models for these environments. ERMI demonstrates superior performance in capturing human behavior across 15 experiments related to function learning, category learning, and decision-making, outperforming traditional cognitive models in trial-by-trial predictions. This framework highlights the potential for large language models to inform and enhance our understanding of human cognition by aligning learning processes with the statistical structures of real-world tasks.

arXiv cs.AI · 34 d ago · found 14 d ago · #meta-learning #human-cognition

Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results

This paper examines the performance discrepancies of large language models (LLMs) across multiple languages, revealing a consistent performance gap even for high-resource languages, attributed to translation errors and evaluation inconsistencies in existing benchmarks. The authors propose a semi-automatic quality assurance method to rectify these issues, demonstrating that addressing data quality can significantly alter conclusions about multilingual capabilities. They also release a corrected version of the multilingual math benchmark dataset (MGSM-Rev2) to facilitate improved evaluation practices in cross-lingual research.

arXiv cs.CL · 34 d ago · found 13 d ago · #llm #multilingual #evaluation

Cross-Architectural Mixture-of-Experts with Adaptive Soft Routing for Plant Leaf Disease Classification

The study introduces an adaptive soft Mixture-of-Experts (MoE) framework for plant leaf disease classification, integrating EfficientNet-B0, DenseNet-121, and Swin-Tiny architectures to leverage multi-scale features. Utilizing a soft gating mechanism for input-dependent expert weights and a two-stage refinement training strategy, the model achieved a recall of 91.68% and an F1-score of 92.62% on a potato leaf disease dataset, outperforming individual models. This approach addresses challenges in class imbalance and representation learning, offering significant improvements for practitioners in precision agriculture and crop health monitoring.

arXiv cs.AI · 34 d ago · found 20 d ago · #plant #classification
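
The soft gating idea in the plant leaf disease item, input-dependent weights that blend several experts instead of picking one, can be sketched as follows. The gate logits and per-expert probabilities are made-up placeholders; in the paper they would come from a learned gate and the three backbones, whose details are not given in this summary.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def soft_moe_predict(gate_logits, expert_probs):
    """Blend per-expert class probabilities with softmax gate weights.

    gate_logits: one logit per expert, produced per input by a gate network.
    expert_probs: expert_probs[k][c] is expert k's probability for class c.
    Returns (gate_weights, mixed_class_probabilities).
    """
    w = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    mixed = [
        sum(w[k] * expert_probs[k][c] for k in range(len(w)))
        for c in range(n_classes)
    ]
    return w, mixed


# Hypothetical outputs of three experts on a two-class problem.
experts = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
weights, mixed = soft_moe_predict([0.0, 0.0, 0.0], experts)
```

With equal gate logits the result is a plain average of the experts; a trained gate would shift weight toward whichever expert suits the input, and every expert still contributes, which is what distinguishes soft routing from hard top-1 routing.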

Physical-AI: From Channel Awareness to Environmental Intelligence in 6G Wireless Networks

The article introduces Physical-AI, a novel architecture designed for 6G wireless networks that enhances environmental intelligence by integrating sensing, modeling, and decision-making with traditional data transmission. It features a self-supervised spatiotemporal radio foundation model that creates a shared latent environmental representation from distributed radio observations, enabling the estimation of environmental properties like blockage and mobility dynamics through multiple inference heads. This approach demonstrates improved performance in reducing outage probability and blockage-response latency, making it significant for practitioners aiming to develop more adaptive and intelligent wireless communication systems.

arXiv cs.AI · 34 d ago · found 20 d ago · #6G #environmental #intelligence

A Reproducible Semantic Benchmark for Multivendor DSM-to-CLI Translation

The paper introduces a reproducible semantic benchmark for translating high-level network intents into multivendor configurations, addressing the challenge of operational correctness in network automation. It evaluates five cloud LLMs across three vendors and five use cases, revealing that semantic quality and operational reliability are independent, with vendor effects significantly influencing outcomes. This benchmark is crucial for practitioners as it enables rigorous comparison of LLM-based network configuration systems, highlighting the importance of repeated-execution metrics in assessing model performance.

arXiv cs.AI · 34 d ago · found 20 d ago · #llm #benchmark #network-automation

PEAR: Permutation-Equivariant Adaptive Routing Multi-Agent Debate

The article introduces Permutation-Equivariant Adaptive Routing Multi-Agent Debate (PEAR), a novel inference-time protocol designed to enhance multi-agent debate systems in large language models (LLMs) by dynamically adjusting communication roles and topologies. PEAR is characterized as an equivariant sparse router, which maintains accuracy despite agent relabeling and reduces routing complexity, leading to improved generalization. Empirical evaluations show that PEAR significantly outperforms existing debate baselines across four reasoning benchmarks and six LLM architectures, making it a valuable tool for practitioners aiming to enhance the reliability and performance of LLMs in multi-agent settings.

arXiv cs.AI · 34 d ago · found 21 d ago · #multi-agent debate #llm #routing
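
The property the PEAR item calls permutation equivariance, meaning that relabeling the agents only relabels the routing decision, can be checked on a toy sparse router. The top-k similarity router below is an illustrative stand-in, not PEAR's actual routing rule, and the agent state vectors are invented.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def route_topk(states, k):
    """Each agent listens to its k most similar peers (by dot product).

    Returns a 0/1 adjacency matrix: adj[i][j] == 1 means agent i reads agent j.
    Note: exact score ties would be broken by index, which would violate
    equivariance; the example data below has no ties.
    """
    n = len(states)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        scores = [(dot(states[i], states[j]), j) for j in range(n) if j != i]
        scores.sort(reverse=True)
        for _, j in scores[:k]:
            adj[i][j] = 1
    return adj


def permute_adj(adj, perm):
    """Relabel rows and columns: new agent i is old agent perm[i]."""
    n = len(adj)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


# Hypothetical agent states and an arbitrary relabeling.
states = [[1.0, 0.0], [0.8, 0.3], [0.1, 1.0], [-0.5, 0.7]]
perm = [2, 0, 3, 1]
adj = route_topk(states, k=2)
adj_after_relabel = route_topk([states[p] for p in perm], k=2)
is_equivariant = adj_after_relabel == permute_adj(adj, perm)
```

Routing on the relabeled agents and relabeling the original routing give the same matrix, so the router's decisions do not depend on agent order. Each row keeps exactly k links, which is the sparsity aspect of the description.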

Mind the Noise: Sensitivity of Transformer-based Interaction-Aware Trajectory Prediction Models to Noisy Data

This paper analyzes the sensitivity of Transformer-based interaction-aware trajectory prediction models to noisy data, highlighting the degradation in prediction accuracy caused by real-world perception uncertainties and localization errors. The study reports that accuracy can degrade by a factor of 1.3 under small noise levels and by up to 3.9 under high noise conditions, emphasizing the need for more realistic training datasets and effective noise mitigation strategies. This is crucial for practitioners in autonomous vehicle development, as it underscores the importance of robust data handling in enhancing model reliability.

arXiv cs.AI · 34 d ago · found 20 d ago · #trajectory prediction #transformers #noise sensitivity

Darwin Mobile Agent: A Roadmap for Self-Evolution

The article introduces the Darwin Mobile Agent, an open-source infrastructure aimed at facilitating autonomous reinforcement learning in complex environments by leveraging a mobile Graphical User Interface (GUI) as a proxy for real-world interactions. It addresses the data-collection bottleneck through an asynchronous agent-environment loop utilizing parallel cloud-phone instances, and outlines a roadmap for eliminating human priors in self-evolving agents across three key areas: task curricula, outcome verification, and memory management. This framework is significant for practitioners as it provides a scalable foundation for developing truly autonomous agents capable of adapting and evolving in dynamic environments.

arXiv cs.AI · 34 d ago · found 21 d ago · #self-evolution #reinforcement learning #agents

What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data

The paper investigates emergent misalignment (EM) in machine learning models, focusing on how training dynamics, model priors, and data influence alignment during fine-tuning. It reveals that while in-domain training loss does not strongly correlate with improved out-of-domain alignment scores, there are predictive signals in activation patterns from pre-trained models that can inform fine-tuning outcomes. This research is significant for practitioners as it highlights the complexities of fine-tuning strategies and the importance of understanding activation shifts to mitigate misalignment in AI models.

arXiv cs.AI · 34 d ago · found 20 d ago · #emergent misalignment #training dynamics #model priors

Gated MLPs as Symmetry-Broken Rank-1 Bilinear Attention

This article presents a theoretical framework that interprets conventional gated MLPs as a rank-1 approximation of a bilinear attention mechanism, highlighting the role of symmetry-breaking in their architecture. The authors demonstrate that applying nonlinearity to one factor of the bilinear attention disrupts exchange symmetry and inverse-scaling symmetry, which may elucidate the practical effectiveness of gated MLPs. This insight could influence future architectural designs in AI by optimizing attention mechanisms for better performance.

arXiv cs.AI · 34 d ago · found 16 d ago · #attention #mlp #architecture
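
The symmetry argument in the gated MLP item can be illustrated on a single hidden unit. A purely bilinear unit (u·x)(v·x) equals the quadratic form xᵀ(u vᵀ)x with a rank-1 matrix, and it is unchanged by swapping u and v or by rescaling them inversely; putting a nonlinearity on one factor removes both invariances. The sketch below uses SiLU as the gate and arbitrary toy vectors. It shows only these algebraic facts, not the paper's attention formulation.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bilinear_unit(x, u, v):
    """Product of two linear projections of the same input."""
    return dot(u, x) * dot(v, x)


def quadratic_form(x, u, v):
    """x^T M x with the rank-1 matrix M = u v^T built explicitly."""
    M = [[ui * vj for vj in v] for ui in u]
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def gated_unit(x, u, v):
    """Gated-MLP style unit: SiLU on the u-branch, linear v-branch."""
    g = dot(u, x)
    return (g / (1.0 + math.exp(-g))) * dot(v, x)


# Arbitrary toy vectors (hypothetical).
x, u, v = [1.0, 2.0], [0.5, -1.0], [2.0, 1.0]
u2 = [2.0 * a for a in u]   # scale u up by 2
v2 = [a / 2.0 for a in v]   # scale v down by 2

same_as_rank1 = abs(bilinear_unit(x, u, v) - quadratic_form(x, u, v)) < 1e-12
swap_invariant = bilinear_unit(x, u, v) == bilinear_unit(x, v, u)
scale_invariant = abs(bilinear_unit(x, u, v) - bilinear_unit(x, u2, v2)) < 1e-12
gate_breaks_swap = abs(gated_unit(x, u, v) - gated_unit(x, v, u)) > 1e-6
gate_breaks_scale = abs(gated_unit(x, u, v) - gated_unit(x, u2, v2)) > 1e-6
```

All five flags evaluate to true for these inputs: the bilinear unit matches the rank-1 form and has both symmetries, while the gated unit has neither.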

On the Identifiability of User Adaptation in Co-Adaptive Neural Interfaces

The paper presents an analysis of identifiability in co-adaptive neural interfaces, revealing that closed-loop encoder estimates fail to uniquely identify user adaptation, as they instead reflect characteristics of the entire human-machine system. It discusses the implications for understanding behavioral adaptation and proposes specific conditions necessary for achieving identification. This work is significant for practitioners as it highlights challenges in accurately interpreting user adaptation in interactive AI systems, which is crucial for improving user experience and system performance.

arXiv cs.AI · 34 d ago · found 21 d ago · #neural interfaces #user adaptation

SCOPE: Evolving Symbolic World for Planning in Open-Ended Environments

SCOPE is a new self-adaptive symbolic planning framework designed for open-ended environments, integrating a Symbolic Execution Simulator (SESim) and a Self-Adaptive Symbolic Memory (SASMem) to enhance planning through iterative refinement of symbolic representations. The framework addresses the challenges of incomplete symbolic worlds by enabling real-time validation and adaptation of action plans, significantly improving planning success rates and adaptability across various tasks. This development is crucial for practitioners as it enhances the robustness of planning systems in dynamic and uncertain environments, facilitating more effective long-horizon task execution.

arXiv cs.AI · 34 d ago · found 20 d ago · #symbolic planning #open-ended environments #VLM