Training — AI news — AI News Digest

I did some model hacks, and got GLM5.2 from about 2.5 tok/s to >50 tok/s on my GH200 system.

The author modified GLM-5.2 to raise throughput from approximately 2.5 tokens/second to over 50 tokens/second on a custom GH200 system equipped with dual Hopper H100 GPUs and dual Grace CPUs. The optimization involved merging the MTP head from zai's GLM-5.2-FP8 repository with CyanKiwi's AWQ quantized version, which required specific adjustments to the vLLM framework. The result suggests that practitioners can obtain large speedups through model-level changes and custom configurations on high-end hardware.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #llm #optimization

SEAL: Searching Expandable Architectures for Incremental Learning

SEAL is a newly introduced framework that integrates Neural Architecture Search (NAS) for data-incremental learning, addressing the challenge of balancing model plasticity and stability. It dynamically expands the model architecture only when necessary, guided by a capacity estimation metric, and employs cross-distillation training to mitigate forgetting. Experimental results show that SEAL improves accuracy and reduces resource usage, making it a promising approach for efficient incremental learning in resource-constrained environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #incremental_learning #NAS #deep_learning

Adaptive Machine Learning Framework for UAV Trajectory Optimization in O-RAN

The article presents an adaptive machine learning framework for optimizing UAV trajectories within the O-RAN architecture, leveraging continual transfer learning. This framework utilizes a library of pre-trained models and a model selection mechanism to enhance efficiency and minimize adaptation time in dynamic environments, achieving a 44% to 56% reduction in convergence time compared to traditional retraining methods. The integration of real-world city maps and ray tracing techniques not only improves learning reliability but also enhances trajectory planning, which is crucial for practitioners developing UAV applications in 6G networks.

arXiv cs.AI | 33 d ago | found 10 d ago | #uav #trajectory-optimization #transfer-learning

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

The paper presents a novel approach to reinforcement learning (RL) focused on achieving broad and persistent model alignment by training on a dataset designed to enhance beneficial traits like truthfulness and fairness across diverse domains such as health and education. The study demonstrates that models trained with this beneficial trait RL outperform compute-matched baselines on over 80% of more than 50 independent out-of-distribution alignment benchmarks, indicating significant alignment transfer and improved robustness against adversarial prompts. This work is crucial for practitioners as it suggests a pathway to develop RL systems that are more resilient to misalignment and better aligned with human values in real-world applications.

arXiv cs.AI | 33 d ago | found 12 d ago | #reinforcement-learning #alignment #beneficial-models

LaGO: Latent Action Guidance for Online Reinforcement Learning

The paper introduces Latent Action Guidance for Online Reinforcement Learning (LaGO), which utilizes a pretrained large language model (LLM) to provide latent action priors that enhance online policy optimization, rather than functioning as a direct controller. Experiments on the CLEVR-Robot and Meta-World benchmarks reveal that LaGO improves average success rates significantly, from 15.1% to 27.2% on CLEVR-Robot and from 2.7% to 15.2% on Meta-World, indicating that leveraging LLMs can effectively enhance planning and decision-making in reinforcement learning contexts. This approach may offer practitioners a more reliable method for integrating LLMs into reinforcement learning frameworks.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement learning #policy optimization #latent action

Matching Tasks to Objectives: Fine-Tuning and Prompt-Tuning Strategies for Encoder-Decoder Pre-trained Language Models

This study introduces the Match Task to Objective (MTO) framework, which optimizes encoder-decoder pre-trained language models by aligning pre-training objectives with specific tasks, enhancing performance in generation and question answering tasks, particularly in commonsense knowledge retrieval. The framework employs automated methods for unsupervised data preparation and novel fine-tuning templates, achieving over 120% performance improvement in few-shot settings compared to conventional methods. The findings provide critical insights for practitioners on model customization and prompt-tuning strategies, with the accompanying code available for implementation.

arXiv cs.AI | 33 d ago | found 10 d ago | #fine-tuning #prompt-tuning #language models

DynaWM: Dynamics-Aware Distillation with World Model and Momentum Targets for Smooth Locomotion over Continuous Stairs

The article introduces DynaWM, a dynamics-aware representation learning framework designed to improve bipedal-wheeled robots' ability to traverse continuous stairs. Key innovations include the incorporation of a world model as a regularizer for enhanced terrain encoding and a momentum target encoder to stabilize knowledge transfer during distillation. Experimental results indicate that DynaWM significantly improves terrain adaptability and motion smoothness, making it relevant for practitioners focused on advancing robotic locomotion in complex environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #representation learning #dynamics-aware #knowledge transfer

Representation Interventions Enable Lifelong Knowledge Memory Control in LLMs

The paper introduces RILKE (Representation Intervention for Lifelong KnowledgE Control), a method designed to enable efficient knowledge updates in large language models (LLMs) without retraining. RILKE employs representation-space interventions to achieve fine-grained control over complex knowledge while keeping base weights frozen, utilizing paraphrase-robust and edit-localized modules to minimize interference during updates. Tested on LLaMA and Qwen models, RILKE demonstrates high edit success and paraphrase generalization across large-scale benchmarks, offering a practical solution for practitioners needing to manage evolving knowledge in LLMs.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #knowledge-control

Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation

The article introduces DigenRL, a disaggregated reinforcement learning framework designed for diffusion-based generative large language models (LLMs). Key innovations include a generation-axis pipeline (GAP) and time-step parallelism (TSP) for enhanced pipelining, an elastic trainer-assisted generation (TAG) approach for dynamic resource allocation, and an asynchronous strategy to optimize pipeline utilization. Experimental results demonstrate that DigenRL achieves throughput improvements of 1.56 to 2.10 times over existing systems like veRL-Omni and GenRL, making it a significant advancement for practitioners working on efficient RL systems in generative AI.

arXiv cs.AI | 33 d ago | found 12 d ago | #reinforcement-learning #llm #diffusion #agents

Impatient Bandits: Optimizing for the Long-Term Without Delay

The paper presents a novel approach to optimizing recommender systems for long-term user satisfaction by addressing the challenge of delayed rewards in a bandit framework. It introduces a predictive model that integrates historical data to estimate delayed rewards and a bandit algorithm that leverages this model to identify content that promotes sustained user engagement. The proposed method shows significant improvements over traditional short-term and delayed reward optimization strategies in a large-scale podcast recommendation system, highlighting its practical applicability for enhancing user experience in real-world applications.

arXiv cs.AI | 33 d ago | found 10 d ago | #bandits #long_term #recommender_systems

An Introduction to Causal Reinforcement Learning

The article introduces the concept of Causal Reinforcement Learning (CRL), which integrates causal inference principles with reinforcement learning (RL) methodologies. It proposes a formalization of environments as structural causal models, allowing for a unified approach to various learning modalities, including online, off-policy, and imitation learning. This integration is significant for practitioners as it opens new avenues for optimizing RL policies by leveraging counterfactual reasoning, enhancing the understanding of agent behavior in complex environments.

arXiv cs.AI | 33 d ago | found 12 d ago | #causal inference #reinforcement learning #counterfactuals

Scaling Laws for Task-Specific LLM Distillation

The paper presents empirical scaling laws for the distillation of task-specific large language models (LLMs), focusing on the trade-offs between in-domain and general knowledge performance as influenced by dataset size, compression ratio, and supervision format. It introduces a blended chain-of-thought supervision loss to enhance distillation stability and compares logit-based and LoRA-based approaches under iterative structural pruning, revealing that supervision format significantly impacts performance retention during compression. The authors release the FinHeadlineMix dataset and provide practical guidelines, offering a framework for practitioners to make informed decisions on domain-specific LLM compression strategies.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #distillation #scaling laws

Variational Model Merging for Pareto Front Estimation in Multitask Finetuning

The article introduces a new Bayesian approach called Variational Model Merging, aimed at enhancing the quality of Pareto front estimates in multitask finetuning by using flexible non-Gaussian posteriors. This method builds on existing model-merging techniques and demonstrates that utilizing more complex posterior distributions leads to superior estimates of Pareto fronts, validated through empirical results on vision and language transformers. This advancement is significant for practitioners as it provides a more efficient way to determine optimal task-mixing strategies, potentially reducing computational costs associated with Pareto front estimation.

arXiv cs.AI | 33 d ago | found 10 d ago | #finetuning #pareto_fronts #model_merging

Minimisation of Quasar-Convex Functions Using Random Zeroth-Order Oracles

This paper presents a random Gaussian smoothing zeroth-order (ZO) algorithm for minimizing quasar-convex (QC) and strongly quasar-convex (SQC) functions, establishing convergence and complexity bounds for both unconstrained and constrained scenarios. It introduces the concept of proximal-quasar-convexity for constrained optimization and shows that the algorithm can converge to a controlled neighborhood of the global minimum. These findings have practical implications for machine learning applications, particularly in areas like linear dynamical system identification and generalized linear models, where quasar-convexity is relevant.

arXiv cs.AI | 33 d ago | found 10 d ago | #optimization #quasar-convexity #machine_learning
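
To make the Gaussian-smoothing idea in the item above concrete, here is a minimal sketch of a one-sample zeroth-order gradient estimate driving plain gradient descent. It is not the paper's algorithm: the step size, smoothing radius `mu`, iteration count, and the test function are all assumptions chosen for illustration, and the constrained (proximal) case is not covered. The test function is a convex quadratic, which is a simple special case of quasar-convexity.

```python
import numpy as np

def zo_gradient(f, x, mu, rng):
    """One-sample Gaussian-smoothing estimate of the gradient of f at x.

    Uses only two function evaluations and no derivatives.
    """
    u = rng.standard_normal(x.shape)          # random Gaussian direction
    return (f(x + mu * u) - f(x)) / mu * u    # finite difference along u

def zo_minimize(f, x0, step=0.01, mu=1e-3, iters=5000, seed=0):
    """Descend along zeroth-order gradient estimates (illustrative settings)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        x -= step * zo_gradient(f, x, mu, rng)
    return x

def quadratic(x):
    """Convex test function with its minimum at the all-ones vector."""
    return float(np.sum((x - 1.0) ** 2))

x_final = zo_minimize(quadratic, np.zeros(5))
print(quadratic(x_final))  # small, but not exactly zero
```

Because `mu` is fixed, the iterates settle in a small neighborhood of the minimizer rather than at it, which mirrors the "controlled neighborhood" wording in the summary.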

Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training

The paper introduces BBoxER, an evolutionary black-box optimization method for post-training large language models (LLMs) that avoids gradient exposure, addressing privacy and security concerns. BBoxER employs an information bottleneck and provides non-vacuous generalization bounds, demonstrating improved performance on reasoning benchmarks and robustness against membership inference and data poisoning attacks. This method offers a viable alternative for practitioners seeking to enhance LLM training in sensitive environments while ensuring strong theoretical guarantees.

arXiv cs.AI | 33 d ago | found 10 d ago | #black_box #LLM #post_training

Fast and Slow Variational Continual Learning

The paper introduces the Continual IVON (CoVON) optimizer, which integrates fast and slow adaptation mechanisms into the Variational Continual Learning (VCL) framework to enhance continual learning in deep networks. By merging past posteriors to create a stable prior for fast-weight updates, CoVON demonstrates superior performance over existing VCL optimizers and traditional weight-regularization methods in domain-incremental learning and fine-tuning of large language models. This advancement is significant for practitioners as it provides a more effective optimization strategy for maintaining model performance during continual learning scenarios.

arXiv cs.AI | 33 d ago | found 10 d ago | #continual_learning #optimization

OpenThoughts-Agent: Data Recipes for Agentic Models

The OpenThoughts-Agent (OT-Agent) project has introduced a comprehensive data curation pipeline aimed at enhancing the training of agentic language models. By conducting over 100 controlled ablation experiments, the team assembled a dataset of 100,000 examples, fine-tuning the Qwen3-32B model, which achieved an average accuracy of 44.8% across seven benchmarks, surpassing the previous best open model, Nemotron-Terminal-32B, by 3.9 percentage points. This release, including the training sets and experimental data, provides valuable resources for practitioners aiming to develop more capable and generalizable agentic models.

arXiv cs.AI | 33 d ago | found 10 d ago | #agentic models #data curation #fine-tuning

ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation

The paper introduces ARIA, a framework for adaptive region-based importance allocation in the distillation of conditional diffusion models, addressing the challenge of transferring knowledge from a large teacher model to a smaller student model. ARIA enhances training efficiency by dynamically focusing on regions of the conditioning space where alignment between teacher and student is poor, leading to improved performance, particularly in unseen and underrepresented conditions. This approach offers a solution to the bottleneck of limited paired image-condition data, making it relevant for practitioners dealing with large conditioning corpora in model distillation.

arXiv cs.AI | 33 d ago | found 10 d ago | #knowledge_distillation #conditional_diffusion

EMAgnet: Parameter-Space EMA Regularization for Policy Gradient Self-Play in Large Games

The article introduces EMAgnet, a novel regularization method for policy gradient self-play in large games, which uses an exponential moving average (EMA) of the last-iterate policy's parameters as a dynamic target for regularization. This approach adapts to the agent's evolving strategy, leading to improved performance over traditional uniform distribution targets, particularly in two-player zero-sum games with exploration challenges and dominated strategies. EMAgnet demonstrates lower exploitability compared to PPO with uniform regularization, making it a significant advancement for practitioners working on reinforcement learning in complex game environments.

arXiv cs.AI | 33 d ago | found 10 d ago | #policy_gradient #self_play
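
The EMA target described in the EMAgnet item can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the policy is a toy tabular softmax whose parameters are its logits, the penalty is taken as KL from the current policy to the EMA policy (the paper's exact divergence and direction are not given in the summary), and `decay` and `coef` are hypothetical values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def ema_update(ema_params, params, decay=0.99):
    """Move the EMA copy a small step toward the latest parameters."""
    return decay * ema_params + (1.0 - decay) * params

def magnet_penalty(params, ema_params):
    """KL(pi_params || pi_ema) for a toy softmax policy (assumed penalty form)."""
    p, q = softmax(params), softmax(ema_params)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def regularized_loss(pg_loss, params, ema_params, coef=0.1):
    """Policy-gradient loss plus a pull toward the slowly moving EMA policy."""
    return pg_loss + coef * magnet_penalty(params, ema_params)
```

Unlike a fixed uniform target, the EMA copy drifts with the agent, so the penalty vanishes when the policy stops changing and grows only when the latest iterate moves away from its own recent history.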

Task Decomposition for Efficient Annotation

The article introduces a method for task decomposition in structured annotation to enhance efficiency and reduce the inferential load on annotators. It presents a formal model based on centering theory to identify salient anchor entities, allowing for the effective breakdown of complex annotation tasks into manageable sub-tasks. This approach not only improves cost-efficiency but also optimizes the allocation of sub-tasks among heterogeneous annotators, which is crucial for practitioners aiming to streamline annotation processes in large-scale AI projects.

arXiv cs.AI | 33 d ago | found 10 d ago | #annotation #structured-data #efficiency

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

The paper introduces "spec learning," a framework designed to align large language models (LLMs) with user preferences at inference time without requiring parameter updates. It utilizes brief user instructions and a small set of preference judgments to create natural-language prompts that condition LLM behavior, demonstrating improved performance over direct preference optimization (DPO) on specialized datasets. This approach enhances interpretability and transparency in model responses, making it a valuable tool for practitioners seeking efficient and effective model steering methods.

arXiv cs.AI | 33 d ago | found 10 d ago | #spec_learning #llm

FlowPipe: LLM-Enhanced Conditional Generative Flow Networks for Data Preparation Pipeline Construction

FlowPipe introduces a novel framework for constructing data preparation pipelines using Conditional Generative Flow Networks (C-GFlowNets) and a Trajectory Balance objective to improve long-horizon credit assignment and exploration efficiency. By integrating Deep Semantic Modulation via Feature-wise Linear Modulation (FiLM), it allows for better conditioning of the pipeline decisions based on dataset semantics. Evaluated on 74 real-world datasets, FlowPipe demonstrates an average accuracy improvement of 11.96% and a 12.5x increase in training convergence speed compared to state-of-the-art methods, making it a significant advancement for practitioners in automated data pipeline construction.

arXiv cs.AI | 33 d ago | found 10 d ago | #data-preparation #pipeline #ml
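
Two building blocks named in the FlowPipe item, FiLM conditioning and the Trajectory Balance objective, have standard textbook forms that can be sketched briefly. The code below shows those generic forms only; how FlowPipe wires them together is not described in the summary, and all function and parameter names here are hypothetical.

```python
import numpy as np

def film(features, condition, w_gamma, w_beta):
    """Feature-wise Linear Modulation: scale and shift features per channel.

    gamma and beta are produced from the conditioning vector by linear maps.
    """
    gamma = condition @ w_gamma
    beta = condition @ w_beta
    return gamma * features + beta

def trajectory_balance_loss(log_z, log_pf, log_pb, log_reward):
    """Squared Trajectory Balance residual for one complete trajectory.

    log_pf / log_pb hold the per-step forward / backward log-probabilities.
    """
    residual = log_z + np.sum(log_pf) - log_reward - np.sum(log_pb)
    return float(residual ** 2)
```

The Trajectory Balance residual is zero exactly when the forward flow of a trajectory, scaled by the learned partition estimate, matches its reward times the backward flow. Because it is a single trajectory-level constraint, it assigns credit over the whole sequence of pipeline decisions at once.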

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

The paper compares multilingual and English-only encoders as starting points for adapting streaming speech recognition models to new languages, using a 0.6B-parameter FastConformer transducer across eight European languages. The advantage of multilingual initialization shrinks as target-language data increases and becomes negligible at 2500 hours, while streaming latency does not significantly affect performance. In addition, 4-bit weight-only quantization reduces model size by approximately three times with a minimal increase in word error rate. The results give practical guidance for low-data scenarios and suggest that latency and quantization can be decided independently.

arXiv cs.AI | 33 d ago | found 12 d ago | #speech recognition #multilingual #data scale

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

The paper introduces Strategy-Guided Policy Optimization (SGPO), a framework that enhances reasoning in language models by distilling reusable strategies instead of imitating specific solution trajectories. SGPO employs a token-level forward-KL objective to transfer strategic guidance into unguided policies and uses adaptive instance-level weighting to tune the distillation process to model competence. In experiments, SGPO outperforms supervised fine-tuning and reinforcement learning baselines, with an average improvement of 2.2 points on the Qwen2.5-7B-Instruct model across four mathematical benchmarks.

arXiv cs.AI | 33 d ago | found 12 d ago | #policy optimization #LLM #reasoning #strategy-guided

Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation

The article presents a novel multi-objective reinforcement learning framework called Semantic Pareto-DQN for improving recommender systems. By formalizing the recommendation process as a semantic multi-objective Markov decision process, it employs a Pareto-DQN agent that optimizes for engagement, diversity, and fairness without aggregating these objectives into a single reward signal. Empirical results from the MovieLens dataset demonstrate that this approach enhances societal objectives while maintaining user engagement, offering a significant advancement for practitioners aiming to build responsible AI systems that mitigate filter bubbles.

arXiv cs.AI | 33 d ago | found 12 d ago | #reinforcement learning #recommendation #multi-objective #DQN

Co-occurring associated retained concepts in Diffusion Unlearning

The article introduces ReCARE (Robust erasure for CARE), a novel framework designed to enhance unlearning in diffusion models by preserving co-occurring associated retained concepts (CARE) while effectively erasing target concepts. It defines the CARE score as a metric for quantifying the preservation of these concepts and presents extensive experimental results demonstrating that ReCARE achieves state-of-the-art performance in maintaining utility and concept erasure across various targets, including nudity and artistic styles. This advancement is significant for practitioners as it addresses the challenge of harmful content generation without compromising the generation of benign associated concepts.

arXiv cs.AI | 33 d ago | found 10 d ago | #unlearning #diffusion models #content generation

Breaking Shortcut Learning for Cross-Trial EEG-Guided Target Speech Extraction via Two-Stage Training

The article introduces TRUST-TSE, a two-stage framework designed to improve generalization in EEG-guided target speech extraction by mitigating shortcut learning associated with trial-specific EEG structures. It employs contrastive pretraining with negative sampling and a confidence-weighted extraction objective to enhance EEG-speech alignment and suppress trial-identity cues. Experimental results on KUL and DTU datasets demonstrate that TRUST-TSE significantly outperforms existing end-to-end models under cross-trial evaluation, offering a more reliable solution for neuro-steered hearing technologies.

arXiv cs.AI | 33 d ago | found 10 d ago | #shortcut learning #speech extraction #neuro-steered

Blockwise Policy-Drift Gating for On-Policy Distillation

The paper introduces blockwise policy-drift gating, a method designed to enhance on-policy distillation (OPD) for long-horizon reasoning tasks by implementing a lightweight drift controller that operates solely on the student policy. This approach computes log-probability shifts between the behavior and current student policies over fixed blocks, improving the mean pass@8 metric from 0.4978 to 0.5160 in a six-variant Qwen3 math reasoning benchmark, suggesting that block-level gating can effectively stabilize performance in OPD scenarios. This advancement is significant for practitioners as it offers a straightforward mechanism to improve robustness in model training without altering teacher targets or rollout policies.

arXiv cs.AI | 33 d ago | found 10 d ago | #policy distillation #on-policy #reinforcement learning
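
The block-level drift computation described in the item above might look roughly like this. It is a sketch under assumptions: drift is measured as the mean absolute log-probability shift within each fixed-size block, and gating is a hard 0/1 mask with a hypothetical `threshold`. The summary does not specify the paper's actual drift statistic or gating rule.

```python
import numpy as np

def block_drift(logp_behavior, logp_current, block_size):
    """Mean absolute log-probability shift per fixed-size block of tokens."""
    shift = np.abs(np.asarray(logp_current, dtype=float)
                   - np.asarray(logp_behavior, dtype=float))
    starts = range(0, len(shift), block_size)
    return np.array([shift[s:s + block_size].mean() for s in starts])

def block_gate(logp_behavior, logp_current, block_size, threshold):
    """Per-token 0/1 mask; tokens in blocks whose drift exceeds threshold get 0."""
    drift = block_drift(logp_behavior, logp_current, block_size)
    keep = (drift <= threshold).astype(float)
    # Expand the block-level decision back to token level; trim the last block.
    return np.repeat(keep, block_size)[:len(logp_behavior)]
```

The resulting mask would multiply the per-token distillation loss, so blocks where the current student has moved far from the policy that generated the rollout contribute nothing. Only student log-probabilities are needed, which matches the summary's point that teacher targets and rollout policies stay untouched.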

Weight-Space Geometry of Offline Reasoning Training

The paper analyzes six offline training methods for reasoning (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) using a shared model (Qwen3-4B) and attention-only LoRA to compare their weight-space geometry and performance on downstream tasks. The findings indicate that SFT, RFT, and RIFT yield similar weight updates and comparable accuracy (87-88% on GSM8K), while DPO achieves the highest accuracy (93.5%) but requires a significantly smaller learning rate, suggesting that optimizer and loss function choices critically influence performance. The analysis sheds light on the mechanistic differences between these methods, which can help practitioners choose among offline training strategies.

arXiv cs.AI | 33 d ago | found 10 d ago | #reinforcement learning #offline training #reasoning

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

The release includes two new quantized models, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf and Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, optimized for 16GB VRAM on NVIDIA GPUs. Both models maintain a perplexity (PPL) score around 7.41, demonstrating similar performance, while the second model leverages the Trellis algorithm for quantization, applied selectively to tensors with Gaussian distributions. These advancements are significant for practitioners focused on optimizing LLMs for resource-constrained environments, particularly in coding tasks.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #qwen #quantization #nvidia

Is it possible to run a giant model like GLM5.2 on this cluster (4x servers with 512GB RAM + dual AMD Epyc)? 16 channel memory should hit 409GB/s per node.

The discussion concerns the feasibility of running the 467GB Unsloth 4-bit GLM 5.2 model on a cluster of four Dell C6525 servers, each equipped with dual AMD EPYC 7702 processors and 512GB of DDR4 RAM, for a total of 2TB of RAM and a theoretical memory bandwidth of 409.6 GB/s per node. Despite the absence of GPUs, the user is weighing two options: maximizing token throughput, or clustering the servers to accommodate larger models. The scenario illustrates the appeal of CPU-only deployment for large models and the central role of memory bandwidth at this scale.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #hardware #model #cluster
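
The per-node bandwidth figure quoted above can be reproduced with a short back-of-the-envelope calculation. The sketch assumes DDR4-3200 memory and a 64-bit (8-byte) bus per channel; neither is stated in the post, though the 409.6 GB/s result is consistent with them. The decode ceiling helper is a rough bandwidth-bound estimate, and `gb_read_per_token` is a placeholder the reader must supply.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0

def decode_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on decode speed when every token must stream
    gb_read_per_token gigabytes of weights from memory."""
    return bandwidth_gb_s / gb_read_per_token

# Assumed: 16 channels of DDR4-3200 per dual-socket node.
node_bw = peak_bandwidth_gb_s(channels=16, mega_transfers_per_s=3200)
print(node_bw)  # 409.6
```

If all 467 GB of weights had to be read for every token, `decode_ceiling_tok_s(node_bw, 467)` would come out below one token per second on a single node. A mixture-of-experts model reads only its active experts per token, so the real ceiling depends on the active-parameter footprint, which the post does not give. Sustained bandwidth on real hardware is also lower than the theoretical peak.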

I benchmarked 8 LLMs for medical scribing. Hallucinations were rare; omissions need attention.

A benchmark study evaluated eight large language models (LLMs) for medical scribing using 300 synthetic doctor-patient dialogues, focusing on their ability to generate SOAP notes. The results revealed 12 confirmed high-impact hallucinations and 520 instances of omitted clinically relevant safety facts, indicating that omissions are a more significant issue than hallucinations. Notable performers included GPT-5.4-mini for cost and speed, while DeepSeek showed promise in prose quality but had many omissions, suggesting that integrating a safety layer with lower-cost models could enhance their clinical utility.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #benchmark #medical #scribing

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

The analysis presents a detailed mapping of Kullback-Leibler Divergence (KLD) for key-value (KV) cache quantization in the Qwen3.6-35B-A3B and Gemma4-E2B models. It reveals that quantization levels q8/q8 are nearly lossless for both models, while q4/q4 is effective for Qwen but detrimental for Gemma. Additionally, turbo quantization methods allow for significant cache compression, albeit with performance trade-offs. This information is crucial for practitioners optimizing LLMs, particularly in balancing model performance and resource efficiency during deployment.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #kv-cache #quantization #qwen
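
For readers unfamiliar with the metric, a KLD comparison of this kind is typically computed from next-token logits produced with and without the quantized cache. The following is a generic sketch, not the post author's script: it assumes two arrays of shape (tokens, vocab) that have already been collected, one from the unquantized baseline and one from the quantized-cache run.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise, numerically stable log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(baseline_logits, quantized_logits):
    """Mean over tokens of KL(baseline || quantized) on next-token distributions."""
    log_p = log_softmax(np.asarray(baseline_logits, dtype=float))
    log_q = log_softmax(np.asarray(quantized_logits, dtype=float))
    per_token = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(per_token.mean())
```

A value near zero means the quantized cache leaves the output distribution essentially unchanged, which is the sense in which the post calls q8/q8 nearly lossless.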

Elevated error rate across multiple models

The article discusses an observed elevated error rate across various AI models, highlighting potential issues in model robustness and performance consistency. While specific models and benchmarks are not detailed, this trend raises concerns for practitioners regarding the reliability of model outputs in production environments. Addressing these elevated error rates is crucial for enhancing the stability and trustworthiness of AI systems.

Hacker News | 33 d ago | found 12 d ago | #error-rate #models

Prime Intellect Releases prime-rl 0.6.0 to Train Trillion-Parameter MoE Models on Agentic RL Workloads

Prime Intellect has released prime-rl 0.6.0, an open framework designed for asynchronous reinforcement learning on trillion-parameter Mixture-of-Experts (MoE) models. The framework successfully trained the GLM-5 model on SWE tasks with a maximum sequence length of 131k, achieving sub-5-minute step times and utilizing 256 rollouts across 28 H200 nodes. Key optimizations include FP8 inference, Wide Expert Parallelism, and various forms of parallelism (FSDP, EP, CP), which enhance both training efficiency and model performance, making it a significant tool for practitioners working with large-scale RL applications.

MarkTechPost | 34 d ago | found 21 d ago | #prime_intellect #reinforcement_learning #moe

Scaling Small Agents Through Strategy Auctions

This article introduces the Strategy Auctions for Workload Efficiency (SALE) framework, which enhances the performance of small language models in agentic AI tasks by employing a bidding system for strategic plans. SALE demonstrates a 52% reduction in reliance on larger models and a 35% cost reduction across deep search and coding tasks, while improving performance metrics with minimal overhead. This approach highlights the potential for efficient coordination among smaller agents, suggesting that performance improvements in agentic AI may stem more from sophisticated task allocation strategies than from simply scaling model size.

arXiv cs.AI | 34 d ago | found 14 d ago | #agents #performance #workload

Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers

The paper introduces HyperAdapter, a novel hypergraph-based adapter architecture for parameter-efficient fine-tuning of vision transformers (ViTs). By adapting in hyperedge space rather than token space, HyperAdapter leverages structured relationships among tokens through soft token routing, resulting in improved feature refinement and performance across various visual benchmarks. This approach demonstrates that the adaptation space significantly impacts the effectiveness of parameter-efficient transfer methods, particularly for tasks necessitating structured reasoning.

arXiv cs.AI | 34 d ago | found 14 d ago | #fine-tuning #vision-transformers #parameter-efficient

A Robust Framework for Secure Cardiovascular Risk Prediction: An Architectural Case Study of Differentially Private Federated Learning

The paper introduces FedCVR, a Federated Learning framework for secure cardiovascular risk prediction across heterogeneous clinical networks, built around Differential Privacy (DP). It shows that using server-side momentum as a temporal denoiser lets the model reach an F1 score of 0.78 and an AUC of 0.96 under a privacy budget of ε ≈ 13.4. The work highlights the importance of server-side adaptivity in recovering clinical utility under privacy constraints and offers a validated framework for multi-institutional collaboration in AI-driven healthcare.

arXiv cs.AI | 34 d ago | found 14 d ago | #federated learning #privacy #AI models
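
The idea of server-side momentum acting as a denoiser can be illustrated with a generic differentially private aggregation round. This is a sketch of a common pattern (clip, average, add Gaussian noise, apply momentum), not FedCVR's actual procedure: where the noise is injected, how it is calibrated to the privacy budget, and every hyperparameter below are assumptions.

```python
import numpy as np

def clip(update, max_norm):
    """Scale an update down so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(update)
    return update * min(1.0, max_norm / (norm + 1e-12))

def server_round(weights, velocity, client_updates,
                 max_norm, noise_std, beta, lr, rng):
    """One aggregation round with clipping, Gaussian noise, and server momentum."""
    clipped = [clip(u, max_norm) for u in client_updates]
    avg = np.mean(clipped, axis=0)
    noisy = avg + rng.normal(0.0, noise_std, size=avg.shape)
    # Momentum averages noisy aggregates over rounds, damping per-round noise.
    velocity = beta * velocity + noisy
    return weights + lr * velocity, velocity
```

Because the injected noise is independent across rounds while the true update direction tends to persist, accumulating aggregates in `velocity` raises the signal-to-noise ratio, which is the intuition behind calling momentum a temporal denoiser.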

Active Causal Experimentalist (ACE): Learning Intervention Strategies via Direct Preference Optimization

The Active Causal Experimentalist (ACE) framework introduces a novel approach to learning intervention strategies through Direct Preference Optimization, allowing for adaptive experimental design as a sequential policy. ACE demonstrates a 70-71% improvement over traditional methods across various benchmarks, utilizing pairwise intervention comparisons rather than relying on absolute reward magnitudes, thereby addressing the instability of value-based reinforcement learning. This advancement is significant for practitioners as it enables the autonomous discovery of effective experimental strategies, enhancing the efficiency and effectiveness of causal inference in complex domains.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #intervention #policy

Over-the-Air Federated Learning: Rethinking Edge AI Through Signal Processing

The article presents Over-the-Air Federated Learning (AirFL), a novel approach that integrates wireless signal processing with distributed machine learning to enhance AI scalability at the edge. AirFL utilizes wireless superposition to aggregate local model updates into an analog signal, significantly reducing communication latency, bandwidth, and energy consumption. The paper categorizes existing AirFL schemes into three classes—CSIT-aware, blind, and weighted—while discussing their performance trade-offs, complexities, and potential applications in practical wireless edge-AI systems, providing insights for practitioners in optimizing federated learning in constrained environments.

arXiv cs.AI | 34 d ago | found 14 d ago | #federated-learning #edge-ai
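
The wireless-superposition idea in the AirFL summary can be illustrated with a toy simulation. The sketch assumes real-valued channel gains that each device knows perfectly and inverts before transmitting (one simple reading of the "CSIT-aware" class); the function name, the channel-inversion choice, and the noise model are all assumptions, not details from the article.

```python
import random


def over_the_air_average(updates, gains, noise_std=0.0, rng=random):
    """Average device updates through a simulated superposition channel."""
    k = len(updates)
    dim = len(updates[0])
    # Pre-equalisation: each device divides by its own known channel gain.
    transmitted = [[x / h for x in u] for u, h in zip(updates, gains)]
    received = []
    for i in range(dim):
        # The channel scales each signal by its gain and sums them in the air.
        superposed = sum(h * tx[i] for tx, h in zip(transmitted, gains))
        # The receiver sees the sum plus additive noise, never the individual terms.
        received.append(superposed + rng.gauss(0.0, noise_std))
    return [y / k for y in received]
```

All devices transmit at once, so the receiver obtains the aggregate in a single channel use per coordinate regardless of how many devices participate, which is where the latency and bandwidth savings come from.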

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

The article presents Trajectory Matching Policy Optimization (TMPO), a novel approach to align diffusion models with downstream tasks, addressing issues of reward hacking and mode collapse in reinforcement learning. TMPO replaces traditional scalar reward maximization with a Softmax Trajectory Balance objective, enabling trajectory-level reward distribution matching to enhance generative diversity by 9.1% compared to existing methods. Additionally, it employs Dynamic Stochastic Tree Sampling to optimize training efficiency by reducing redundant computations, making it a significant advancement for practitioners seeking to improve generative model performance and diversity.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #diffusion-models #policy-optimization

The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

The paper introduces the Optimal Token Baseline (OTB) for improving Reinforcement Learning (RL) in Large Language Models (LLMs) by addressing the issue of exploding gradient variance in long-horizon tasks. The OTB is derived from first principles, proposing a method where gradient updates are inversely weighted by their cumulative gradient norm, and utilizes a Logit-Gradient Proxy to efficiently estimate this norm with only forward-pass probabilities. This approach enhances training stability and reduces token consumption by over 65% while achieving comparable performance to larger group sizes, making it significant for practitioners seeking efficient RL training methods in LLM applications.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #variance #baseline
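
The OTB summary mentions estimating a gradient norm from forward-pass probabilities alone. One quantity with exactly that property is the squared norm of the log-probability gradient with respect to the logits of a softmax, which has the closed form 1 - 2·p_token + Σ p_j². Whether this matches the paper's Logit-Gradient Proxy is an assumption; the sketch only verifies the identity itself, and the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_sq_norm(probs, token):
    """Closed form: ||d log p(token) / d logits||^2 from probabilities only."""
    return 1.0 - 2.0 * probs[token] + sum(p * p for p in probs)


def explicit_sq_norm(probs, token):
    """Same quantity from the explicit gradient (one_hot(token) - probs)."""
    return sum(((1.0 if j == token else 0.0) - p) ** 2
               for j, p in enumerate(probs))
```

The closed form needs no backward pass, so it can be evaluated per token during sampling at negligible cost.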

Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models using Reinforcement Learning from Ranking Feedback

Oracle-RLAIF is a new fine-tuning framework for multi-modal video models that enhances video comprehension by employing reinforcement learning from ranking feedback instead of traditional supervised fine-tuning. The framework introduces an Oracle ranker that ranks model responses, replacing the costly human feedback process, and utilizes a novel rank-based loss function, $GRPO_{rank}$, for optimizing ordinal feedback. This approach demonstrates improved performance over existing video-language models across various benchmarks, offering a more cost-effective and flexible method for aligning large-scale models.

arXiv cs.AI | 34 d ago | found 14 d ago | #fine-tuning #video-models

Fine-Tuning Large Language Models for Quantum Reasoning

The study presents two fine-tuning pipelines for large language models (LLMs) aimed at enhancing quantum reasoning capabilities. The first pipeline, Supervised Fine-Tuning (SFT), achieves near-perfect accuracy in predicting measurement probability distributions from quantum circuit simulations, outperforming the base model and GPT-OSS-120B. The second approach, SFT combined with Group Relative Policy Optimisation (GRPO), improves generalization to larger qubit systems, indicating that targeted fine-tuning on explicit reasoning traces is a viable method for developing LLMs capable of sophisticated quantum reasoning, which is crucial for applications in quantum computing.

arXiv cs.AI | 34 d ago | found 16 d ago | #quantum reasoning #fine-tuning #llm

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

The paper introduces Gradient Alignment Control (GAC), a method designed to stabilize asynchronous reinforcement learning (RL) training for large language models (LLMs) by addressing the instability caused by high cosine similarity in policy gradients during asynchronous updates. GAC employs gradient projection to regulate training dynamics, ensuring convergence even under conditions of bounded staleness. This approach allows practitioners to leverage asynchronous execution in RL without sacrificing training stability, potentially enhancing the efficiency of LLM training processes.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement learning #asynchronous #gradient alignment

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

The paper introduces Uncertainty-Balanced Preference Planning (UBP2), a model-based approach for preference-based reinforcement learning that improves sample efficiency by actively directing exploration through uncertainty reasoning in reward, dynamics, and value functions. UBP2 employs ensembles to evaluate candidate trajectories based on a unified score that incorporates expected reward and epistemic uncertainty, achieving sublinear regret guarantees in both finite and infinite horizons. Empirical results demonstrate that UBP2 significantly outperforms existing model-free and non-optimistic model-based methods on the Meta-World benchmark, making it a valuable tool for practitioners focused on efficient exploration in reinforcement learning tasks.

arXiv cs.AI | 34 d ago | found 13 d ago | #reinforcement learning #preference-based #planning
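
The UBP2 summary describes scoring candidate trajectories with a unified score built from expected reward and epistemic uncertainty estimated by ensembles. A common way to realise that idea is an optimistic score: the ensemble mean plus a bonus proportional to ensemble disagreement. The sketch below assumes that form; the function names, the single `bonus` weight, and the use of the population standard deviation are illustrative choices, and the paper combines uncertainty from reward, dynamics, and value models in ways not shown here.

```python
from statistics import mean, pstdev


def optimistic_score(ensemble_returns, bonus=1.0):
    """Mean predicted return plus a bonus for ensemble disagreement."""
    return mean(ensemble_returns) + bonus * pstdev(ensemble_returns)


def pick_trajectory(candidates, bonus=1.0):
    """Return the index of the candidate with the highest score.

    Each candidate is a list of return predictions, one per ensemble member.
    """
    scores = [optimistic_score(r, bonus) for r in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Setting `bonus=0.0` recovers purely greedy selection on the ensemble mean, which gives a simple baseline for checking how much the uncertainty term changes exploration.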

Backpropagating Through Simulation: Analytic Policy Gradients for Sample and Learning Efficient Differentiable Continuous Control

The article introduces Analytic Policy Gradients (APG), a method that allows for exact gradient computation in model-free reinforcement learning by backpropagating through differentiable environment dynamics, in contrast to Proximal Policy Optimization (PPO), which relies on high-variance sampled rewards. APG was evaluated on four continuous control tasks under a multi-axis evaluation protocol that separates performance metrics by environment steps and gradient steps, and it showed improved sample efficiency. This approach is significant for practitioners as it enhances learning efficiency in complex environments, potentially reducing the number of interactions needed for effective policy training.

arXiv cs.AI | 34 d ago | found 16 d ago | #reinforcement learning #policy gradients #sample efficiency
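
To show what an exact gradient through differentiable dynamics looks like, as described in the APG summary, here is a toy sketch on a one-dimensional system with a linear policy. Everything in it is an assumption for illustration: the dynamics `x' = x + dt * u`, the policy `u = -k * x`, the squared-state cost, and the function name. For brevity it carries derivatives forward alongside the rollout (forward-mode sensitivity) rather than using reverse-mode backpropagation; both yield the same exact derivative.

```python
def rollout_loss_and_grad(k, x0=1.0, dt=0.1, steps=20):
    """Return the summed squared state and its exact derivative w.r.t. gain k."""
    x, dx_dk = x0, 0.0          # state and its sensitivity to k
    loss, dloss_dk = 0.0, 0.0
    for _ in range(steps):
        u = -k * x                          # linear policy
        du_dk = -x - k * dx_dk              # product rule on -k * x
        # Differentiable dynamics: step the state and its sensitivity together.
        x, dx_dk = x + dt * u, dx_dk + dt * du_dk
        loss += x * x
        dloss_dk += 2.0 * x * dx_dk         # chain rule on x^2
    return loss, dloss_dk
```

Because the derivative is computed from the dynamics rather than estimated from sampled returns, a single rollout gives a noise-free gradient, which is the property the summary contrasts with PPO.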

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning for Autonomous Driving

The article presents FAST, a framework designed to enhance sampling efficiency in parallel reinforcement learning for autonomous driving. FAST introduces Dynamic Parallel Sampling Alignment (DPSA) to address the straggler effect by extending terminated episodes and implementing global truncation based on termination rates, which allows for improved sample utilization without re-initialization delays. Empirical results show that FAST achieves a wall-clock speedup of at least 1.78x over single-clip baselines while remaining statistically unbiased, making it a significant advancement for practitioners focused on optimizing reinforcement learning processes in autonomous systems.

arXiv cs.AI | 34 d ago | found 16 d ago | #reinforcement learning #autonomous driving #sampling efficiency

Where Does the Signal Live? A Web Data Recipe for Medical Encoder Pretraining

The article presents the FineMed corpus and DoctoBERT, a state-of-the-art French medical encoder family, developed through a novel web data curation approach for pretraining encoders in dense-terminology domains. The method employs medical-term density filtering and signal-amplifying rephrasing, demonstrating that filtered web data significantly enhances performance on downstream tasks compared to traditional educational quality filters. This advancement is crucial for practitioners as it allows for the effective utilization of web-scale data in training medical NLP models, improving scalability and diversity in language representation.

arXiv cs.CL | 34 d ago | found 13 d ago | #pretraining #medical-nlp #web-data

Scaling LLM Knowledge Boundaries via Distribution-Optimized Synthesis

The paper introduces KDoS (Knowledge Distribution-optimized Synthesis), a framework for enhancing knowledge injection in Large Language Models (LLMs) by optimizing knowledge distribution during synthetic data generation. It employs a three-stage feedback mechanism to shift from traditional synthesis methods to a distribution-aware approach, demonstrating that an optimal knowledge distribution can significantly expand knowledge boundaries across models ranging from 0.6B to 16B parameters (including Qwen, Ling, and LLaMA) and varying data scales of 1B to 5B tokens. This methodology consistently outperforms existing baselines across six knowledge benchmarks, providing a new practical framework for practitioners focused on improving LLM performance through synthetic data.

arXiv cs.CL | 34 d ago | found 13 d ago | #knowledge #synthesis #llm

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

The article presents a comprehensive survey on the application of reinforcement learning (RL) in training large language models (LLMs), emphasizing the need for a structured examination of RL algorithms beyond the commonly used PPO and GRPO methods. It categorizes the RL process into three stages: MDP creation, exploration techniques, and learning strategies, highlighting underexplored areas such as off-policy actor-critic training and bootstrapping methods, which could enhance LLM training. This framework aims to guide researchers in both RL and LLMs towards more effective methodologies and identifies key opportunities for integrating established RL techniques into LLM development.

arXiv cs.AI | 34 d ago | found 16 d ago | #reinforcement learning #llm #algorithms

EvoRubrics: Dynamic Rubrics as Rewards via Adversarial Co-Evolution for LLM Reinforcement Learning

EvoRubrics introduces a co-evolutionary reinforcement learning framework that dynamically generates rubrics for evaluating a Policy LLM, enhancing training effectiveness in open-ended tasks. By allowing the Rubric Generator to adapt its criteria in real time based on the evolving policy, EvoRubrics mitigates issues of reward saturation and improves discriminative power, outperforming both static and existing dynamic rubric methods across various benchmarks. This approach demonstrates that self-supervised co-evolution can yield rich learning signals, offering a novel avenue for practitioners to optimize LLM training without reliance on external supervision.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #reinforcement-learning #dynamic-rubrics

Strengthening LLMs for Tabular Prediction with Structural Priors

The paper introduces a novel approach for enhancing large language models (LLMs) in tabular prediction by integrating structural priors through a method called Permutation Relative Policy Optimization (PRPO). This technique employs column-permutation invariance and two-level advantage estimation, resulting in a competitive 8B parameter model that outperforms traditional tabular models and even larger LLMs, achieving significant improvements in both supervised and zero-shot settings across 139 OpenML datasets. This advancement is crucial for practitioners as it demonstrates a viable pathway for adapting LLMs to excel in specialized tasks like tabular data analysis, broadening their applicability in real-world scenarios.

arXiv cs.AI | 34 d ago | found 14 d ago | #llm #tabular-prediction #optimization

Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

The Chain-of-Goals Hierarchical Policy (CoGHP) introduces a unified autoregressive framework for long-horizon offline goal-conditioned reinforcement learning, overcoming limitations of existing hierarchical methods that use separate networks and single subgoals. CoGHP employs an MLP-Mixer backbone to facilitate cross-token communication, generating a sequence of latent subgoals that condition subsequent actions. This approach has shown consistent performance improvements over strong offline baselines in navigation and manipulation tasks, making it a significant advancement for practitioners focusing on long-horizon decision-making in RL.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #policy #hierarchical

Priority-Aware Learning-Unlearning Correction for Dynamic Decentralized LoRA Fine-Tuning

The article presents a priority-aware learning-unlearning correction framework for decentralized federated learning (DFL) using an orthogonal LoRA mechanism, addressing the challenges of dynamic edge networks where devices frequently join or leave. This framework allows for history-free updates by providing post-training contribution coordinates, enhancing the system's ability to adaptively correct fine-tuned parameters. The proposed system includes a resource allocation algorithm to optimize communication under constraints, demonstrating effective post-event corrections through experimental validation, which is crucial for practitioners aiming to implement efficient and adaptive DFL in real-world applications.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #fine-tuning #decentralized #LoRA

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

The article introduces Randomized YaRN, a novel training method that enhances length generalization in large language models (LLMs) by integrating YaRN-based positional extrapolation with randomized positional encoding and a length curriculum. Evaluated on the BABILong and Multi-Round Coreference Resolution benchmarks, Randomized YaRN demonstrates significant improvements in reasoning performance on context lengths ranging from 16K to 128K when trained on data with less than 8K context, outperforming traditional fine-tuning methods. This approach highlights the importance of exposing models to out-of-distribution positional representations to achieve effective long-context reasoning, which is crucial for practitioners developing LLMs for tasks requiring extensive context.

arXiv cs.CL | 34 d ago | found 13 d ago | #llm #length generalization #reasoning

Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters

This article presents a novel framework utilizing SHapley Additive exPlanations (SHAP) to analyze the impact of algorithms and hyperparameters on the generalization performance of Reinforcement Learning (RL) in robotics. It establishes a theoretical link between Shapley values and RL generalizability, revealing consistent configuration impacts across various tasks, which leads to improved generalization through SHAP-guided configuration selection. This approach offers practitioners a systematic method for optimizing RL configurations, potentially enhancing deployment efficacy in real-world scenarios.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #generalization #robotics

On-the-Fly Adaptation to Quantization: Configuration-Aware LoRA for Efficient Fine-Tuning of Quantized LLMs

The paper introduces CoA-LoRA, a configuration-aware method for adapting LoRA adapters to various quantization settings of large language models without the need for repeated fine-tuning. It utilizes a Pareto-based configuration search to optimize a training configuration set, enabling efficient low-rank adjustments across different bit-widths. This approach significantly reduces computational costs while maintaining or improving performance compared to existing methods that require separate fine-tuning for each quantization configuration, making it valuable for deploying quantized models on edge devices.

arXiv cs.AI | 34 d ago | found 14 d ago | #fine-tuning #quantization
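
The CoA-LoRA summary refers to a Pareto-based search over quantization configurations. The core operation in any such search is filtering out dominated candidates. The sketch below assumes two objectives to minimise, average bit-width and loss; those objective names, the dictionary layout, and the function name are illustrative assumptions, since the summary does not say which objectives the paper trades off.

```python
def pareto_front(configs):
    """Keep configs not dominated on (avg_bits, loss); lower is better for both."""
    def dominates(a, b):
        # a dominates b if it is no worse on both axes and strictly better on one.
        no_worse = a["avg_bits"] <= b["avg_bits"] and a["loss"] <= b["loss"]
        better = a["avg_bits"] < b["avg_bits"] or a["loss"] < b["loss"]
        return no_worse and better

    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers for illustration only; they are not results from the paper.
candidates = [
    {"name": "a", "avg_bits": 2.5, "loss": 2.9},
    {"name": "b", "avg_bits": 3.0, "loss": 2.4},
    {"name": "c", "avg_bits": 3.5, "loss": 2.6},
    {"name": "d", "avg_bits": 4.0, "loss": 2.1},
]
front_names = [c["name"] for c in pareto_front(candidates)]
```

Here configuration `c` is dropped because `b` uses fewer bits and has lower loss, while `a`, `b`, and `d` each represent a distinct trade-off and stay on the front.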

Hierarchical Reinforcement Learning for Sparse-Reward Search in Commutative Algebra

The article presents a novel hierarchical reinforcement learning (HRL) framework designed to tackle sparse-reward problems in commutative algebra, specifically addressing Kalai's algebraic Hirsch conjecture. It employs an options-based approach with an equivariant graph neural network policy, demonstrating superior performance over classical reinforcement learning methods and greedy search across various degrees. This work is significant for practitioners as it showcases the effective application of HRL in complex mathematical domains, potentially guiding future research in integrating AI with mathematical problem-solving.

arXiv cs.AI | 34 d ago | found 15 d ago | #reinforcement-learning #hrl #commutative-algebra

VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

The article introduces VRPO, a novel framework designed to enhance value modeling for robust reinforcement learning (RL) in the post-training phase of large language models (LLMs) under noisy supervision. VRPO improves stability and generalization by integrating auxiliary losses from a frozen language model and employing a variational information bottleneck to filter noise, transforming the value model into an active regulator of reward noise. Experimental results demonstrate that VRPO outperforms traditional methods like PPO and GRPO across various tasks, emphasizing the importance of robust value modeling in RL applications.

arXiv cs.AI | 34 d ago | found 14 d ago | #reinforcement-learning #post-training

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

The article introduces ReNIO (Reweighting Negative trajectory Importance for LLM On-Policy Distillation), a method that enhances on-policy distillation (OPD) by assigning greater importance to student-generated outputs (SGOs) that lead to incorrect reasoning, thereby preserving exploratory reasoning capabilities. By leveraging the student-to-teacher probability ratio, ReNIO effectively identifies and weights pivotal tokens from negative trajectories, improving performance on mathematical reasoning and code generation tasks, with reported gains of up to 10.00% for models like R1-Distill-Qwen-7B. This approach maintains the advantages of prefix-conditioned training while addressing the limitations of traditional OPD, making it a significant advancement for practitioners in LLM optimization.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #distillation

Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently

The paper presents a theoretical analysis comparing reinforcement learning with verifiable rewards (RLVR) to supervised fine-tuning (SFT) in enhancing reasoning capabilities of large language models. It establishes that SFT, when trained solely on optimal paths, fails to enable efficient backtracking, while RLVR facilitates learning to backtrack from dead ends using outcome rewards, resulting in significant computational efficiency during inference. This research is crucial for practitioners as it suggests that integrating RLVR can improve reasoning performance and optimize resource allocation in LLM applications.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #reinforcement-learning #reasoning

Cluster-Specific Localized Drift Detection for Efficient Batch Model Adaptation under Controlled Distribution Shift

This work presents a cluster-induced distribution shift simulation framework that enables the transformation of static tabular datasets into controlled evolving data streams, facilitating the evaluation of drift adaptation methods. Six adaptation strategies, including static learning and various retraining approaches, were assessed across five benchmark datasets for both classification and regression tasks using multiple predictive model families. This framework is significant for practitioners as it provides a structured methodology to evaluate and improve model robustness in dynamic environments where data distributions change over time.

arXiv cs.AI | 34 d ago | found 16 d ago | #drift #adaptation #datasets

MAGNIFIED: RL Fine-tuning of Multimodal Large Language Models for Motion Planning

The article presents MAGNIFIED, a reinforcement learning fine-tuning (RLFT) approach for multimodal large language models (MLLMs) aimed at improving motion planning in autonomous driving. By utilizing token-level rewards and mapping predicted tokens to vehicle trajectories, MAGNIFIED enhances planning performance, reducing the overlap rate by more than 10.5% and the off-road rate by 38.9% relative to a supervised fine-tuning baseline on the Waymo Open Motion Dataset. This approach highlights the importance of aligning MLLM objectives with real-world planning considerations, making it a significant advancement for practitioners in autonomous vehicle development.

arXiv cs.AI | 34 d ago | found 16 d ago | #reinforcement-learning #multimodal #planning

FedSA-GCL: A Semi-Asynchronous Federated Graph Learning Framework with Personalized Aggregation and Cluster-Aware Broadcasting

FedSA-GCL is a semi-asynchronous federated graph learning framework that addresses inefficiencies in existing synchronous methods by incorporating a ClusterCast mechanism, which leverages inter-client label distribution divergence and graph topological characteristics. Evaluated on real-world graph datasets, it outperforms 10 baseline methods, achieving an average improvement of 1.9% with the Louvain algorithm and 3.0% with Metis. This framework is significant for practitioners as it enhances robustness and efficiency in federated graph learning, making it more applicable to real-world scenarios.

arXiv cs.AI | 34 d ago | found 14 d ago | #federated-learning #graph-learning

Verifiable Counterfactual Supervision for Process Reward Models

The paper introduces a method for verifiable counterfactual supervision in process reward models (PRMs), which involves generating paired correct and erroneous reasoning trajectories to identify the first unsupported transition. This approach utilizes a verified symbolic reasoning chain, injecting controlled errors at intermediate steps, and ensures coherence in subsequent reasoning. Experimental results demonstrate that this method enhances performance on logical reasoning benchmarks, improving Best-of-8 reranking and indicating potential for transfer to mathematical evaluations, which is significant for practitioners aiming to develop more robust PRMs.

arXiv cs.AI | 34 d ago | found 14 d ago | #supervision #reward-models #llm

Scaling Performance and Low-Resource Annotation with Many-Shot In-Context Learning for Named Entity Recognition

This study investigates the effectiveness of many-shot in-context learning (ICL) for Named Entity Recognition (NER), revealing that scaling to hundreds of demonstrations allows large language models (LLMs) to match or exceed the performance of fine-tuned BERT models. The authors demonstrate that using around one hundred human-labeled examples as ICL demonstrations can produce high-quality labeled data, resulting in a 10% absolute F1 improvement when fine-tuning BERT for low-resource NER tasks. This approach is significant for practitioners as it reduces the need for extensive labeled datasets while enhancing model performance in structured tasks.

arXiv cs.AI | 34 d ago | found 16 d ago | #in-context learning #ner #annotation

PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

The paper introduces PoLAR (Polar Latent Actions with Radial structure), a novel approach for latent action pretraining that separates transition extent and mode by imposing a radial structure on latent actions. By utilizing temporal offsets between observations to inform the radius of latent actions, PoLAR enhances the representation of diverse transition modes, particularly in hyperbolic space, leading to improved policy performance in both simulation and real-world robot experiments. This advancement underscores the significance of latent action space geometry in effectively transferring visual pretraining to robot policy learning tasks.

arXiv cs.AI | 34 d ago | found 16 d ago | #robot-policy-learning #latent-actions

Predicting High-Risk Colorectal Polyps in African Americans Using Pre-Colonoscopy Clinical Features: Machine Learning Model Development and Temporal Validation

The study developed and validated machine learning models to predict high-risk colorectal polyps using non-invasive pre-colonoscopy features in a predominantly African American cohort. Various algorithms, including neural networks, random forests, SVM, and XGBoost, were evaluated on a dataset of 4,681 patients for internal validation and 1,562 patients for external validation. This approach aims to improve risk stratification and equitable access to surveillance, potentially optimizing resource allocation in healthcare settings with limited colonoscopy availability.

arXiv cs.AI | 34 d ago | found 16 d ago | #risk prediction #machine learning #colorectal polyps

Entropy Objectives in Markov Decision Processes

The paper presents a formal approach to synthesizing control policies in Markov Decision Processes (MDPs) that maintain an entropy-based objective, highlighting the complexity of even relaxed versions of this problem. It introduces a sound and conditionally complete method for verifying and synthesizing strategies, leveraging convex duality and invariant synthesis to tackle the non-linear nature of entropy objectives. This work is significant for practitioners as it provides a framework for implementing entropy constraints in stochastic systems, potentially enhancing the robustness and performance of AI decision-making processes.

arXiv cs.AI | 34 d ago | found 20 d ago | #mdp #entropy #control

Repeated post-training is not Self-improving: Diagnosing Scientific Amnesia in Continual DPO Pipelines

The paper introduces the concept of "scientific amnesia" in continual DPO training pipelines for LLMs, where models fail to accumulate reusable knowledge despite preserving learned behaviors. It presents a diagnostic suite for identifying this issue, a pipeline utilizing FSDP-sharded DPO checkpoints on Qwen2.5-7B-Instruct, and a benchmark based on 30 campaigns of HumanEval. The findings indicate that while many strategies degrade model performance, a conservative rule-based scheduling approach shows improvement, highlighting the need for tailored interventions based on specific training conditions and evaluation designs.

arXiv cs.AI | 34 d ago | found 20 d ago | #self-improvement #llm #training dynamics

Attention-Spectrum Regularization for Replay-Free Continual Multimodal LLMs

The article presents Attention-Spectrum Regularization (ASR), a novel framework for replay-free continual learning in multimodal large language models (MLLMs) that preserves skill-conditioned structures of cross-modal attention. ASR utilizes spectral statistics of cross-attention maps to maintain prototypes of skills, effectively controlling the drift of these prototypes during adaptation to new tasks. Experimental results on benchmarks such as VQA v2 and CoIN demonstrate that ASR enhances performance and reduces forgetting compared to existing methods, making it a lightweight solution for practitioners working with continual learning in MLLMs.

arXiv cs.AI | 34 d ago | found 15 d ago | #continual learning #multimodal #llm

When Is an LLM Worth It for Hyperparameter Optimization? A Budget-Matched Study on Tabular Data Finds the Warm-Start Is a Default Configuration, Not the Model

The study evaluates the efficacy of large language models (LLMs), specifically LLM-OptFlow, as hyperparameter optimization (HPO) advisors on tabular data across eight benchmarks. It finds that the initial strong performance of the LLM is primarily due to a fixed default configuration rather than the model's outputs, yielding only marginal improvements in cross-validation accuracy. For practitioners, the results suggest that classical search methods seeded with sensible defaults may be more effective than LLM-based approaches for HPO in tabular settings, as LLMs do not provide significant generalization benefits and can be outperformed in a limited number of evaluations.

arXiv cs.AI | 34 d ago | found 16 d ago | #hyperparameter optimization #llm #tabular data

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

This paper presents a novel approach to safe exploration in hierarchical reinforcement learning by integrating a learnable world model with a high-level policy that sets subgoals and a low-level policy that uses imagined rollouts to navigate safely. The method demonstrates significant improvements over existing Safe RL baselines in long-horizon navigation and manipulation tasks, achieving higher success rates and better adherence to safety constraints. This advancement is crucial for practitioners as it addresses the limitations of current safe exploration techniques in complex environments, enabling more reliable deployment of RL agents in safety-sensitive applications.

arXiv cs.AI | 34 d ago | found 20 d ago | #reinforcement_learning #safe_exploration #hierarchical

RECALL: Recovery Experience Collection for Active Lifelong Learning in Vision-Language-Action Models

The paper presents a novel active continual learning paradigm for Vision-Language-Action (VLA) models, emphasizing the benefits of uncertainty-guided data collection over traditional passive imitation learning. Key findings include improved fine-tuning efficiency when using actively collected recovery data, though this approach risks catastrophic forgetting if not managed properly. Techniques such as replay-based data mixing and elastic weight consolidation are evaluated, highlighting the trade-offs between adapting to new data and retaining previously learned behaviors, which is crucial for practitioners developing robust VLA systems.

arXiv cs.AI | 34 d ago | found 15 d ago | #lifelong learning #vision-language
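
The RECALL summary names two standard remedies for catastrophic forgetting: replay-based data mixing and elastic weight consolidation (EWC). The sketch below shows the textbook form of each on flat parameter lists. It is not the paper's code; the function names, the diagonal Fisher estimate passed in as `fisher`, and the fixed `replay_fraction` are assumptions for illustration.

```python
import random


def ewc_penalty(params, anchor_params, fisher, strength=1.0):
    """EWC regulariser: (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher))


def mixed_batch(new_data, replay_data, batch_size, replay_fraction=0.5, rng=random):
    """Sample a batch with a fixed share of old (replay) examples."""
    n_replay = round(batch_size * replay_fraction)
    return (rng.sample(replay_data, n_replay)
            + rng.sample(new_data, batch_size - n_replay))
```

The two knobs express the trade-off the summary describes: a larger `strength` or `replay_fraction` protects previously learned behaviour, while smaller values let the model adapt faster to the newly collected recovery data.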

SLeDGe: Semi-Supervised Learning on Data Streams with Graph Structure Learning

SLeDGe is a novel semi-supervised learning (SSL) method designed for data streams that integrates adaptive graph structure learning with a predictive model under strict memory and labeling constraints. It features distinct update strategies for maintaining compact labeled and unlabeled memories and encourages sparsity in the relational graph to enhance label supervision propagation. Evaluated across 12 datasets, SLeDGe demonstrates significant performance improvements, achieving average relative accuracy gains of 31.7% with only 0.1% labeled data and 14.8% with 1% labeled data, making it a valuable tool for practitioners dealing with evolving data streams.

arXiv cs.AI | 34 d ago | found 16 d ago | #semi-supervised-learning #data-streams #graph-structure

Subspace-Constrained Federated Learning with Low-Rank Adaptation

The paper presents a subspace-regularized federated low-rank adaptation (LoRA) method that addresses geometric misalignment in federated learning, which can hinder convergence and aggregation. The proposed method was empirically evaluated on RoBERTa-large and SmolLM-360M models, demonstrating superior performance over FedAvg and other baselines, achieving the highest accuracy and lowest loss metrics, as well as near-perfect basis overlap. This work is significant for practitioners as it enhances the efficiency of fine-tuning large models in federated settings, particularly under conditions of data heterogeneity.

arXiv cs.AI | 34 d ago | found 15 d ago | #federated learning #low-rank adaptation

Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

The paper presents a systematic study on Knowledge Distillation (KD) in the post-training phase, specifically utilizing the large-scale Tulu 3 dataset. It finds that KD outperforms supervised fine-tuning in low-data scenarios, with diminishing returns as data increases; however, distillation from a strong instruction-tuned teacher can enhance performance even with abundant data. The authors propose a two-stage KD strategy that combines synthetic teacher-labeled data and human refinement, offering a practical approach for developing efficient models in resource-constrained settings.

arXiv cs.CL | 34 d ago | found 13 d ago | #knowledge #distillation #llm
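
For readers less familiar with the objective behind the Knowledge Distillation study above, here is a minimal sketch of the classic soft-label distillation loss: the KL divergence from the teacher's to the student's temperature-softened distribution, scaled by T². The summary does not say which KD variant the paper uses, so treating it as this classic form is an assumption, as are the function names and the default temperature.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """T^2 * KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

Unlike a one-hot label, the teacher distribution carries information about how plausible each wrong answer is, which is one common explanation for why distillation helps most when labeled data is scarce.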

Evaluating Document-Tuned Transformer Representations for Person-level Mental Health Assessment

The study evaluates document-tuned transformers against base-transformers for person-level mental health assessment, revealing that document-tuned models, which are further fine-tuned at the document level, achieve a 13.4% increase in Pearson correlation (p=.015) across two psychological datasets. Robustness tests indicate these models maintain higher accuracy under various perturbations and better capture hedged language, suggesting they are more effective for predicting mental health outcomes. This highlights the importance of model representation choice in enhancing the reliability of AI-driven psychological assessments.

arXiv cs.CL34 d agofound 13 d ago#mental health#transformers#document

Steer, Don't Solve: Training Small Critic Models for Large Code Agents

The paper introduces an approach to enhancing large code agents with a small critic model that provides intra-trajectory feedback, trained through supervised fine-tuning, rather than relying on post-hoc evaluations. The critic, trained on CWM-32B trajectories, improves performance on SWE-bench Verified, with gains of up to +5.2 points on Qwen agents, while reducing training costs by a factor of 30 to 92 compared to traditional methods. This approach highlights the potential for smaller, specialized models to optimize training efficiency and accuracy in large-scale code generation tasks.

arXiv cs.AI34 d agofound 16 d ago#critic-models#code-agents#feedback

Generalization of Fine-Tuned Uncertainty Communication and Metacognition in Large Language Models

This study investigates the impact of supervised fine-tuning on the uncertainty communication capabilities of large language models. Two models were fine-tuned on diverse tasks, showing improved alignment between confidence levels and answer correctness, particularly in single-question confidence estimation and pairwise comparisons. The findings suggest that while fine-tuning enhances metacognitive performance, the transfer of skills between different confidence tasks is limited, indicating the potential benefit of multitask training for broader applicability in AI applications.

arXiv cs.AI34 d agofound 14 d ago#llm#fine-tuning#uncertainty#metacognition

OFMU: Optimization-Driven Framework for Machine Unlearning

The article presents OFMU, a penalty-based bi-level optimization framework designed for machine unlearning, which allows large language models to remove specific knowledge while maintaining performance on remaining data. OFMU employs a hierarchical structure with an inner maximization step that incorporates a similarity-aware penalty to mitigate conflicting gradients, and an outer minimization step to restore model utility. The framework demonstrates improved forgetting efficacy and retained utility compared to existing methods, with extensive experimental validation across various vision and language benchmarks, making it a significant advancement for practitioners needing effective unlearning capabilities in sensitive applications.

arXiv cs.AI34 d agofound 14 d ago#machine-unlearning#llm

Distribution-Aware Diffusion-LLM for Robust Ultra-Long-Term Time Series Forecasting

The article introduces Diffusion-LLM, a novel framework that integrates a conditional diffusion model with a large language model (LLM) for improved ultra-long-term time series forecasting. Evaluated on six benchmarks, including ETT and Weather, Diffusion-LLM demonstrates significant performance improvements over existing LLM-based methods, particularly in few-shot scenarios, by enhancing probabilistic modeling and semantic alignment in a shared latent space. This advancement is crucial for practitioners seeking robust and generalizable solutions in multimodal time series forecasting using LLMs.

arXiv cs.AI34 d agofound 15 d ago#time series#forecasting#llm

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

The article poses an open problem: how effective is the AdamW optimizer under heavy-tailed noise, which is common in large language model (LLM) pretraining? Although AdamW is widely used, its theoretical behavior in this setting remains largely unexamined, in contrast with recent findings that sign-based optimizers like Lion and Muon perform well under similar conditions. The authors propose a rigorous inquiry into AdamW's convergence properties under heavy-tailed assumptions and present preliminary results: a positive benchmark and a mechanism showing how denominator memory can obscure large gradients. Both are relevant to practitioners seeking to optimize LLM training in noisy environments.

arXiv cs.AI34 d agofound 15 d ago#adamw#optimization#heavy-tailed noise
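
One way to build intuition for "denominator memory" in the item above is a scalar Adam simulation. The sketch below is an assumption-laden toy, not the paper's construction: it uses the standard Adam moment updates with bias correction, omits AdamW's decoupled weight decay (which does not touch the adaptive term), and feeds in a hypothetical gradient stream with one large spike.

```python
import math


def adam_update_sizes(grads, lr=1.0, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return |update| per step for scalar Adam with bias correction."""
    m = v = 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment (numerator)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (denominator)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        sizes.append(abs(lr * m_hat / (math.sqrt(v_hat) + eps)))
    return sizes


# Hypothetical streams: constant gradients vs. one heavy-tailed outlier.
steady = adam_update_sizes([1.0] * 50)
spiked = adam_update_sizes([1.0] * 10 + [1000.0] + [1.0] * 39)

# Compare the final step sizes of the two runs.
ratio = spiked[-1] / steady[-1]
```

Because `beta2` is close to 1, the squared outlier stays in the denominator long after the numerator has forgotten it, so steps taken dozens of iterations later are still much smaller than in the steady run.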

When Compression Helps and When It Hurts: Condition-Aware Analysis of Chain-of-Thought Distillation

The paper analyzes Chain-of-Thought (CoT) distillation, focusing on the effectiveness of compression methods such as selective pruning and generative rewriting. It finds that the usefulness of an importance criterion depends on granularity: step-level criteria share a reasoning backbone, while token-level pruning needs symbol-aware signals. The study also shows that restructuring affects performance differently across domains and that training-time compression savings do not always translate into lower inference costs. Together, these findings give practitioners condition-aware guidelines for model deployment.

arXiv cs.CL34 d agofound 13 d ago#chain-of-thought#distillation#compression

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

A new framework for modeling the energy consumption of Transformer training on multiple GPUs has been introduced, focusing on BERT models. This framework utilizes architectural sweeps to correlate energy usage with proxies for compute, memory traffic, and hardware efficiency, incorporating a roofline-inspired model that accounts for tensor parallelism and fully sharded data parallelism. This approach is significant for practitioners as it provides a predictive model for energy costs, aiding in the design of sustainable and cost-effective AI systems.

arXiv cs.AI34 d agofound 15 d ago#energy#transformers#scaling
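
As background for the "roofline-inspired" model in the item above, the classic roofline bound can be sketched in a few lines. This is the generic performance roofline, not the paper's energy model; the function names and the hardware numbers are hypothetical.

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Lower bound on runtime: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)


def is_memory_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """True when arithmetic intensity falls below the machine's ridge point."""
    intensity = flops / bytes_moved          # FLOPs per byte of traffic
    ridge = peak_flops / peak_bandwidth      # intensity where the two limits meet
    return intensity < ridge


# Hypothetical device: 1e12 FLOP/s peak compute, 1e11 B/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 1e12, 1e11

heavy_compute = roofline_time(1e12, 1e9, PEAK_FLOPS, PEAK_BW)
heavy_traffic = roofline_time(1e9, 1e9, PEAK_FLOPS, PEAK_BW)
```

An energy estimate would need additional assumptions, such as a power draw per regime, which the summary does not specify.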

TIF: Learning Temporal Invariance in Android Malware Detectors

The paper introduces TIF, a novel temporal invariant training framework designed to improve the stability of representations in Android malware detectors facing distribution drift. TIF utilizes multi-proxy contrastive learning and invariant gradient alignment to effectively manage temporal drift by organizing environments based on application observation dates. Experimental results demonstrate that TIF significantly enhances detection performance, especially during early deployment phases, outperforming existing state-of-the-art methods and addressing critical challenges in malware detection.

arXiv cs.AI34 d agofound 13 d ago#malware#detection#temporal

Fara-1.5: Scalable Learning Environments for Computer Use Agents

Fara-1.5 introduces a scalable data pipeline, FaraGen1.5, designed for training computer use agents (CUAs) through modular components: environments, solvers, and verifiers. The pipeline utilizes both live and synthetic environments, powered by models like GPT-5.4, and employs a supervised finetuning approach to produce three variants of Fara1.5 (4B, 9B, and 27B). The 9B and 27B models achieve state-of-the-art performance on browser-use benchmarks, with scores of 63.4% and 72.3% on Online-Mind2Web, respectively, demonstrating significant advancements in efficiency and task correctness for practitioners developing LLM-based agents.

arXiv cs.AI34 d agofound 20 d ago#data generation#agents#workflow

Negative Knowledge as Failure-aware Shared Memory for AutoResearch

The article presents a novel approach called the negative knowledge memory layer, which allows AI-assisted research systems to retain and utilize information from failed experiments as structured knowledge. Evaluated on ScienceAgentBench and nonlinear math-physics PDE problems, this method demonstrated improved performance over traditional AutoResearch baselines, using fewer tokens and enabling agents to solve previously unsolvable tasks. This advancement highlights the importance of maintaining a comprehensive knowledge repository that includes both successes and failures, enhancing the overall efficacy of AI in scientific research.

arXiv cs.AI34 d agofound 20 d ago#negative knowledge#auto research#failure awareness

Learning Burst-Aware Early Warning Models for Capacity Stress under AI Workload Surges in Hyperscale Data Centers

The paper introduces a burst-aware early warning framework designed for predicting capacity stress in hyperscale data centers under AI workload surges, particularly from large language models. It employs a lightweight XGBoost model, achieving an ROC AUC of 0.697 and a Recall of 0.914, which enables proactive operational interventions before system degradation occurs. This framework is significant for practitioners as it enhances the resilience of data center operations by integrating predictive analytics into workload management strategies, addressing the unique demands of AI-driven jobs.

arXiv cs.AI34 d agofound 20 d ago#capacity stress#ai workload#data centers

LAYUP: Asynchronous decentralized gradient descent with LAYer-wise UPdates

LAYUP introduces an asynchronous decentralized stochastic gradient descent (SGD) method with layer-wise updates, designed to mitigate the communication overhead associated with synchronous, centralized training. By utilizing randomized gossip communication, LAYUP allows for immediate application of layer-wise updates during backpropagation, achieving convergence up to 32% faster than traditional synchronous data parallel training and 27% faster than existing communication-efficient algorithms, while improving robustness against stragglers. This approach enhances model FLOPs utilization and provides a viable solution for practitioners seeking efficient distributed training without compromising accuracy.

arXiv cs.AI34 d agofound 13 d ago#distributed#training#sgd
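
The randomized gossip communication mentioned in the item above can be illustrated with plain pairwise averaging. This is a generic gossip sketch on scalar "parameters", not LAYUP's layer-wise protocol; the worker values, seed, and round count are arbitrary assumptions.

```python
import random


def gossip_round(values, rng):
    """Pick two distinct workers at random and replace both with their average."""
    i, j = rng.sample(range(len(values)), 2)
    avg = (values[i] + values[j]) / 2
    values[i] = values[j] = avg


def spread(values):
    """Gap between the largest and smallest worker value."""
    return max(values) - min(values)


rng = random.Random(0)                 # fixed seed so the run is repeatable
workers = [0.0, 4.0, 8.0, 12.0]        # hypothetical per-worker parameter
initial_mean = sum(workers) / len(workers)

for _ in range(200):
    gossip_round(workers, rng)

final_mean = sum(workers) / len(workers)
```

Each exchange preserves the global mean while shrinking disagreement, so workers drift toward consensus without a central synchronization step.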

Enhancing Cognitive Workload Classification Using Integrated LSTM Layers and CNNs for fNIRS Data Analysis

The paper presents a deep learning model that integrates Long Short-Term Memory (LSTM) layers with Convolutional Neural Networks (CNNs) to improve cognitive workload classification using functional near-infrared spectroscopy (fNIRS) data. The study reports an increase in classification accuracy from 97.40% to 97.92% by addressing issues of spatial feature overfitting and temporal dependency. This advancement is significant for practitioners as it enhances the ability to accurately assess cognitive states, potentially improving applications in neuroergonomics and cognitive load monitoring.

arXiv cs.AI34 d agofound 13 d ago#cognitive#workload#classification

LangMAP: A Language-Adaptive Approach to Tokenization

LangMAP introduces a language-adaptive tokenization method that enhances the UnigramLM algorithm for multilingual applications, allowing for language-specific tokenization using a single shared vocabulary. This approach enables the adaptation of pretrained model tokenizers without altering their vocabulary and performs language-specific tokenization at inference without prior language knowledge. The method demonstrates improved morphological boundary alignment across various languages and coding contexts, although its effectiveness in knowledge-related tasks shows variability.

arXiv cs.CL34 d agofound 13 d ago#llm#tokenization#language adaptive

Cohort Organized Learning: Clustering Through Agreement

Cohort Organized Learning (CoOL) is a clustering method that requires no explicit distance or similarity computations and instead uses neural networks to estimate clusters. The paper derives training gradients via expectation maximization, describes techniques for monitoring convergence and evaluating clusters after training, and demonstrates applications on vector data and images. The approach gives practitioners a flexible way to cluster diverse data types; the paper also discusses its limitations and potential future applications.

arXiv cs.AI34 d agofound 16 d ago#clustering#neural-networks#cohort-learning

Learning Process Rewards via Success Visitation Matching for Efficient RL

The paper introduces a novel method for transforming sparse rewards in reinforcement learning (RL) into dense process rewards using a discriminator to differentiate between successful and unsuccessful episodes. This technique incentivizes the policy to match the state-action visitations of successful episodes, facilitating faster training without altering the optimal policy. The approach significantly improves finetuning performance in robotic control tasks, demonstrating its practical relevance for practitioners aiming to enhance RL efficiency in sparse reward scenarios.

arXiv cs.AI34 d agofound 15 d ago#reinforcement learning#sparse rewards

Test-Time Training with Next-Token Prediction

The paper introduces Test-Time Training with Next-Token Prediction (TTT-NTP), a method that enables fast-weight adaptation in pretrained long-context language models without requiring modifications to the model architecture. TTT-NTP uses the model's next contextual hidden state to supervise updates, allowing it to leverage the self-supervised next-token prediction signal effectively. Benchmark results show that TTT-NTP improves performance on RULER Full-13 across models like Llama-3.1-8B and Mistral-7B-v0.3, as well as on the LongBench-v2 QA benchmark, making it a valuable technique for practitioners seeking to enhance the capabilities of existing LLMs.

arXiv cs.CL34 d agofound 13 d ago#test-time training#llm#next-token prediction

The Score Granularity Gap in Black-Box LLM Classification: A Comparative Study of Confidence Constructions

This study examines the "score granularity gap" in black-box LLM classification, analyzing seven methods for constructing confidence scores across 25 model-dataset pairs involving 9 LLMs. It finds that while single-shot verbalized confidence can effectively rank cases, it offers limited threshold granularity, which impacts decision-making in deployment. The research provides insights into how different confidence constructions affect model performance and inference costs, offering practical guidance for practitioners on optimizing confidence score usage in LLM applications.

arXiv cs.CL34 d agofound 13 d ago#llm#confidence-scores#black-box
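
The granularity issue in the item above can be made concrete: a score that takes only a few distinct values offers only a few usable thresholds. The sketch below uses made-up scores and hypothetical helper names; it is not one of the seven constructions from the study.

```python
def operating_points(scores):
    """Number of distinct score values, i.e. distinct thresholds that change the decision."""
    return len(set(scores))


def coverage_at(scores, threshold):
    """Fraction of cases accepted when keeping scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical data: verbalized confidence clusters on round values,
# while a continuous score separates every case.
verbalized = [0.9, 0.9, 0.8, 0.9, 0.7, 0.8, 0.9, 0.8]
continuous = [0.93, 0.88, 0.81, 0.91, 0.72, 0.79, 0.95, 0.84]

coarse = operating_points(verbalized)
fine = operating_points(continuous)
```

With the coarse score, coverage can only jump between a handful of levels, which limits how precisely a deployment threshold can be tuned.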

Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning

The article presents Adaptive Data Scheduling (ADS), a dual-level framework designed to enhance reinforcement learning (RL) post-training for Large Language Models (LLMs) by replacing uniform data sampling with an adaptive approach based on semantic clusters and policy boundaries. Experimental results show that ADS improves average accuracy by 5.2% over Group Relative Policy Optimization (GRPO) across three LLMs and seven reasoning benchmarks, indicating its effectiveness as a versatile data scheduling strategy for practitioners in LLM reinforcement learning. The source code for ADS is publicly available on GitHub.

arXiv cs.CL34 d agofound 13 d ago#reinforcement-learning#llm#data-scheduling

P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist

P-Check is a new personalized reward modeling framework that introduces a dynamic checklist generator for evaluating user preferences, addressing the limitations of static user context in existing models. It employs a Preference-Contrastive Criterion Weighting strategy to prioritize evaluation criteria based on their relevance to individual judgments. The framework shows improved reward accuracy and performance in downstream personalized generation tasks, particularly in out-of-distribution scenarios, making it significant for practitioners focused on enhancing user alignment in AI systems.

arXiv cs.CL34 d agofound 13 d ago#reward modeling#personalization#llm

Machine Learning Classification of Cryopathy Syndromes: A Comprehensive Comparative Study

The study presents a comparative analysis of machine learning techniques for the classification of cryopathy syndromes using laboratory data from 2,686 patients across 14 diagnostic categories. Twelve modeling strategies were evaluated, with a soft-voting ensemble of Random Forest and Gradient Boosted Trees achieving the best multiclass performance, while tree-based methods outperformed neural networks. This work highlights the importance of feature engineering in improving classification accuracy and provides a potential framework for clinical decision support in a challenging diagnostic landscape characterized by class imbalance and overlapping symptoms.

arXiv cs.AI34 d agofound 16 d ago#ml#classification#cryopathy
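
Soft voting, as used by the best-performing ensemble in the item above, averages class probabilities across models before picking a label. The sketch below shows the general technique with hypothetical probabilities; it does not reproduce the study's models or data.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-model class probabilities and return (label, averaged probs).

    prob_rows: one probability list per model, all over the same classes.
    weights:   optional per-model weights (assumed equal when omitted).
    """
    weights = weights or [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


# Hypothetical outputs for one patient over three diagnostic classes.
random_forest_probs = [0.5, 0.3, 0.2]
boosted_trees_probs = [0.2, 0.6, 0.2]

label, avg = soft_vote([random_forest_probs, boosted_trees_probs])
```

Here the two models disagree on the top class, so a hard majority vote would tie; averaging probabilities lets the more confident model decide.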