Inference — AI news — AI News Digest

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI and Broadcom have announced the Jalapeño chip, a custom inference chip designed for large language model (LLM) workloads. The chip aims to improve the performance, efficiency, and scalability of AI systems, giving practitioners who work with LLMs dedicated hardware for inference tasks.

Reddit r/LocalLLaMA | 32 d ago | found 12 d ago | #openai #broadcom #llm #chip

DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

UC San Diego's DFlash introduces a block diffusion model for speculative decoding, allowing for the drafting of whole token blocks in a single forward pass with key-value (KV) injection for conditioning. The model achieves a reported 6.08x lossless speedup on the Qwen3-8B model and up to 15x throughput on NVIDIA's Blackwell architecture, while supporting frameworks like SGLang, vLLM, and TensorRT-LLM. This advancement is significant for practitioners as it enhances decoding efficiency and throughput, which are critical for real-time applications in AI.

MarkTechPost | 33 d ago | found 12 d ago | #speculative_decoding #throughput #nvidia
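
As a rough illustration of the draft-and-verify step that makes speculative decoding lossless, the sketch below accepts the longest prefix of a drafted block that matches the target model's greedy choices. It is not DFlash's implementation: `target_next` and `toy_target` are hypothetical stand-ins for the target model, and a real system would score the whole block in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List


def verify_block(prefix: List[int], draft: List[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens to append: accepted draft tokens plus one target token."""
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # The whole block matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    """Hypothetical greedy target model: always emits the last token plus one."""
    return seq[-1] + 1
```

Because every emitted token is one the target would have produced anyway, the output is identical to plain greedy decoding; the speedup comes from how many drafted tokens survive each verification.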

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI News | 33 d ago | found 12 d ago | #openai #broadcom #llm #chip

Efficient Test-time Inference for Generative Planning Models with OCL Search

This paper presents an optimized inference method for generative planning models using a modified Open-Closed List (OCL) search algorithm. The approach integrates a generative model for rapid rollouts and a heuristic model for prioritizing reasoning paths, resulting in improved computational efficiency and solution quality across various combinatorial planning domains. This advancement is significant for practitioners as it enhances the performance of generative models without requiring extensive computational resources during inference.

arXiv cs.AI | 33 d ago | found 10 d ago | #inference #planning #OCL

Lightweight Transformer Models for On-Device Fault Detection: A Benchmark Study on Resource-Constrained Deployment

This study benchmarks lightweight transformer models (DistilBERT, TinyBERT-6L, TinyBERT-4L, MobileBERT) against traditional ML methods (Random Forest, XGBoost, SVM, Logistic Regression) for on-device fault detection across three datasets. Key findings indicate that TinyBERT-4L offers a favorable trade-off with a model size of 55 MB and a CPU inference latency of 18 ms, while INT8 quantization can reduce model size by 25% with minimal impact on classification performance (86.9% F1). The results underscore the challenges of deploying accurate models in resource-constrained environments, particularly in scenarios with extreme class imbalance.

arXiv cs.AI | 33 d ago | found 10 d ago | #fault detection #transformers #benchmark

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

CrossPool is a new serving engine designed for cold mixture of experts (MoE) models, addressing GPU memory inefficiencies by separating feedforward network (FFN) weights and key-value (KV) caches into distinct memory pools. This architecture allows for dynamic KV-cache allocation based on active demand while consolidating FFN weights across multiple models, significantly improving GPU memory utilization and supporting long-context requests. CrossPool demonstrates a performance improvement over existing KV-cache-based multi-LLM serving systems, achieving up to a 10.4x reduction in P99 tail latency, which is crucial for practitioners aiming to optimize resource allocation and response times in LLM deployments.

arXiv cs.AI | 33 d ago | found 10 d ago | #llm #serving #memory

FP8 is All You Need (Part 2): Efficient Ozaki-Bailey Style FFT Through Tensor-core Garner Reformulation and Kulisch Escape Route

The paper proposes a way to reach FP64-equivalent throughput for 3-D FFTs on NVIDIA's Blackwell Ultra (B300) GPU by running the Ozaki-Bailey FFT framework on FP8 tensor cores. The method uses a mantissa-sliced Chinese-remainder reconstruction and integrates Kulisch fixed-point arithmetic to maintain FP64 accuracy while operating on INT32, with projected performance for 1024^3 FFTs at approximately 18 ms. For practitioners, this enables efficient use of lower-precision computation in memory-bound workloads, and the authors point toward a dedicated libKulisch library and benchmarking efforts.

arXiv cs.AI | 33 d ago | found 10 d ago | #fft #tensor-core #optimization

CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

CompressKV is a newly proposed framework for compressing key-value (KV) caches in long-context large language models (LLMs), specifically targeting GQA-based architectures. It introduces the concept of Semantic Retrieval Heads (SRHs) to selectively retain critical tokens based on their semantic importance, significantly improving resource efficiency. In experiments, CompressKV maintained over 97% of full-cache performance using only 3% of the KV cache on LongBench and achieved 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack, highlighting its potential for optimizing memory usage in LLM inference.

arXiv cs.AI | 33 d ago | found 12 d ago | #kv-cache #llm #compression
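
To make the idea of score-based KV-cache retention concrete, here is a minimal sketch that keeps only the highest-scoring token positions under a fixed budget. The per-token `scores` are assumed to be given; how CompressKV derives them from Semantic Retrieval Heads is not described in the summary, and `select_kv_tokens` is a hypothetical helper, not the paper's API.

```python
from typing import List, Sequence


def select_kv_tokens(scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in original order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the kept positions so the compressed cache preserves token order.
    return sorted(ranked[:budget])
```

The returned indices would then be used to gather the corresponding key and value rows, with everything else evicted from the cache.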

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

The paper introduces Cavewoman, a two-channel evaluation protocol for assessing the effects of linguistic input and output compression on large language models (LLMs). It evaluates eight models across five datasets, revealing that output compression can reduce inference costs by 1.4-2.4x, while input compression typically increases costs by 1.15x on average, leading to longer, less accurate responses. This research highlights the importance of carefully managing compression strategies in LLM applications, as input compression may degrade performance and inflate operational expenses.

arXiv cs.AI | 33 d ago | found 10 d ago | #compression #llm #cost reduction

Zero-Shot Test-Time Canonicalization using Out-of-Distribution Scoring

The paper presents a novel approach to zero-shot test-time canonicalization that addresses the misclassification of inputs transformed by affine operations in pretrained vision models. By reframing canonicalization as out-of-distribution (OOD) detection, the authors explore various OOD scoring functions and optimization algorithms, finding that distance-based scores combined with random search and local refinement yield the best performance across diverse benchmarks. This method allows practitioners to improve model robustness without altering the classifier architecture or retraining, thus preserving in-distribution accuracy while enhancing performance on transformed inputs.

arXiv cs.AI | 33 d ago | found 10 d ago | #canonicalization #ood detection #vision models

I'm eager for a 15x speedup on my strix halo

Nvidia has announced the potential for a 15x speedup in processing using a diffusion model that generates an entire block of text at once. This improvement could significantly enhance the performance of applications relying on text generation, making it relevant for practitioners seeking efficiency in large language model deployments. The specifics of the model size and architecture changes were not disclosed, but the implications for faster inference times could impact real-time applications in AI.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #nvidia #speedup #diffusion

GLM 5.2 on Mac Studio Speedup PR

GLM 5.2 has been optimized for Mac Studio, achieving prefill speeds exceeding 100 tokens per second (t/s) while accommodating larger contexts. This update allows for 4-bit quantization with context sizes over 100,000 tokens, enhancing performance and efficiency for practitioners working with large language models. The improvements are detailed in a pull request by the oMLX creator, indicating significant advancements in model deployment on Mac hardware.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #glm #mac #speedup

CPU-only TTS benchmark: Kokoro 82M vs Supertonic 3 vs Inflect-Nano-v1 (4.6M params), with UTMOS scoring on every sample

A benchmark comparing three CPU-only TTS models—Kokoro (82M params), Supertonic 3, and Inflect-Nano-v1 (4.6M params)—was conducted using Intel Xeon hardware with UTMOS scoring for audio quality. Results indicated that while Inflect-Nano achieved the fastest real-time factor (RTF) of 7.3x, its audio quality was rated poorly (MOS 3.48) due to issues with naturalness, whereas Kokoro provided the most human-like output (MOS 4.44) albeit at a slower RTF. The findings are significant for practitioners as they highlight trade-offs between speed and audio quality in TTS systems, with Kokoro being the most suitable for applications requiring natural-sounding speech.

Reddit r/LocalLLaMA | 33 d ago | found 21 d ago | #tts #benchmark #cpu

Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

The paper presents a novel approach to speculative decoding for low-latency inference of large language models (LLMs) in edge-cloud environments, introducing a method called UCB-SpecStop that dynamically adjusts the draft length based on communication delays. The study formulates the draft length tradeoff as an optimal stopping problem and establishes a state-dependent threshold policy for varying network conditions, demonstrating that UCB-SpecStop can reduce per-token latency by up to 22.4% compared to existing methods. This advancement is significant for practitioners as it enhances the efficiency of LLM inference in real-time applications by optimizing communication resources and adapting to network variability.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #speculative #decoding

Recency/Frequency Adaptive KV Caching for Large Language Model Serving

The article presents a novel adaptive key-value (KV) caching strategy for large language model inference, which improves cache management by dynamically allocating space between recently and frequently accessed KV blocks. This approach enhances the KV cache hit rate by up to 10.8% and reduces time to first token by up to 12.6% on synthetic workloads, and by 2.1% and 2.0% on real-world conversation tasks compared to naive vLLM. This advancement is significant for practitioners as it optimizes inference efficiency and accommodates diverse workloads, addressing limitations of traditional caching methods like least-recently-used (LRU).

arXiv cs.AI | 34 d ago | found 16 d ago | #kv_caching #llm_serving

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a new post-training quantization pipeline designed for large language and diffusion models, achieving optimal rate-distortion performance. It demonstrates superior results compared to existing methods like HIGGS, TurboQuant, and OCTOPUS, with weight compression of approximately 3.9x and KV cache compression of 3.79x at 4 bps while maintaining near-lossless quality. The pipeline integrates a Randomized Hadamard Transform, low-dimensional optimal lattice quantization, and advanced coding techniques, making it relevant for practitioners aiming to enhance model efficiency without sacrificing performance.

arXiv cs.AI | 34 d ago | found 15 d ago | #quantization #llm

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

GRINQH (GRaded INput-based Quantization Hierarchy) is a weight-only post-training quantization framework designed to enhance LLM decoding efficiency by addressing the asymmetry between compute-bound prefill and memory-bound decoding stages. It dynamically assigns weight channels to varying precision levels based on activation magnitudes, allowing for flexible average bit widths, and has demonstrated superior performance on Llama3 and Qwen3 models, achieving effective 2-bit generation while surpassing existing fixed and mixed-precision methods. This development is significant for practitioners as it establishes a new Pareto frontier for balancing generation quality and inference speed, particularly in resource-constrained environments.

arXiv cs.AI | 34 d ago | found 15 d ago | #quantization #llm #performance

Translating Inference-Time Control to Radiology Vision-Language Models: Activation Steering for Pneumonia Classification on Chest X-rays

The study evaluates the effectiveness of Contrastive Activation Addition (CAA) for enhancing pneumonia classification in three frozen vision-language models (VLMs): MedGemma-4B-IT, NV-Reason-CXR-3B, and CheXOne-3B, using the Kermany pneumonia test set. Notably, NV-Reason-CXR-3B showed significant performance improvement in F1 score from 0.7692 to 0.8727 with image-conditioned steering, while CheXOne-3B demonstrated a smaller increase. This research suggests that activation steering can effectively modify VLM behavior for medical diagnostics without requiring model weight updates, offering a potentially lightweight solution for practitioners in medical AI applications.

arXiv cs.AI | 34 d ago | found 16 d ago | #medical #vlm #classification
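
The core arithmetic of Contrastive Activation Addition can be sketched in a few lines: a steering vector is the difference between the mean activations of two contrasting example sets, and it is added (scaled) to a hidden state at inference time. The sketch uses plain Python lists in place of real model activations; the function names and the `alpha` scale are illustrative assumptions, and the layer choice and image conditioning used in the study are not modeled.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Mean activation on positive examples minus mean activation on negative ones."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]


def steer(hidden: Sequence[float], vec: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction; model weights stay frozen."""
    return [h + alpha * v for h, v in zip(hidden, vec)]
```

Since only activations are modified, the same frozen checkpoint can be steered toward or away from a behavior by changing the sign or size of `alpha`.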

Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

The paper introduces ComprExIT, a novel context compression framework designed to enhance the efficiency of long-context LLMs by addressing structural bottlenecks in existing LLM-based compressors. ComprExIT improves contextual information preservation through adaptive feature selection and a globally coordinated transport plan, yielding up to 18.5% better average F1 scores across 12 datasets while only increasing trainable parameters by approximately 1% and achieving over 2x faster compression than current leading methods. This advancement is significant for practitioners looking to optimize the performance and deployment of LLMs with long context capabilities.

arXiv cs.CL | 34 d ago | found 12 d ago | #context compression #information transmission

LLM-Aided A* Search in Non-Geometric Network Graphs

The article introduces an LLM-aided A* search algorithm designed for non-geometric network graphs, where edge weights represent metrics such as latency or cost. By utilizing large language models to generate intermediate waypoints and employing landmark distances as an admissible heuristic, the approach significantly reduces the number of expanded nodes by approximately 50% with only a slight increase in path cost. This research highlights the effectiveness of integrating LLMs with traditional search algorithms, offering a promising avenue for enhancing network optimization strategies in scenarios lacking geometric distance information.

arXiv cs.AI | 34 d ago | found 14 d ago | #shortest-path #non-geometric #llm
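
The landmark heuristic mentioned above can be sketched independently of the LLM component. For an undirected, connected graph, the triangle inequality gives |d(L, goal) - d(L, n)| <= d(n, goal) for any landmark L, so the maximum over landmarks is an admissible (and consistent) A* heuristic even when nodes have no coordinates. The code below is a generic illustration of that idea, not the paper's algorithm; the LLM-generated waypoints are omitted, and all names are assumptions.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]  # undirected: node -> {neighbor: edge weight}


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Exact shortest-path distances from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def astar(graph: Graph, start: str, goal: str,
          landmarks: List[str]) -> Tuple[float, int]:
    """Return (shortest path cost, number of expanded nodes)."""
    tables = [dijkstra(graph, lm) for lm in landmarks]  # precomputed once per landmark

    def h(n: str) -> float:
        # Landmark lower bound on d(n, goal); falls back to 0 with no landmarks.
        return max((abs(t[goal] - t[n]) for t in tables), default=0.0)

    best = {start: 0.0}
    heap = [(h(start), start)]
    closed = set()
    while heap:
        _, u = heapq.heappop(heap)
        if u in closed:
            continue
        closed.add(u)
        if u == goal:
            return best[u], len(closed)
        for v, w in graph[u].items():
            g = best[u] + w
            if g < best.get(v, float("inf")):
                best[v] = g
                heapq.heappush(heap, (g + h(v), v))
    raise ValueError("goal is unreachable from start")
```

With an empty landmark list the search degenerates to Dijkstra, which makes it easy to compare expanded-node counts with and without the heuristic on a given graph.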

Executing as You Generate: Hiding Execution Latency in LLM Code Interpreters

The article introduces EAGER, a code execution framework for LLMs that overlaps code execution with code generation, reducing end-to-end latency. Using a three-stage pipeline of generation, detection, and execution, EAGER applies AST-based chunking and dynamic batching to achieve up to a 99.8% reduction in non-overlapped execution time and up to a 37.3% reduction in overall latency across various benchmarks and LLMs. This matters for practitioners building LLM-based applications that require real-time code execution.

arXiv cs.AI | 34 d ago | found 14 d ago | #LLM #code interpreter #execution latency
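
One plausible way to detect executable chunks in a stream of generated Python is to buffer lines until they parse as complete top-level statements, then hand that chunk to the interpreter while generation continues. The sketch below shows only this detection step, using the standard `ast` module; it is an illustration of AST-based chunking in general, not EAGER's detector, and `stream_chunks` is a hypothetical name.

```python
import ast
import re
from typing import Iterable, Iterator, List

# Keywords that continue the previous compound statement rather than start a new one.
CONTINUATIONS = ("elif", "else", "except", "finally")


def _parses(src: str) -> bool:
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False


def stream_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield complete top-level chunks as soon as the next statement begins."""
    pending: List[str] = []
    for line in lines:
        word = re.match(r"[A-Za-z_]\w*", line)
        starts_statement = (
            line[:1] not in ("", " ", "\t")
            and (word is None or word.group(0) not in CONTINUATIONS)
        )
        # A new unindented statement closes the buffered one if it is syntactically whole.
        if pending and starts_statement and _parses("\n".join(pending)):
            yield "\n".join(pending)
            pending = []
        pending.append(line)
    if pending:
        yield "\n".join(pending)
```

Each yielded chunk could then be passed to `exec` with a shared namespace dictionary, so earlier statements run while later ones are still being generated.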

OpenWER: Improving Cross-Lingual ASR Evaluation and Enabling Token-Based Accuracy Metrics

OpenWER is an open-source tool designed to enhance the robustness of Word Error Rate (WER) in cross-lingual Automatic Speech Recognition (ASR) evaluations. It introduces language-specific normalization and compound word detection, along with a token-based Levenshtein alignment that allows for more granular accuracy metrics, resulting in WER reductions of up to 25% across 52 languages compared to existing libraries. This advancement is significant for practitioners as it promotes fairer evaluations in ASR research, particularly for low-resource languages, thereby improving the reliability of multilingual models.

arXiv cs.CL | 34 d ago | found 13 d ago | #asr #evaluation #cross-lingual
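
For readers less familiar with the metric, WER is the token-level Levenshtein distance between hypothesis and reference divided by the reference length. The sketch below is a generic implementation with only lowercasing and whitespace tokenization; it does not reproduce OpenWER's language-specific normalization or compound detection, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions, and deletions between token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate after lowercasing and whitespace tokenization."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Scoring the reference "ice cream" against the hypothesis "icecream" with this naive version gives a WER of 1.0 (one substitution plus one deletion over two reference words), which is the kind of penalty that compound-aware normalization is meant to avoid.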

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

The paper introduces LUQ (Layerwise Ultra-Low Bit Quantization), a novel method for ultra-low-bit quantization (<4-bit) of multimodal large language models (MLLMs), addressing the challenge of high memory and computational demands. LUQ leverages the varying entropy of activations across transformer layers to selectively apply quantization, resulting in models like LLaVA-1.5 and Qwen-2.5-VL using 40% and 31% less memory than standard 4-bit models, with less than 10% performance degradation on multimodal evaluation benchmarks. This approach is significant for practitioners as it enables more efficient deployment of MLLMs without substantial loss in performance.

arXiv cs.AI | 34 d ago | found 14 d ago | #quantization #multimodal-llm

Answer Engineering: Local Trajectory Editing for Protocol-Constrained Decision Making in Large Language Models

The paper introduces "Answer Engineering," a deterministic runtime intervention method for large language models that enhances protocol compliance in decision-making without requiring model retraining. Evaluated on a clinical benchmark for sudden sensorineural hearing loss (SSNHL), the approach improved compliance from 54.5% to 83.5% and increased balanced accuracy from 42.0% to 80.7% by applying localized rule-guided interventions during autoregressive generation. This method is significant for practitioners as it enables more reliable outputs in critical applications by addressing procedural adherence through auditable control mechanisms.

arXiv cs.AI | 34 d ago | found 20 d ago | #decision making #llm #answer engineering

Explanations for Automatic Speech Recognition

The paper presents a novel approach to quality assessment in neural network-based Automatic Speech Recognition (ASR) systems by generating explanations for transcriptions, which enhance understanding and trust in the models. It introduces a method that identifies a minimal and sufficient subset of audio frames responsible for a given transcription, adapting techniques such as Statistical Fault Localization (SFL) and Causal explanations, along with an adapted version of LIME for ASR. Evaluations conducted on ASR models including Google API, Sphinx, and Deepspeech using the Commonvoice dataset demonstrate the effectiveness of the proposed explanation techniques, which are crucial for practitioners seeking to improve interpretability and reliability in ASR systems.

arXiv cs.AI | 34 d ago | found 13 d ago | #asr #explanations #xai

Adding Robust Code-Switching Capabilities to High Performance Multilingual ASR

The paper introduces a method for enhancing multilingual automatic speech recognition (ASR) systems with robust code-switching (CSW) capabilities through Bayesian factorized adaptation. This approach improves transcription accuracy for code-switched words by 32.87% and overall word error rate (WER) by 5.31%, while preserving monolingual performance, indicating that effective CSW adaptation relies on knowledge integration rather than merely increasing data complexity. This advancement is significant for practitioners aiming to deploy ASR systems in multilingual environments where code-switching is prevalent.

arXiv cs.CL | 34 d ago | found 13 d ago | #asr #code-switching #multilingual

KaLM-Reranker-V1: Fast but Not Late Interaction for Compressed Document Reranking

KaLM-Reranker-V1 is a new fast but not late-interaction (FBNL) document reranker that decouples query and passage computation using an encoder-decoder architecture. It features three model sizes (Nano: 0.27B, Small: 1B, Large: 4B parameters) and employs Matryoshka embedding pooling for efficient passage encoding, maintaining strong relevance modeling through cross-attention. Benchmark results on BEIR, MIRACL, and LMEB show that KaLM-Reranker-V1 achieves state-of-the-art performance, rivaling larger models while offering significant efficiency advantages, making it a valuable tool for practitioners in retrieval systems.

arXiv cs.CL | 34 d ago | found 13 d ago | #reranking #document #efficiency

Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

This paper presents a structured prompt construction framework that enhances the accuracy of local LLMs in interpreting IoT sensor data by preprocessing raw measurements into enriched textual representations. Evaluated on datasets from Raspberry Pi and various cities, results indicate that local model accuracy improved significantly, with indoor accuracy rising from 50.9% to 81.7% and outdoor from 63.7% to 89.3% when using enriched prompts. This approach addresses latency and performance issues in edge AI deployments, making it a valuable technique for practitioners seeking to optimize real-time analytics in smart environments.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #iot #data #preprocessing

Human-Less LLM Serving: Quantifying the Human Tax on Throughput

This study quantifies the throughput loss in LLM serving systems due to human-centric latency metrics (TTFT and TPOT) when applied to long-horizon AI tasks that operate without human supervision. The research reveals that the "human tax" on throughput can range from 60-93% at 64K token contexts, particularly under high concurrency and tighter SLAs. The authors advocate for workload-class-aware SLA configurations to optimize performance for non-human tasks, suggesting that current serving systems may unnecessarily constrain throughput by uniformly applying human-focused metrics.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #throughput #latency

Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation

The article presents a training-free caching strategy for accelerating molecular geometry generation using flow matching models, which typically face high inference costs due to extensive network evaluations. This method predicts intermediate hidden states during solver steps and is compatible with SE(3)-equivariant backbones and pretrained models. Experiments on the GEOM-Drugs dataset show that this caching approach can halve wall-clock inference time while maintaining sample quality, and when combined with other optimizations, can achieve up to a 7x speedup, making it significant for practitioners seeking efficient molecular sampling solutions.

arXiv cs.AI | 34 d ago | found 13 d ago | #molecular #geometry #generation

EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

EquivPruner is a novel approach designed to enhance the efficiency and quality of LLM-based search by identifying and pruning semantically equivalent actions during reasoning processes. It introduces the MathEquiv dataset for training a lightweight equivalence detector, demonstrating significant improvements in token consumption and reasoning accuracy—specifically, a 48.1% reduction in token use while enhancing accuracy on the Qwen2.5-Math-7B-Instruct model tested on the GSM8K benchmark. This advancement is crucial for practitioners aiming to optimize LLM performance in domain-specific contexts, particularly in mathematical reasoning.

arXiv cs.AI | 34 d ago | found 15 d ago | #llm #search #efficiency

When Does Intrinsic Self-Correction Help? A Task-Sensitive Analysis

This study analyzes the effectiveness of intrinsic self-correction (SC) in large language models, revealing that its success is highly task-dependent. By investigating various mechanisms such as verifying constraints and revisiting complex reasoning, the authors demonstrate that SC can lead to performance improvements in specific contexts, suggesting that it should be considered a nuanced strategy rather than a universally applicable solution for enhancing model outputs. This insight is crucial for practitioners as it highlights the importance of task structure in determining the utility of SC during inference.

arXiv cs.AI | 34 d ago | found 14 d ago | #self-correction #llm #task-sensitive

Denoising Iterative Self-Correction: Structured Verification Loops for Reliable LLM Reasoning

The paper introduces Denoising Iterative Self-Correction (DISC), a novel test-time procedure designed to enhance multi-step reasoning in large language models by using verification outputs as noisy signals to progressively reduce errors. DISC employs a binary judgment gate to maintain the integrity of correct answers while iteratively correcting mistakes, achieving an accuracy of 81.6% on the BIG-Bench Mistake benchmark and outperforming existing methods like Chain-of-Verification and Self-Refine in precision-recall metrics. This approach is significant for practitioners as it offers a structured method to improve the reliability of LLM outputs, particularly in complex reasoning tasks.

arXiv cs.AI | 34 d ago | found 16 d ago | #self-correction #llm-reasoning #verification

Less is More: Lightweight Prompt Compression for Question Answering Applications on Edge Devices

The paper introduces CORE, a two-stage sentence-level prompt compression method designed for question answering applications on edge devices, eliminating the need for auxiliary small language models. CORE utilizes named entity recognition and semantic matching to construct and refine answer and clue sets, resulting in a 30.19% accuracy improvement, 50.47% memory reduction, and 1.94 times speedup on an NVIDIA Jetson AGX Orin, along with a 95.74% energy reduction compared to the LLMLingua2 method on smartphones. This advancement is significant for practitioners developing efficient AI applications on resource-constrained devices, enabling enhanced performance with lower computational overhead.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #prompt-compression #qa

Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice

The article presents a geometry-aware online scheduling method for Large Language Model (LLM) serving, introducing the Smallest Volume First (SVF) algorithm and its efficient variant, 1-bit SVF. The approach improves the management of dynamic memory footprints in inference engines, reducing the competitive ratio from 48 to 5 when output lengths are known and lowering both average and tail latency in extensive evaluations on Llama-3.1 models. For practitioners, the work offers a theoretically grounded and empirically validated solution for memory-constrained scheduling in LLM deployments, with the implementation available as a plug-and-play layer in vLLM.

arXiv cs.AI | 34 d ago | found 20 d ago | #LLM #scheduling #optimization

On the Expressive Power of Weight Quantization in Large Language Models

The paper presents a theoretical analysis of weight quantization in large language models, establishing that 1.58 bits is the minimum precision for effective weight quantization. It demonstrates that as quantization bits decrease, the expressive capacity of models diminishes polynomially, highlighting the trade-off between model compression and performance degradation. These insights are crucial for practitioners focused on optimizing model efficiency while maintaining expressive power in LLMs.

arXiv cs.AI | 34 d ago | found 15 d ago | #quantization #llm #weight
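
The 1.58-bit figure presumably corresponds to ternary weights, since a weight drawn from three values such as {-1, 0, +1} carries log2(3), or about 1.585, bits. This reading is an assumption based on the usual meaning of "1.58-bit" quantization and is not stated in the summary; the helper below is purely illustrative.

```python
import math


def bits_per_weight(levels: int) -> float:
    """Information content, in bits, of one weight drawn from `levels` distinct values."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return math.log2(levels)
```

For comparison, `bits_per_weight(2)` is 1.0 for binary weights and `bits_per_weight(16)` is 4.0 for a 4-bit codebook, so the ternary case sits between one and two bits.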

Simulated Customers Never Walk Away: Decision Fidelity of LLM User Simulators Measured Against Real Purchase Outcomes

This study introduces the concept of decision fidelity in evaluating LLM-based user simulators for conversational AI, highlighting a significant gap in existing frameworks that focus solely on communicative fidelity. The authors analyze 2,790 real customer interactions with LLM sales agents, revealing a "disengagement deficit" where simulators misrepresent non-buyers' behaviors, leading to inflated engagement metrics and misleading training outcomes. This finding is crucial for practitioners as it underscores the need for more accurate simulation models that reflect genuine user decision-making processes to avoid overestimating the effectiveness of AI-driven sales agents.

arXiv cs.AI | 34 d ago | found 20 d ago | #llm #user simulation #decision fidelity

SVD-Surgeon: Optimal Singular-Value Surgery for Large Language Model Compression

SVD-Surgeon is a novel training-free method for compressing large language models (LLMs) using singular value decomposition (SVD), enhancing the Optimal Brain Surgeon (OBS) framework. It optimally updates retained singular values to mitigate the loss from truncation and provides a saliency measure for pruning decisions. When applied to the SVD-LLM method, SVD-Surgeon demonstrates improved perplexity-compression trade-offs on the OPT family and LLaMA 2-7B models, making it a valuable tool for practitioners looking to optimize LLM deployment without retraining.

arXiv cs.CL | 34 d ago | found 13 d ago | #compression #llm #singular value decomposition

Breaking the Likelihood Trap: Variance-Calibrated Modulation for Large Language Model Decoding

The article introduces Variance-Calibrated Modulation (VCM), a training-free pre-decoding method designed to mitigate the "likelihood trap" in large language models (LLMs). VCM employs two mechanisms: Contextual Searchlight via Pointwise Mutual Information (PMI) to enhance contextually relevant tokens while suppressing stopwords, and Adaptive Self-Debiasing for scale-invariant penalization based on logit standard deviation. This approach improves the diversity and coherence of generated text across tasks such as open-ended generation and factual question answering, with minimal computational overhead, making it a valuable tool for practitioners aiming to enhance LLM performance.

arXiv cs.CL | 34 d ago | found 13 d ago | #llm #decoding #variance-calibration

Pessimistic Verification for Open Ended Math Questions

The article introduces a new verification paradigm called pessimistic verification for math-solving agents, which enhances error detection by rejecting solutions flagged by any of multiple parallel verifiers. It also presents progressive pessimistic verification, utilizing fine-grained proof decomposition to improve verification accuracy and efficiency, outperforming existing methods like extended long chain-of-thought workflows. This approach demonstrates significant advancements in solving complex math problems, as validated on the IMO 2025 and MathArena Apex 2025 datasets, making it relevant for practitioners seeking robust verification techniques in AI-driven math solutions.

arXiv cs.AI | 34 d ago | found 15 d ago | #verification #math solving #agent workflows
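
The acceptance rule described above, rejecting a solution if any verifier flags it, can be written down directly and contrasted with majority voting. In this sketch each verifier is a hypothetical callable that returns True when it finds no error; in practice these would be independent LLM verification passes, and the progressive, proof-decomposition variant is not modeled.

```python
from typing import Callable, Sequence

Verifier = Callable[[str], bool]  # returns True when no error is found


def pessimistic_accept(solution: str, verifiers: Sequence[Verifier]) -> bool:
    """Accept only if every verifier passes; a single flag rejects the solution."""
    return all(v(solution) for v in verifiers)


def majority_accept(solution: str, verifiers: Sequence[Verifier]) -> bool:
    """Baseline for comparison: accept if more than half of the verifiers pass."""
    passes = sum(1 for v in verifiers if v(solution))
    return passes * 2 > len(verifiers)
```

The pessimistic rule trades some false rejections for higher error detection: a flawed proof that slips past two of three verifiers is accepted by majority vote but rejected here.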

The Language-Energy Divide: Measuring Energy Costs of Multilingual LLM Inference

This study introduces the ML.Energy framework to analyze the energy consumption of multilingual large language models (LLMs) during inference. It reveals that energy costs can vary significantly across languages, with disparities of up to 8.3 times per output token and up to 179 times for total energy consumption for a fixed set of requests, highlighting a systemic energy inequity in multilingual deployments. The findings underscore the importance of incorporating energy efficiency as a critical evaluation metric in LLM development and deployment, particularly for low-resource languages that exhibit both high energy costs and lower task accuracy.

arXiv cs.AI | 34 d ago | found 16 d ago | #llm #energy #multilingual #inference

ScalePredictor: Instance-aware Scale Learning for Accurate Quantization of Vision Transformers

ScalePredictor introduces a dynamic quantization framework for Vision Transformers (ViTs) aimed at improving post-training quantization (PTQ) efficiency. It leverages a correlation between shallow-layer activation distributions and optimal scales for deeper layers, employing a polynomial scale projection module for simultaneous quantization scale generation. This approach significantly enhances accuracy while minimizing computational overhead compared to existing static PTQ methods, making it particularly relevant for deploying ViTs on edge devices.

arXiv cs.AI 34 d ago found 16 d ago #quantization #vision transformers #scale learning

Not All Claims Are Equally Risky: FACTOR for Adaptive Verification in Factual Long-Form Generation

The article introduces FACTOR (FACTuality-Oriented Risk-aware Verification), an inference-time model designed to enhance the factual accuracy of long-form text generated by Large Language Models (LLMs) by adapting verification processes based on claim-level uncertainty. FACTOR employs uncertainty estimation, adaptive language inference verification, and candidate re-ranking, demonstrating improved factuality and reduced verification costs on the FactScore benchmark. This approach is significant for practitioners as it offers a model-agnostic solution to optimize verification efforts in LLM outputs, addressing the common issue of unsupported factual claims in generated text.

arXiv cs.AI 34 d ago found 15 d ago #factuality #verification #llm

Memory Is No Longer a Bottleneck: Memory-Efficient Graph Filtering for Scalable Collaborative Filtering

The article introduces Mem-GF, a memory-efficient graph filtering method for collaborative filtering that addresses the memory bottleneck associated with storing full item similarity graphs. By utilizing Krylov subspaces for approximating polynomial graph filters, Mem-GF achieves up to 5.74 times lower memory usage and 4.38 times faster runtime compared to existing methods, while also improving recommendation accuracy. This advancement is significant for practitioners as it allows for scalable collaborative filtering on large datasets without the prohibitive memory costs typically associated with traditional graph convolutional networks.

arXiv cs.AI 34 d ago found 16 d ago #collaborative filtering #graph networks #memory efficiency

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

The article introduces SAGA, a distributed scheduler designed for AI agent inference on GPU clusters, which shifts focus from individual LLM calls to program-level scheduling of entire agent workflows. SAGA implements three mechanisms: Agent Execution Graphs for predicting KV cache reuse, session-affinity batching with work stealing for load balancing, and the Agent Fair Share metric for fairness in task completion. In benchmarks on a 64-GPU cluster, SAGA achieved a 1.64x reduction in task completion time and improved GPU memory utilization by 1.22x, highlighting the importance of workflow-aware scheduling for optimizing latency in compound AI applications, despite a tradeoff in peak throughput.

arXiv cs.AI 34 d ago found 14 d ago #scheduling #gpu #llm

Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study

The study benchmarks the Mixture-of-Experts (MoE) model OLMoE-1B-7B (1.3B active of 6.9B total parameters) against dense models on consumer and edge hardware, specifically an Apple M2 Pro and an NVIDIA Jetson Orin Nano. Results indicate that while MoE models theoretically reduce per-token compute costs, in practice, they underperform compared to dense models due to factors like total memory footprint and expert dispatch, with OLMoE being 10% slower on the laptop and 31% slower on the edge device. This research highlights that on bandwidth-constrained hardware, inference costs are more influenced by total parameters rather than active ones, suggesting that sparse activation may not significantly improve efficiency in such environments.

arXiv cs.AI 34 d ago found 16 d ago #mixture-of-experts #inference #hardware

Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding

Whisper-CD is a novel contrastive decoding framework designed to enhance long-form speech recognition in large encoder-decoder models like Whisper, addressing issues such as hallucinations and repetition loops. It employs a training-free approach that contrasts clean-audio logits with negative logits derived from Gaussian noise, silence signals, and audio temporal shifts, achieving up to a 24.3 percentage point reduction in word error rate (WER) on the CORAAL benchmark and 48% faster token generation compared to traditional beam search. This drop-in solution allows practitioners to improve existing Whisper systems without the need for retraining, making it a practical enhancement for real-world applications.
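
The idea of contrasting clean-audio logits against several negative logits can be illustrated with a toy example. The summary does not give Whisper-CD's exact combination rule, so the formula below (amplify the clean logits and subtract the mean of the negatives, weighted by a hypothetical `alpha`) is an assumption borrowed from common contrastive-decoding practice, not the paper's method.

```python
from typing import List, Sequence


def contrastive_logits(
    clean: Sequence[float],
    negatives: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Return (1 + alpha) * clean - alpha * mean(negatives), per token.

    The combination rule and `alpha` are assumptions for illustration.
    """
    if not negatives:
        return list(clean)
    adjusted = []
    for i, clean_logit in enumerate(clean):
        negative_mean = sum(neg[i] for neg in negatives) / len(negatives)
        adjusted.append((1 + alpha) * clean_logit - alpha * negative_mean)
    return adjusted


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens. Token 0 scores highly even on noise,
# silence, and shifted audio, which suggests it is not driven by the speech.
clean = [2.0, 1.9, 0.1]
negatives = [
    [2.5, 0.2, 0.1],  # logits from a Gaussian-noise input
    [2.3, 0.4, 0.0],  # logits from a silence input
    [2.4, 0.3, 0.2],  # logits from temporally shifted audio
]
adjusted = contrastive_logits(clean, negatives)
print(argmax(clean), argmax(adjusted))
```

In this toy case the greedy choice moves from token 0 to token 1, because token 0 is favored regardless of the audio content and is therefore penalized.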

arXiv cs.AI 34 d ago found 14 d ago #speech recognition #contrastive decoding #llm

TIP-Search: Time-Predictable Inference Scheduling for Market Prediction under Uncertain Load

TIP-Search introduces a time-predictable inference scheduling method for market prediction under uncertain load, focusing on delivering timely and accurate predictions. It utilizes a systems-replay controller, OCO-ACPO, and its extension SA-OCO-ACPO, achieving raw accuracy of 0.994 and timely accuracy of 0.991, while significantly improving deadline satisfaction metrics. This work is critical for practitioners as it enhances the reliability of real-time market predictions, balancing accuracy and deadline adherence in dynamic environments.

arXiv cs.AI 34 d ago found 15 d ago #market prediction #scheduling #uncertain load

100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+

The user reports achieving approximately 100 tokens per second (t/s) with the Qwen3.6-27B model at Q8_0 using a dual GPU setup of an RTX 5090 and RTX 3090 Ti. The significant performance improvement from 70 t/s to 100+ t/s was attributed to switching to tensor split-mode, which allows both GPUs to work on the same tensors simultaneously, rather than alternating layers. This optimization is crucial for practitioners as it maximizes GPU utilization and throughput, particularly in setups with heterogeneous GPU architectures.

Reddit r/LocalLLaMA 34 d ago found 21 d ago #qwen #gpu #performance

Idea for how to run GLM2 at a decent quant, need critique/feedback

The article discusses a proposed setup for running the GLM2 model efficiently using a rig with four NVIDIA 5060 Ti GPUs, leveraging 64 GB of VRAM and aiming to optimize for inference tasks. The author suggests enhancing the system with 512 GB of DDR3 RAM on a compatible server motherboard, such as the Supermicro X9DRi-F, to achieve low-latency performance with the Qwen/Qwen3.6-27B-FP8 model, targeting 72 tokens per second at a maximum context of 262k. This configuration aims to address compute bottlenecks while minimizing costs, making it a potentially viable solution for practitioners focusing on high-performance inference with large language models.

Reddit r/LocalLLaMA 34 d ago found 21 d ago #glm2 #quantization #benchmarking

Top-N-Sigma: Remove unconditional softmax+sort by TimNN · Pull Request #22645 · ggml-org/llama.cpp

The Pull Request #22645 introduces a modification to the Top-N-Sigma sampler in the ggml-org/llama.cpp repository, eliminating the unconditional softmax and sort operations that were previously performed at the end of the sampling process. This change resulted in a performance improvement, increasing throughput from approximately 30 tokens per second (t/s) to 45 t/s on a MacBook Pro M3 Max, thereby reducing the time per token by 10 milliseconds. This enhancement is significant for practitioners as it optimizes the sampling process, particularly when Top-N-Sigma is used in conjunction with other samplers, potentially leading to more efficient model inference.
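
For readers unfamiliar with the sampler, the sketch below shows the top-nσ filtering rule as it is usually described: keep tokens whose logit lies within n standard deviations of the maximum logit. It is a pure-Python illustration, not the llama.cpp code, and the function name and the use of the population standard deviation are assumptions. It also shows why no softmax or sort is needed for the filter itself, since it works directly on raw logits.

```python
import math
from typing import List


def top_n_sigma_mask(logits: List[float], n: float = 1.0) -> List[float]:
    """Keep logits within n standard deviations of the max; mask the rest."""
    max_logit = max(logits)
    mean = sum(logits) / len(logits)
    # Population standard deviation of the raw logits (an assumption here).
    std = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max_logit - n * std
    return [x if x >= threshold else -math.inf for x in logits]


masked = top_n_sigma_mask([5.0, 4.5, 1.0, 0.5, 0.0], n=1.0)
print(masked)
```

As a check on the quoted figures, 1000/30 ms minus 1000/45 ms is about 11 ms per token, in line with the reported saving of roughly 10 ms.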

Reddit r/LocalLLaMA 34 d ago found 21 d ago #Top-N-Sigma #sampling #optimization

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

The GLM-5.2 model, specifically the unsloth/GLM-5.2-GGUF version with UD-IQ1_M quantization, was tested on a system with RTX 5090 and RTX 3090 Ti GPUs, achieving prefill speeds of approximately 579 tokens per second at an 8k context and 324 tokens per second at a 57k context. The model maintained a steady decoding speed of 10.6 tokens per second over 580+ tokens, demonstrating the performance capabilities of the architecture with a 128k context and q8_0 KV cache, which is significant for AI practitioners focusing on optimizing LLM performance and resource allocation.

Reddit r/LocalLLaMA 34 d ago found 21 d ago #GLM-5.2 #speed_test

Qwen3.6-35B-A3B APEX on a Single RTX 3090 - Getting the Most Out of It

The article discusses the optimization of the Qwen3.6-35B-A3B model on an RTX 3090 GPU, focusing on achieving high quality and speed with a minimum context length of 128k. Benchmark tests using two forks of llama.cpp (ik_llama and spiritbuun) reveal that the ik_llama with the I-Compact APEX model achieves the highest decoding speed (~146 TPS), while spiritbuun's I-Quality model maintains competitive speeds (~137 TPS) with better quality metrics. This information is crucial for practitioners looking to maximize performance and efficiency when deploying large language models on limited hardware.

Reddit r/LocalLLaMA 34 d ago found 21 d ago #Qwen #RTX_3090 #optimization

Gemma 4 31B Q6 on Dual 9060 XT

Gemma 4, a 31 billion parameter model, has been tested on a dual setup of 9060 XT GPUs with 16GB memory each, achieving a throughput of approximately 8-9 tokens per second. This performance is perceived as lower than expected by some users, indicating potential optimization opportunities for practitioners. The findings are relevant for developers seeking to optimize LLM performance on specific hardware configurations.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #gemma #performance #benchmarking

Nemotron ultra living on the edge on 4 sparks

The post describes deploying the NVIDIA Nemotron-3 Ultra model, which has 550 billion parameters, across four Spark unified-memory devices using the eugr/spark-vllm-docker setup. The author notes challenges with memory management, particularly when pushing memory usage to 95%, which shows how tight the margins are when running a model of this size on such hardware. The post is useful for practitioners as it demonstrates both the feasibility and the limits of running large-scale models in constrained settings.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #qwen #performance

A100 slow Qwen3.6-27B-FP8

The Qwen3.6-27B-FP8 model was benchmarked on an NVIDIA A100 80GB GPU, yielding a decoding rate of 43 tokens per second (tps) for single requests and 177 tps for eight concurrent requests. The same model configuration on an RTX 6000 PRO achieved 130 tps for single requests and 509 tps for concurrent requests, roughly a threefold gap. One likely factor is that the A100's Ampere architecture has no native FP8 tensor-core support, a point worth checking for practitioners running FP8 models on that card.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #qwen #performance

2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp

The article discusses the implementation of a multi-GPU setup using two Gigabyte Radeon AI PRO R9700 GPUs to run the Qwen 3.6 model with 27 billion parameters in a llama.cpp environment. Key performance metrics include decoding rates of 46-67 tokens per second for various context sizes up to 102k tokens and prefill throughput of approximately 1,200-1,500 tokens per second for prompts under 10k tokens. This setup is significant for practitioners as it demonstrates effective use of high VRAM GPUs for large context processing in AI applications, alongside insights into optimizing token generation and memory management.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #qwen #multi_gpu

ROCm vs Vulkan vs vLLM on Dual R9700's

The article presents performance benchmarks for the Qwen 3.6 models (35B-A3B and 27B) using different backends: ROCm, Vulkan, and vLLM. The vLLM backend demonstrated significant improvements, achieving up to 156 tokens per second (t/s) for the 35B-A3B model with ROCm + AITER, compared to 106 t/s and 87 t/s for ROCm and Vulkan, respectively. This indicates that vLLM could be a more efficient option for practitioners looking to optimize model performance and concurrency in AI applications.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #qwen #performance

Rollin' MiMo-2.5 on two Halo Strixeses

The post describes deploying the MiMo-2.5 model on two 128GB AMD Strix Halo machines (Radeon 8060S graphics), using Proxmox for container management and a secondary USB4 network link. It reports figures of 356 for prompt processing and 15 for token generation (presumably tokens per second) at a context length of 10,000 tokens, and highlights the difficulty of building and serving models with backends such as vLLM and SGLang on consumer hardware. This information is relevant for practitioners as it gives practical performance numbers and shows the complexities of model deployment in non-datacenter environments.

Reddit r/LocalLLaMA 35 d ago found 21 d ago #mimo #performance

8-16 MI50s Minimax M3 @19 tps TG (peak)

The post reports performance benchmarks for the MiniMax M3 model running on 8-16 MI50 GPUs, achieving a peak throughput of 19 tokens per second (TPS) for text generation. The inference engine is a fork of vLLM (v0.23.1) with ROCm 7.2.1, and the setup includes optimizations such as INT4 quantization and FP16 dequantization. These results highlight potential improvements in speed and output quality for practitioners, particularly in optimizing software and hardware configurations for agentic coding tasks.

Reddit r/LocalLLaMA 36 d ago found 21 d ago #minimax #performance

Gemma 4 QAT seems to respond significantly better to KV cache quantization

Gemma 4's quantization-aware training (QAT) model shows improved performance with key-value (KV) cache quantization, particularly with a Q8_0 configuration. Benchmark results using KL Divergence on Wikitext with a 16k context indicate that the model maintains a 99.9% KLD, suggesting effective attention retention on important tokens. This improvement is significant for practitioners as it enhances the efficiency of deploying LLMs in resource-constrained environments while maintaining performance.

Reddit r/LocalLLaMA 36 d ago found 21 d ago #quantization #kv_cache #gemma

GLM 5.2, what speeds are we getting locally?

The community is discussing performance metrics for the GLM 5.2 model when run locally, soliciting reports on inference engines, system specifications, quantization methods, context sizes, and tokens per second. One user reported using the llama.cpp framework on a system with 6 RTX 3090 GPUs and an i7-13700K processor, achieving 7.8 tokens/sec for generation with a 90K context size and Q8_0 KV quantization. This information is crucial for practitioners optimizing local deployments of large language models, as it provides benchmarks to assess performance under various configurations.

Reddit r/LocalLLaMA 36 d ago found 22 d ago #glm #inference #performance

RTX 5090 MSI, only inference or training at 475-500W. Make sure to not bend you cable!

The MSI RTX 5090 operates at a power draw of 475-500W, primarily utilized for diffusion training and LLM inference. The user emphasizes the importance of ensuring that the power cable is not bent to avoid potential issues, highlighting the card's reliability in AI and machine learning tasks. This information is crucial for practitioners considering power management and hardware setup for intensive AI workloads.

Reddit r/LocalLLaMA 37 d ago found 22 d ago #RTX 5090 #inference #training

$1800 (in GPU cost running with P2P running Qwen/Qwen3.6-27b-FP8 with 262K context and BF16 KV cache at 55 tok/s

The article discusses a setup for running the Qwen3.6-27B model in an inference-only context using four NVIDIA 5060 Ti GPUs, totaling approximately $1,800 in GPU costs. The configuration achieves a benchmark output token throughput of 55.67 tokens per second with a maximum context length of 262,144 tokens and utilizes a BF16 KV cache. This setup is significant for practitioners as it demonstrates a cost-effective way to leverage large language models for inference, highlighting the importance of efficient GPU utilization and configuration in real-time applications.

Reddit r/LocalLLaMA 37 d ago found 22 d ago #qwen #context #performance

Maximizing performance of 2x3090 + NVLink

The post describes a setup with dual NVIDIA GeForce RTX 3090 GPUs connected via NVLink, running Ubuntu 24.04 on a Ryzen 9 7950X3D processor with 64GB of DDR5 RAM. The user reports a maximum throughput of approximately 60 tokens per second (TPS) during brief bursts, with an average of around 40-45 TPS, while running the Qwen 3.6 27B Q8_0 model with MTP and graph splitting. This highlights the performance limits of high-end consumer hardware in AI workloads and prompted discussion of optimization strategies among practitioners.

Reddit r/LocalLLaMA 37 d ago found 24 d ago #performance #3090 #nvlink

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

The article discusses the setup and performance improvements achieved using two AMD Radeon AI PRO R9700 GPUs (RDNA4 architecture) with vLLM version 0.22.1. The key technical advancement was addressing the long-context decode performance drop, which previously saw a significant decline in throughput from ~100 tok/s at 8K context to just 14 tok/s at 79K context, attributed to unoptimized ROCm attention paths. By implementing AITER Unified Attention, the authors were able to mitigate this issue, enhancing the decoding efficiency for large context sizes, which is critical for practitioners aiming to optimize LLM performance on AMD hardware.

Reddit r/LocalLLaMA 37 d ago found 24 d ago #vllm #long-context #decode

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

StreamKL introduces a novel GPU primitive for efficiently computing Kullback-Leibler (KL) divergence in attention distillation, significantly reducing the memory and I/O costs associated with existing methods. By employing an online formulation that streams query-key tiles through on-chip SRAM, StreamKL achieves up to 43x speedup in the forward pass and 14x in the backward pass, while reducing the memory footprint from $O(N_Q N_K)$ to $O(1)$. This advancement allows for long-context attention distillation on a single GPU, making it a critical tool for practitioners in knowledge distillation and model compression.
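
The online formulation can be illustrated on the CPU for a single row of scores. The sketch below is not StreamKL's kernel: it is a pure-Python illustration, under the assumption that the quantity of interest is KL(softmax(teacher) || softmax(student)), of how running maxima and rescaled sums let the divergence be accumulated tile by tile without ever holding the full row. All function and variable names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple

Tile = Tuple[Sequence[float], Sequence[float]]


def streaming_kl(tiles: Iterable[Tile]) -> float:
    """KL(softmax(teacher) || softmax(student)) from logits seen tile by tile."""
    m_t = m_s = -math.inf  # running maxima of teacher / student logits
    z_t = z_s = 0.0        # running sums of exp(logit - running max)
    w = 0.0                # running sum of exp(t - m_t) * (t - s)
    for teacher, student in tiles:
        new_m_t = max(m_t, max(teacher))
        new_m_s = max(m_s, max(student))
        # Rescale earlier partial sums to the new maxima (online softmax).
        rescale_t = math.exp(m_t - new_m_t)
        rescale_s = math.exp(m_s - new_m_s)
        z_t *= rescale_t
        w *= rescale_t
        z_s *= rescale_s
        for t, s in zip(teacher, student):
            e_t = math.exp(t - new_m_t)
            z_t += e_t
            w += e_t * (t - s)
            z_s += math.exp(s - new_m_s)
        m_t, m_s = new_m_t, new_m_s
    log_z_t = m_t + math.log(z_t)
    log_z_s = m_s + math.log(z_s)
    # KL = sum_i p_i * (t_i - s_i) - logZ_t + logZ_s
    return w / z_t - log_z_t + log_z_s


def direct_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """Reference implementation that materializes both full distributions."""
    def log_softmax(xs: Sequence[float]) -> list:
        m = max(xs)
        lse = m + math.log(sum(math.exp(x - m) for x in xs))
        return [x - lse for x in xs]

    log_p, log_q = log_softmax(teacher), log_softmax(student)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))


teacher = [1.0, 3.0, 0.5, 2.0, -1.0, 4.0]
student = [0.5, 2.5, 1.0, 2.0, 0.0, 3.0]
tiles = [(teacher[i:i + 2], student[i:i + 2]) for i in range(0, 6, 2)]
print(streaming_kl(tiles), direct_kl(teacher, student))
```

The streaming version keeps only five scalars of state per row, which is the sense in which the extra memory is constant in the number of keys.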

arXiv cs.AI 38 d ago found 23 d ago #attention distillation #kl divergence #optimization

Where to Place the Query? Unveiling and Mitigating Positional Bias in In-Context Learning for Diffusion LLMs via Decoding Dynamics

This paper investigates the effects of query placement in In-Context Learning (ICL) for Diffusion Large Language Models (dLLMs), highlighting that query position acts as a first-order variable affecting generation quality. It introduces Average Confidence ($\overline{C}$) as a novel metric for evaluating decoding processes and presents Auto-ICL, an adaptive routing strategy for optimizing query placement without the need for ground-truth labels. These advancements are significant for practitioners as they enhance the understanding of dLLM behavior and provide methods to improve model performance in various reasoning and perception tasks.

arXiv cs.AI 38 d ago found 23 d ago #query placement #in-context learning #diffusion models

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E is a novel test-time method designed to enhance the performance of frozen vision-language models (VLMs) by optimizing question-conditioned visual spotlights through lightweight tuning with Group Relative Policy Optimization (GRPO). It introduces an entropy-shaping objective that balances answer-span prediction uncertainty while maintaining high-confidence tokens, resulting in improved robustness and consistent performance gains across various benchmarks and VLM families. This approach is significant for practitioners as it allows for enhanced grounding in evidence-intensive tasks without the need for retraining models.

arXiv cs.AI 38 d ago found 23 d ago #visualization #vlm #evidence

GLARE: A Natural Language Interface for Querying Global Explanations

The article presents GLARE, an LLM-based interactive interface designed for querying global explanations of black-box image classifiers using natural language. The system utilizes a core LLM to convert user questions into structured SQL queries, allowing for flexible data aggregation and presenting results as statistics-augmented natural language responses with visualizations. This approach enhances the accessibility and usability of global explanations in explainable AI (XAI), particularly for practitioners seeking targeted insights from complex models.

arXiv cs.AI 38 d ago found 24 d ago #explanations #querying #nlp

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

The paper evaluates the effectiveness of per-token KL divergence (KLD) as a fidelity metric for quantized large language models (LLMs), specifically analyzing a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B. The study finds a strong correlation between KLD and benchmark scores across the cohorts, but this correlation diminishes significantly in low-performance scenarios, indicating that while KLD can indicate disagreement volume, it lacks reliability in predicting model performance across different tasks. This research highlights the need for more robust metrics in assessing quantized LLMs, particularly for practitioners focused on deployment in diverse applications.

arXiv cs.CL 38 d ago found 22 d ago #fidelity-metrics #quantization #llm #benchmarking

Closing the Calibration Gap in Semantic Caching

The article introduces a new evaluation framework for semantic caching in large language models (LLMs), addressing the inadequacies of the PR-AUC metric by proposing two novel metrics: Precision-Cache Hit Ratio (P-CHR) AUC and Calibration Retention Rate (CRR). These metrics account for cache utilization and the retention of ranking quality in real-world deployments, revealing that the operational gap between offline model performance and deployment efficacy is primarily influenced by the training objective rather than dataset scale. This work emphasizes the importance of calibration in model selection for semantic caching, suggesting that practitioners should focus on calibration metrics to improve deployment outcomes.

arXiv cs.CL 38 d ago found 22 d ago #semantic-caching #calibration #llm #metrics

Token-Operations-Oriented Inference Optimization Techniques for Large Models

The paper introduces a novel four-layer technical architecture for optimizing inference in large models, focusing on token-oriented techniques. It includes Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion, providing a comprehensive review of related technologies and their application in real-world scenarios. This approach aims to reduce token production costs and enhance service efficiency, which is crucial for practitioners aiming to improve the operational stability and scalability of large model services.

arXiv cs.CL 38 d ago found 22 d ago #inference-optimization #large-models #token-operations

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

The paper introduces GRACE (Granularity-Regulated Adaptive Computational Efficiency), a theoretical framework for optimizing verification granularity in test-time scaling (TTS) of large language models (LLMs). It identifies a phase transition in verification strategy effectiveness based on compute budget and problem difficulty, demonstrating that fine-grained verification is superior for complex problems with ample compute, while coarse-grained verification is better for simpler tasks with limited resources. Empirical results across MATH-500, GSM8K, and AIME benchmarks show that the adaptive strategy can improve accuracy by up to 3.1% compared to fixed-granularity approaches, offering a significant advancement for practitioners in optimizing inference performance.

arXiv cs.CL 38 d ago found 22 d ago #LLM #verification #computational efficiency

Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

The paper introduces SEVRA (Selective Verification for Reasoning Allocation), a deployment framework that optimizes the use of computational resources during test-time reasoning by deciding when to verify a frozen solver's initial output. Using the Qwen3-4B model, SEVRA achieves 76.3% accuracy with a 26.8% reduction in post-generation tokens compared to constant verification, while also minimizing harmful answer changes. This approach highlights the importance of budget-aware reasoning strategies, suggesting that practitioners can enhance efficiency and accuracy by selectively verifying outputs rather than adopting a one-size-fits-all verification approach.

arXiv cs.AI 38 d ago found 24 d ago #reasoning #verification #budget

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

A new framework called MACR has been proposed for resolving conflicts between parametric and contextual knowledge in large language models (LLMs). This framework utilizes an adaptive knowledge assessment approach based on a modified semantic entropy measure to evaluate the model's confidence and employs an inductive multi-agent reasoning system to analyze and resolve inconsistencies among various knowledge sources. MACR has shown significant improvements over existing state-of-the-art methods in empirical benchmarks, enhancing the reliability of LLMs in scenarios where both internal and external information may be erroneous.

arXiv cs.AI 38 d ago found 24 d ago #llm #knowledge #conflict resolution

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

NIM4-ASR is a new LLM-based automatic speech recognition framework that addresses scalability and robustness in resource-constrained environments. It features a redesigned multi-stage training paradigm, including a pre-training architecture aimed at reducing modality gaps, an asynchronous supervised fine-tuning stage to maintain acoustic fidelity, and a reinforcement learning component to enhance recognition quality. With only 2.3 billion parameters, NIM4-ASR achieves state-of-the-art performance on public benchmarks and excels in real-world scenarios, supporting rapid hotword customization through retrieval-augmented generation for efficient adaptation to user needs.

arXiv cs.CL 38 d ago found 22 d ago #asr #llm #efficiency

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

The article introduces S2D2, a training-free self-speculative decoding framework designed for block-diffusion language models, which enhances decoding speed without additional training or significant test-time compute. S2D2 allows a pretrained block-diffusion model to function as both drafter and verifier by reducing block size to one, resulting in a hybrid decoding method that improves the accuracy-speed tradeoff. Benchmark results indicate S2D2 achieves up to 4.7× speedup over autoregressive decoding and up to 1.57× over dynamic baselines, while enhancing accuracy by up to 4.5 points, making it a valuable tool for practitioners seeking efficient LLM generation.

arXiv cs.CL 38 d ago found 22 d ago #diffusion-models #decoding #self-speculation

GLM-5.2 (744B, 2-bit) at 7.3 tok/s on 4×3090 + 192GB — and why IQ1_M wasn't any faster

The GLM-5.2 model, featuring 744 billion parameters and utilizing a 2-bit quantization, achieves a decoding speed of approximately 7.3 tokens per second when run on a setup with four RTX 3090 GPUs and 192GB of RAM. Key findings indicate that decoding performance is primarily limited by CPU compute when offloading experts, rather than memory bandwidth, with an observed 22% speed increase when doubling CPU threads from 6 to 12. This information is crucial for practitioners as it highlights the importance of CPU resources and expert distribution for optimizing performance in large language model deployments.
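
A quick back-of-the-envelope calculation shows why expert offloading to system RAM is unavoidable in this setup. The sketch below assumes exactly 2.0 bits per weight (real 2-bit quantization formats typically average somewhat more) and 24 GB of VRAM per RTX 3090, and it ignores KV cache and activation memory. The function name is hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params_billions * bits_per_weight / 8


model_gb = weights_gb(744, 2.0)       # 744B parameters at an assumed 2.0 bits
vram_gb = 4 * 24                      # four RTX 3090 cards, 24 GB each
ram_gb = 192                          # system RAM from the post
spill_gb = max(0.0, model_gb - vram_gb)

print(f"weights: {model_gb:.0f} GB, VRAM: {vram_gb} GB, RAM: {ram_gb} GB")
print(f"at least {spill_gb:.0f} GB of weights must live in system RAM")
```

Under these assumptions the weights come to about 186 GB against 96 GB of VRAM, so roughly half the model is served from system RAM. That is consistent with the post's finding that CPU-side work dominates decode speed.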

Reddit r/LocalLLaMA 38 d ago found 24 d ago #glm-5.2 #performance #3090

AI inference startup Baseten reportedly raising $1.5B months after its last mega-round

Baseten is reportedly nearing the completion of a $1.5 billion funding round, valuing the company at $13 billion. This funding comes amid a growing demand for AI inference solutions, indicating significant investment interest in technologies that optimize model deployment and inference performance, which is critical for practitioners seeking scalable AI applications.

TechCrunch AI 38 d ago found 24 d ago #ai #inference #funding

I have an old multi-GPU node lying around at work...

The article discusses a multi-GPU node featuring 8 NVIDIA Quadro RTX 6000 GPUs, totaling 192 GB VRAM, alongside 512 GB RAM and 112 CPU threads, which is currently underutilized. The author seeks to identify models that can leverage this hardware for local inference, particularly those that would benefit from the increased computational resources compared to a single GPU setup. This setup could facilitate the deployment of larger models or those requiring significant parallel processing, which is crucial for practitioners aiming to optimize inference times and handle more complex tasks in AI applications.

Reddit r/LocalLLaMA 38 d ago found 25 d ago #gpu #inference #local #models

Cutting LLM Token Costs with rtk, headroom, and caveman - savings measured on real workloads

The article evaluates the token cost savings of three tools—rtk, headroom, and caveman—when applied to real workloads from Claude Code sessions, totaling 614M tokens and $926 in baseline spend. The results showed modest savings: headroom achieved a 2.8% reduction ($25.61), rtk 0.5% ($4.94), and caveman 0.4% ($3.58), with a combined savings of 3.7% ($34.12). The limited impact on overall costs is attributed to the nature of the workloads, where high-compression techniques are less effective on plain text and source code, and the majority of the billing is derived from cache reads and outputs that these tools do not optimize.
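
The percentages follow directly from the dollar figures, as the short check below shows. It uses only the numbers quoted in the summary; the variable names are illustrative.

```python
BASELINE_USD = 926.00
savings_usd = {"headroom": 25.61, "rtk": 4.94, "caveman": 3.58}


def percent_of_baseline(amount_usd: float) -> float:
    return 100 * amount_usd / BASELINE_USD


for tool, amount in savings_usd.items():
    print(f"{tool}: ${amount:.2f} = {percent_of_baseline(amount):.1f}%")

combined = sum(savings_usd.values())
print(f"combined: ${combined:.2f} = {percent_of_baseline(combined):.1f}%")
```

The per-tool amounts sum to $34.13, one cent off the reported $34.12, presumably because the individual figures were rounded. Either way the combined saving rounds to 3.7%.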

Reddit r/LocalLLaMA 38 d ago found 25 d ago #llm #cost_reduction #optimization

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

The article introduces Dynamic Sliding Block (DSB), a novel training-free block scheduling method for diffusion large language models (dLLMs) that adapts block sizes based on semantic difficulty, enhancing both output quality and inference efficiency. It also presents DSB Cache, a KV-cache mechanism designed to optimize the DSB approach. Experimental results show significant improvements in generation quality and efficiency across multiple models and benchmarks, making it a valuable advancement for practitioners working with dLLMs.

arXiv cs.CL 39 d ago found 24 d ago #diffusion #llm #scheduling

How to turn off AI in your Google Docs

The article provides instructions for disabling the "write with Gemini" feature in Google Docs, which is part of Google's integration of AI capabilities into their document editing platform. This adjustment allows users to eliminate AI prompts and operate without the assistance of the Gemini model. This is relevant for practitioners who may want to maintain a traditional workflow without AI interference in their document creation process.

TechCrunch AI 39 d ago found 25 d ago #google docs #ai #settings

llama.cpp - how to free up even more space on your GPU

The post covers llama.cpp options for freeing GPU memory when running large models, using Qwen3.6-27B-UD-Q5_K_XL-mtp with a 150k context as the example. Key flags include `--no-mmproj-offload`, which keeps the multimodal projector off the GPU; `--cache-type-k` and `--cache-type-v`, which quantize the KV cache and reduce its memory allocation by up to 75%; and `--spec-draft-n-max`, which controls speculative drafting of future tokens and balances memory usage against throughput. These options help practitioners maximize context size and performance on GPUs with limited VRAM.
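
The "up to 75%" figure for KV cache quantization can be sanity-checked from the storage cost of each cache type. The sketch below uses the ggml block layouts for `q8_0` (34 bytes per 32 values) and `q4_0` (18 bytes per 32 values) against 2 bytes per value for `f16`. The model dimensions in the usage example are hypothetical placeholders, not the real Qwen3.6-27B configuration.

```python
# Bytes per stored value for common llama.cpp KV cache types.
BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values plus one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values plus one fp16 scale per block
}


def reduction_vs_f16(cache_type: str) -> float:
    """Fraction of KV cache memory saved relative to f16."""
    return 1 - BYTES_PER_VALUE[cache_type] / BYTES_PER_VALUE["f16"]


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, type_k: str, type_v: str) -> float:
    """KV cache size in GiB for a plain attention stack (an assumption)."""
    values_per_token = n_layers * n_kv_heads * head_dim
    bytes_total = n_ctx * values_per_token * (
        BYTES_PER_VALUE[type_k] + BYTES_PER_VALUE[type_v]
    )
    return bytes_total / 2**30


for cache_type in ("q8_0", "q4_0"):
    print(f"{cache_type}: {reduction_vs_f16(cache_type):.1%} smaller than f16")

# Hypothetical dimensions, for illustration only.
dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=150_000)
print(f"f16:  {kv_cache_gib(**dims, type_k='f16', type_v='f16'):.1f} GiB")
print(f"q4_0: {kv_cache_gib(**dims, type_k='q4_0', type_v='q4_0'):.1f} GiB")
```

With these layouts, `q8_0` saves about 47% and `q4_0` about 72%, so the quoted 75% is best read as an approximate upper bound for 4-bit cache types.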

Reddit r/LocalLLaMA · 39 d ago · found 29 d ago · #llama.cpp #gpu-optimization
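
To see where a KV-cache saving of the size quoted in the llama.cpp entry above can come from, here is a rough back-of-the-envelope estimate. The bits-per-element values follow the ggml block layouts (`q8_0` stores 32 values in 34 bytes, `q4_0` stores 32 values in 18 bytes). The model dimensions in the usage lines are hypothetical placeholders, not the real configuration of the model named in the post, and the formula assumes every layer keeps a full attention KV cache.

```python
# Effective storage cost per cached element for a few llama.cpp cache types.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Approximate KV-cache size in bytes for a dense attention model."""
    elems_per_tensor = n_ctx * n_layers * n_kv_heads * head_dim  # K or V alone
    bits = BITS_PER_ELEMENT[type_k] + BITS_PER_ELEMENT[type_v]
    return elems_per_tensor * bits / 8


def saving_vs_f16(cache_type):
    """Fraction of KV-cache memory saved relative to f16."""
    return 1.0 - BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]


# Hypothetical dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_ctx=150_000, n_layers=64, n_kv_heads=8, head_dim=128)
f16_gib = kv_cache_bytes(**dims) / 2**30
q4_gib = kv_cache_bytes(**dims, type_k="q4_0", type_v="q4_0") / 2**30
print(f"f16: {f16_gib:.1f} GiB, q4_0: {q4_gib:.1f} GiB, "
      f"saved: {saving_vs_f16('q4_0'):.1%}")
```

Under these assumptions `q4_0` for both K and V saves roughly 72% of the cache, in the neighborhood of the "up to 75%" figure; `q8_0` saves a little under half.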

S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices

The paper presents a novel operator-level pruning method for Structured State Space Models (SSMs), specifically targeting the S4 and S4D architectures, to enhance their deployment in resource-constrained environments. This approach allows for the pruning of up to 70% of model operators while maintaining predictive performance, achieved through a combination of structured masking and fine-tuning within a unified training framework. The findings indicate that this method effectively reduces inference latency, making SSMs more viable for practical applications where computational resources are limited.

arXiv cs.AI · 40 d ago · found 28 d ago · #state-space-models #pruning #resource-constraints

Prefill/Decode-Aware Evaluation of LLM Inference on Emerging AI Accelerators

This paper evaluates the inference performance of large language models (LLMs), specifically Llama2-7B, across GPUs and emerging AI accelerators by analyzing the Prefill and Decode phases separately. The study finds that GPUs perform best in the compute-intensive Prefill phase, while GroqRack achieves lower time per output token in the Decode phase, although it lacks batching support. These results show that performance depends on the phase, which should guide hardware selection for specific LLM workloads.

arXiv cs.AI · 40 d ago · found 28 d ago · #llm #inference #performance #evaluation

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

LoopCoder-v2 introduces a family of 7-billion-parameter Parallel Loop Transformers (PLT) designed to optimize test-time computation using cross-loop position offsets (CLP) and shared-KV gated sliding-window attention. The model was trained on 18 trillion tokens and improved across various benchmarks, notably raising the SWE-bench Verified score from 43.0 to 64.4 with a two-loop configuration, while higher loop counts showed diminishing returns due to positional mismatches. The work clarifies the trade-offs in choosing a loop count, which matters for practitioners balancing computational efficiency against model performance.

arXiv cs.AI · 40 d ago · found 28 d ago · #transformers #computation

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

SketchXplain is a novel approach for generating intuitive visual explanations of image classifiers by utilizing artistic sketches. It integrates saliency maps, concept-bottleneck models, and sketch optimization to produce coherent and simplified representations of image data. Evaluations on tasks such as face expression recognition and skin lesion diagnosis demonstrate that SketchXplain facilitates quicker interpretation and more aligned visualizations compared to traditional methods, highlighting its potential to improve explainability in AI applications.

arXiv cs.AI · 40 d ago · found 28 d ago · #explainable ai #visualization

ANEForge: Python for direct computation on the Apple Neural Engine

ANEForge is a new Python package that enables direct programming of the Apple Neural Engine (ANE) without relying on CoreML, providing a more efficient pathway for utilizing the ANE's capabilities. It compiles a lazy tensor graph from 58 fused operators and 19 native bridge operators into a single ANE program, achieving low latency with a small fused program completing in approximately 90 microseconds. This tool is significant for practitioners as it allows for advanced model training and inference on Apple Silicon devices, including support for int8, int4, and sparse weights, making it a valuable resource for optimizing AI workloads on macOS 14 and later.

arXiv cs.AI · 40 d ago · found 28 d ago · #apple #neural #engine #python

MIVE: A Minimalist Integer Vector Engine for Softmax LayerNorm and RMSNorm Acceleration

The article introduces the Minimalist Integer Vector Engine (MIVE), a programmable architecture designed to accelerate non-linear vector normalization operations such as LayerNorm, RMSNorm, and Softmax within a unified datapath. By leveraging common computational patterns, MIVE enhances hardware sharing and reduces implementation overhead, resulting in higher area and hardware efficiency compared to existing standalone accelerators. This development is significant for practitioners as it addresses critical bottlenecks in LLM inference, offering a more efficient solution for specialized hardware accelerators.

arXiv cs.AI · 40 d ago · found 28 d ago · #hardware acceleration #softmax #layernorm

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

The article presents a quantization-aware training (QAT) approach for State Space Models (SSMs), applied to Mamba-2: a 1.3B-parameter model is compressed 3.61x to 744 MB while remaining competitive, with 48.1% zero-shot accuracy after training on just 102M tokens. The method uses grouped QAT with knowledge distillation from a frozen FP16 teacher, which significantly reduces the required training data and time. It also introduces the concept of zero-ratio collapse, a challenge unique to learnable quantization scales in SSMs. This matters for practitioners because it allows large models to be deployed efficiently on edge devices without extensive training from scratch.

arXiv cs.AI · 40 d ago · found 28 d ago · #quantization #state-space-models #resource-constraints
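
As background for the "W1.58" label in the Ternary Mamba entry above: ternary weights take values in {-1, 0, +1}, which costs log2(3), or about 1.58, bits each. The sketch below shows group-wise ternary quantization with a fixed absmean scale per group, in the style popularized by BitNet b1.58. It is a stand-in for illustration only: the paper summarized above uses learnable scales inside QAT with distillation, which this sketch does not implement, and all function names are hypothetical.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)  # about 1.58, hence "W1.58"


def quantize_group(weights, eps=1e-8):
    """Map one group of weights to codes in {-1, 0, +1} plus a shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # absmean scale
    scale = max(scale, eps)  # guard against an all-zero group
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_grouped(weights, group_size):
    """Quantize a flat weight list in consecutive groups of `group_size`."""
    groups = [weights[i:i + group_size]
              for i in range(0, len(weights), group_size)]
    return [quantize_group(g) for g in groups]


def dequantize(quantized):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [c * scale for codes, scale in quantized for c in codes]


weights = [0.9, -0.1, 0.4, -1.2, 0.02, 0.0, -0.03, 0.01]
quantized = quantize_grouped(weights, group_size=4)
for codes, scale in quantized:
    print(codes, round(scale, 4))
```

Giving each group its own scale lets the small-magnitude second group keep non-zero codes; with a single tensor-wide scale those weights would all round to zero.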

LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models

LLMCodec is a compression method for large language models (LLMs) that uses video codecs, specifically VVC/H.266, to compress model weights without relying on fine-tuning or calibration data. The method integrates affine quantization and reports an over 1.5x reduction in perplexity and a 21% increase in downstream task accuracy on the LLaMA-3-8B model at 2-bit precision. The approach offers a generalizable option for practitioners facing model storage and deployment constraints, improving efficiency while maintaining performance.

arXiv cs.AI · 40 d ago · found 25 d ago · #compression #llm #video-codecs

How Inference Compute Shapes Frontier LLM Evaluation

The paper evaluates 12 frontier language models across seven challenging benchmarks, highlighting the impact of inference compute on performance. Key findings indicate that larger token budgets significantly enhance model performance, while fixed-budget evaluations may underestimate model capabilities as they advance. The study suggests that evaluations should report performance as a function of inference compute and clarify protocol choices to better reflect model capabilities, particularly in critical applications.

arXiv cs.AI · 40 d ago · found 29 d ago · #evaluation #compute #language models

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

The paper introduces LowFormer, a new family of vision backbone networks designed to enhance efficiency in computer vision tasks. It critiques the reliance on MACs as a performance metric, demonstrating that LowFormer, which incorporates a lightweight design feature called Lowtention, achieves superior execution speed and accuracy on benchmarks like ImageNet. This architecture is particularly beneficial for practitioners working on edge devices, as it offers significant speed improvements across various hardware platforms and is adaptable for multiple downstream applications such as object detection and semantic segmentation.

arXiv cs.AI · 40 d ago · found 25 d ago · #architecture design #vision backbones

RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models

The paper presents RLRC, a novel three-stage compression and recovery pipeline for Vision-Language-Action (VLA) models, which employs structured pruning, supervised fine-tuning (SFT), and reinforcement learning (RL) for performance recovery. RLRC achieves up to an 8x reduction in memory usage and a 2.3x speedup in inference without compromising task success rates, outperforming existing compression techniques across multiple VLA architectures. This method is significant for practitioners aiming to deploy VLA models on resource-constrained devices, enhancing their efficiency and practicality in real-world applications.

arXiv cs.AI · 40 d ago · found 28 d ago · #vision-language-action #model compression

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

The article introduces Top-Theta (Top-$\theta$) Attention, a method for sparsifying transformer attention during inference without retraining, utilizing calibrated per-head thresholds to maintain a constant number of significant elements per attention row. This approach achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements, with a maximum accuracy degradation of 1% across various natural language processing tasks. This technique offers a practical alternative to traditional top-k attention, making it significant for practitioners aiming to optimize resource usage in transformer models.

arXiv cs.AI · 40 d ago · found 28 d ago · #transformers #sparsity #attention
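
The core idea in the Top-Theta entry above, replacing a per-row top-k selection with a comparison against a precalibrated threshold, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the calibration rule (mean of the k-th largest score over sample rows) is one plausible choice, the compensation steps implied by the paper's title are omitted, and the function names are hypothetical.

```python
import math


def calibrate_theta(sample_rows, k):
    """Pick a threshold so that roughly k scores per row survive on average.

    Uses the mean of the k-th largest pre-softmax score across sample rows.
    """
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in sample_rows]
    return sum(kth_largest) / len(kth_largest)


def threshold_attention_row(scores, theta):
    """Softmax over only the scores >= theta; all other weights become 0."""
    kept = [i for i, s in enumerate(scores) if s >= theta]
    if not kept:  # keep the single largest score so the row stays valid
        kept = [max(range(len(scores)), key=scores.__getitem__)]
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    weights = [0.0] * len(scores)
    for i, e in exps.items():
        weights[i] = e / total
    return weights, kept


calibration_rows = [[4.0, 1.0, 0.0, 3.0], [2.0, 5.0, 1.0, 0.5]]
theta = calibrate_theta(calibration_rows, k=2)
weights, kept = threshold_attention_row([4.0, 1.0, 0.0, 3.0], theta)
print(theta, kept, [round(w, 3) for w in weights])
```

Unlike top-k, the thresholded version needs no per-row sort or selection at inference time: each score is tested independently, and only the V rows at the kept indices have to be read.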

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench is a newly introduced dataset designed to evaluate how language models manage off-procedure inputs in grounded diagnostic dialogues, consisting of 50 industrial diagnostic flowcharts and 1,676 multi-turn conversations. The evaluation of ten commercial and open-weight models indicates significant variability in their abstention rates, with a tendency to select contextually inadequate steps rather than generating false information. This highlights a critical vulnerability in grounding systems, emphasizing the need for improved handling of out-of-scope queries in practical applications.

arXiv cs.AI · 40 d ago · found 29 d ago · #language models #diagnostic dialogue #evaluation

Towards Distributed Inference of LLMs on a P2P Network

This article presents a decentralized, prefix-cache-aware routing scheme for peer-to-peer (P2P) serving of large language models (LLMs), addressing the challenge of KV caches partitioned across nodes. Each node keeps a local radix tree of cached prefixes and updates its estimates of peer caches asynchronously, which reduces inference latency without central coordination. Evaluation on simulated MMLU workloads indicates that the approach reduces latency under favorable conditions, although network latency and distribution skew can diminish its advantages, both of which practitioners should weigh in distributed LLM deployments.

arXiv cs.AI · 40 d ago · found 29 d ago · #llm #decentralized #routing #cache
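
The routing decision described in the P2P entry above can be illustrated with a toy example: each node keeps an estimate of which token prefixes its peers have cached and forwards a request to the peer with the longest match. This sketch uses a plain (uncompressed) trie in place of a radix tree, ignores load, network latency, and the asynchronous update protocol, and every name in it is hypothetical.

```python
class PrefixTrie:
    """Toy stand-in for a radix tree over cached token-id prefixes."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record that the KV cache for this token sequence is available."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_match(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


def pick_peer(peer_estimates, tokens):
    """Choose the peer whose estimated cache covers the longest prefix.

    Ties are broken by peer name so the choice is deterministic.
    """
    return max(sorted(peer_estimates),
               key=lambda peer: peer_estimates[peer].longest_match(tokens))


estimates = {"peer_a": PrefixTrie(), "peer_b": PrefixTrie()}
estimates["peer_a"].insert([1, 2, 3])
estimates["peer_b"].insert([1, 2, 3, 4, 5])

print(pick_peer(estimates, [1, 2, 3, 4, 9]))  # peer_b reuses 4 tokens, peer_a 3
```

Because the estimates are only refreshed asynchronously, a real router can act on stale information; that is one way the network-latency and skew effects mentioned in the summary can erode the benefit.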

A Neuromorphic Trigger for Efficient Audio Event Detection

This paper presents a neuromorphic trigger for audio event detection utilizing a spiking neural network (SNN) to efficiently process continuous audio streams. The lightweight SNN selectively gates input to downstream models, achieving a one-second segment-based F1 score of 0.97 for anomalous sound detection on the URBAN-SED dataset and demonstrating a 42.6× reduction in FLOPs for sound event detection when combined with the Dang classifier on the DCASE 2017 Challenge Task 2 dataset. This approach significantly enhances real-time processing efficiency and reduces computational costs, making it relevant for practitioners developing resource-constrained audio analysis systems.

arXiv cs.AI · 40 d ago · found 28 d ago · #audio event detection #neuromorphic trigger