Coding — AI news — AI News Digest

I reverse engineered Windows Copilot into a free OpenAI compatible API (GPT-4, no API key, no billing)

A developer has reverse-engineered the Windows Copilot to create an unofficial API that allows users to access GPT-4 without an API key or billing, utilizing their own Microsoft account. The setup exposes a local server at `http://localhost:8000/v1`, enabling compatibility with the OpenAI SDK for streaming and multi-turn conversations, making it a cost-effective solution for lightweight AI workloads and automation. This project offers practitioners a way to leverage GPT-4 capabilities for personal and educational use without incurring costs associated with standard API access.

Reddit r/LocalLLaMA32 d agofound 12 d ago#openai#api#gpt-4

Has anyone else found vLLM outputs noticeably worse than llama.cpp for the same model?

A user has reported discrepancies in output quality between vLLM and llama.cpp while testing the same model under similar settings and quantizations. Although vLLM demonstrates superior performance and concurrency, it exhibits issues such as formatting errors, context retention failures, and lower quality code outputs. This raises questions about the impact of quantization, configuration, and template parsing on inference quality, which is critical for practitioners optimizing model deployment and performance.

Reddit r/LocalLLaMA32 d agofound 12 d ago#vllm#llama.cpp#comparison

llama.cpp's web UI now supports executing model generated JavaScript in the browser, through Web Workers (opt in)

The recent update to llama.cpp's web UI includes a new `run_javascript` tool that enables the execution of model-generated JavaScript within the browser using Web Workers. This feature operates in a sandboxed iframe, providing security guarantees, though it currently restricts network requests and lacks clear documentation on sandbox limitations. This enhancement allows practitioners to leverage language models for lightweight code execution directly in the UI, potentially reducing the need for external tools.

Reddit r/LocalLLaMA32 d agofound 12 d ago#llama.cpp#javascript

16 Best Generative AI Coding Tools in 2026 Compared: Features, and Best Fit

The article discusses the evolution of generative AI coding tools by 2026, highlighting their capabilities in full application generation and multi-agent build pipelines. It emphasizes the use of large language models trained on code, which can understand context and intent to produce functional software components with minimal manual input. This advancement is significant for practitioners as it streamlines the software development process, enabling faster and more efficient coding practices.

MarkTechPost33 d agofound 12 d ago#generative_ai#coding_tools

Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules

The article introduces Rule2Text, a framework that utilizes large language models (LLMs) to generate natural language explanations for complex logical rules derived from knowledge graphs (KGs). Extensive experiments were conducted using datasets like Freebase variants and ogbl-biokg, employing models such as Gemini 2.0 Flash and the open-source Zephyr model, which was fine-tuned for improved explanation quality. This framework enhances KG usability by providing interpretable outputs, making it valuable for practitioners aiming to improve human understanding of KGs through LLM-generated explanations.

arXiv cs.AI33 d agofound 10 d ago#knowledge_graphs#explanations#LLM

AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach

The paper introduces AI-PAVE-Br, a specialized system utilizing Large Language Models (LLMs) for high-accuracy Product Attribute Value Extraction (PAVE) tailored to the Brazilian e-commerce sector. It also presents the Golden Set, a curated dataset with annotated product attributes in Portuguese, which serves as a benchmark for PAVE research. The results demonstrate that AI-PAVE-Br, through targeted prompt engineering, significantly surpasses traditional Named Entity Recognition (NER) methods, providing a scalable solution for non-English markets and contributing valuable resources for NLP research.

arXiv cs.AI33 d agofound 10 d ago#llm#e-commerce#data-extraction

VeriPilot: An LLM-Powered Verilog Debugging Framework

VeriPilot is a newly proposed LLM-powered framework designed to enhance Verilog debugging by utilizing golden reference models for effective bug localization and repair. It employs Control-Data-Flow Graphs (CDFGs) derived from static analysis to facilitate step-by-step signal tracing, significantly improving the bug repair success rate of GPT-4o from 54.3% to 85.71% on the Comprehensive Verilog Design Problems (CVDP) benchmark. This advancement addresses the challenge of tracing long dependency chains in complex codebases, making it a valuable tool for practitioners in digital circuit design.

arXiv cs.AI33 d agofound 10 d ago#llm#verilog#debugging

Accuracy and Satisfaction in Multi-Turn LLM Dialogues for NFR Assessment

The paper presents an evaluation of LLM-based dialogue systems, specifically GitHub Copilot, in the context of assessing Non-Functional Requirements (NFRs) related to HIPAA compliance. It identifies limitations in current benchmarks that focus on functional correctness, proposing new methods to evaluate multi-turn interactions based on requirement satisfaction, reasoning, and code localization. The study reveals a discrepancy between developer agreement with LLM outputs and low accuracy against expert assessments, highlighting the need for improved designs in LLM dialogue systems to enhance satisfaction and effectiveness in collaborative reasoning.

arXiv cs.AI33 d agofound 10 d ago#llm#dialogue#nfr assessment

Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories

A multi-layered detection framework has been developed to analyze AI coding agents across 180 million Git repositories, revealing significant insights into their prevalence and activity. The study identified 850,157 commits attributed to Claude Code, with a notable detection gap where traditional methods underestimated agent presence by a factor of 30. This research highlights the limitations of single-method detection approaches and underscores the importance of multi-method strategies for accurately understanding AI agent contributions in open-source projects, as different detection channels capture distinct populations and types of work.

arXiv cs.AI33 d agofound 10 d ago#generative ai#open source#detection

SemChunk-C: Semantic Segmentation for C Code

The paper introduces SemChunk-C, a family of lightweight language models designed for semantic segmentation of C-related code, utilizing four Ettin encoders with parameter sizes of 17M, 32M, 68M, and 150M. The models effectively identify chunk boundaries and assign functional attributes, achieving high accuracy and semantic coherence on real-world code, including complex constructs like nested definitions and macros. This advancement is significant for practitioners as it enhances code retrieval and other downstream tasks by providing more meaningful functional units compared to existing methods.

arXiv cs.AI33 d agofound 10 d ago#semantic segmentation#code chunking#llm

IV-CoT: Implicit Visual Chain-of-Thought for Structure-Aware Text-to-Image Generation

The article introduces Implicit Visual Chain-of-Thought (IV-CoT), a novel framework designed to enhance structure-aware text-to-image generation by decomposing visual conditioning into a structural-to-semantic cascade. This method utilizes training-only sketch supervision to guide structural queries, enabling the generation of a latent visual plan that informs the rendering of appearance, thus improving performance on benchmarks like GenEval and T2I-CompBench. IV-CoT's architecture allows for implicit chain-of-thought reasoning in a single forward pass, making it a significant advancement for practitioners focused on precise object and layout representation in generated images.

arXiv cs.AI33 d agofound 10 d ago#llm#questions#unstructured-data

JupOtter: Cell-Level Bug Detection in Jupyter Notebooks

JupOtter is a newly introduced bug detection system tailored for Jupyter Notebooks, featuring a specialized tokenization strategy that maintains cell structure and a cell-level bug prediction technique. It utilizes the OtterDataset, which includes over 21,000 annotated notebooks for fine-grained bug detection, achieving F1 scores that outperform both static analyzers and large language models in two out of three benchmark datasets. This tool is significant for practitioners as it enhances the reliability of complex notebook-based applications by enabling more effective identification of bugs at the cell level.

arXiv cs.AI33 d agofound 10 d ago#jupyter_notebooks#bug_detection#ai_tools

Navigating User Behavior toward Personalized Multimodal Generation

The paper introduces NaviGen, a novel approach for personalized multimodal content generation that enhances alignment between user intent and generated outputs. It utilizes a dual identifier system combining collaborative and textual codes to encode user behavior, followed by a two-stage training pipeline of supervised fine-tuning (SFT) and reinforcement learning (RL) to improve instruction writing and preference reasoning. Experimental results demonstrate that NaviGen significantly enhances the quality of personalized image and video generation, making it a valuable tool for practitioners seeking to refine user interaction in AI-generated content.

arXiv cs.AI33 d agofound 12 d ago#personalized generation#AIGC#user behavior

How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python

The article presents a tutorial on utilizing NVIDIA's Canary-1B-v2 model for automatic speech recognition (ASR) and translation tasks in Python. It details the process of preparing audio data at 16 kHz mono, performing ASR in English, translating into multiple languages (French, German, Spanish, Italian), and exporting subtitles in SRT format, while benchmarking inference speed for performance evaluation. This resource is significant for practitioners looking to implement multilingual ASR and translation capabilities efficiently using state-of-the-art models.

MarkTechPost33 d agofound 21 d ago#nvidia#asr#translation

Fika Jobs raises $4M to build a video-first hiring platform where AI agents interview candidates

Fika Jobs has secured $4 million in funding to develop a video-first hiring platform that integrates AI interview agents with short-form video profiles. This platform aims to streamline the hiring process by leveraging AI to conduct interviews, potentially enhancing candidate assessment and engagement. The approach combines elements of social media with professional networking, which may offer novel ways for practitioners to evaluate candidates using AI-driven insights.

TechCrunch AI33 d agofound 21 d ago#fika jobs#ai agents#hiring platform

Unlimited OCR Works

The article introduces Unlimited OCR, a novel model that enhances OCR performance by utilizing a Reference Sliding Window Attention (R-SWA) mechanism to maintain a constant KV cache during decoding, thereby reducing memory consumption and improving efficiency for long sequences. This model, based on DeepSeek OCR, can transcribe up to 32K tokens in a single pass, making it suitable for extensive document processing. R-SWA's general applicability extends beyond OCR to other tasks like ASR and translation, providing a significant advancement for practitioners in the field.

arXiv cs.CL34 d agofound 13 d ago#ocr#llm

Enhancing Diversity of LLM-Generated Educational Tasks

The paper presents CreativeDC, a prompting framework designed to enhance the diversity of educational tasks generated by large language models (LLMs) while maintaining high utility. By employing a two-stage reasoning process inspired by creativity literature, the method was evaluated in Python programming, yielding a 1.6x increase in distinct high-utility tasks compared to existing baselines. This advancement is significant for practitioners seeking to leverage LLMs for educational content creation, as it addresses the "Artificial Hivemind" effect that leads to homogeneous outputs.

arXiv cs.AI34 d agofound 15 d ago#educational tasks#llm#diversity

CodeTeam: An LLM-Powered Multi-Agent Framework for Repository-Level Code Generation

CodeTeam is an LLM-based multi-agent framework designed for natural language to repository generation (NL2Repo), which separates the software development process into planning, decision-making, and implementation stages. It employs multiple Architect agents for drafting software design sketches, a CTO agent for evaluation and normalization, and Developer agents for code generation, achieving significant improvements on the SketchEval benchmark with an increase of 4.1 and 2.9 points in SketchBLEU for prompt-engineering and supervised fine-tuning variants, respectively. The framework also demonstrates a high average test pass rate of 34.6% and 42.3% on the NL2Repo-Bench benchmark, highlighting its effectiveness in producing functional code and its relevance for practitioners in automating repository-level code generation.

arXiv cs.AI34 d agofound 16 d ago#code generation#multi-agent#llm

Is Agent Code Less Maintainable Than Human Code?

The paper presents CodeThread, a framework for evaluating the maintainability of code produced by coding agents compared to human-written code. Experiments with four coding agents revealed that agent-generated code resulted in a task resolve rate decrease of up to 13.1% when future agents attempted to build upon it, highlighting significant behavioral differences in input validation and error handling. This underscores the importance of assessing maintainability in AI-generated code, as traditional metrics may not capture the underlying issues that lead to increased downstream errors.

arXiv cs.AI34 d agofound 16 d ago#maintainability#agent-code#software-engineering

QAMO: Quality-aware Multi-centroid One-class Learning For Speech Deepfake Detection

The paper introduces QAMO, a Quality-Aware Multi-Centroid One-Class Learning approach for detecting speech deepfakes, which enhances traditional one-class learning by incorporating multiple centroids that represent distinct quality subspaces of bona fide speech. By utilizing a multi-centroid ensemble scoring strategy, QAMO achieves an equal error rate of 5.09% on the In-the-Wild dataset, surpassing previous models. This method is significant for practitioners as it improves the detection of deepfakes by accounting for intra-class variability in speech quality, reducing reliance on quality labels during inference.

arXiv cs.AI34 d agofound 14 d ago#deepfake-detection#speech

Context-Aware Distillation and Ablation for Text2DSL

The article presents advancements in the Text2DSL framework for generating domain-specific language (DSL) code from natural language by implementing context-aware distillation using the DeepSeek-V4-Flash model. This approach enhances the generation process through a structured context defined by BNF grammar, API specifications, and a closed identifier vocabulary, resulting in a significant increase in the PolkitBench corpus from 4,204 to 10,073 valid natural-language-to-Polkit-rule pairs with 100% AST validity and a 99.7% runtime pass rate. The findings underscore the importance of structured context in improving model performance, particularly emphasizing the critical role of vocabulary in enhancing semantic quality and the structural validity of the generated code.

arXiv cs.AI34 d agofound 15 d ago#text2dsl#distillation#code

EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors

EPSVec is a novel method for generating synthetic data that utilizes dataset vectors to enhance the efficiency and privacy of large language model (LLM) generation. By decoupling the privacy budget from the generation process, EPSVec allows for the creation of multiple synthetic samples without incurring additional privacy costs, achieving high fidelity even with limited data. The approach demonstrates superior performance in distributional alignment and downstream utility compared to existing methods, while also reducing computational demands, making it a valuable tool for practitioners working with sensitive datasets.

arXiv cs.AI34 d agofound 14 d ago#synthetic data#differential privacy#llm

TACO: Task-Aware Column Description Generation Using LLMs

TACO (Task-Aware Column Description Generation) is a novel framework designed to generate accurate column descriptions for tabular data, addressing common issues found in existing LLM approaches. It utilizes a three-step pipeline comprising abbreviation expansion, initial description generation enriched with synonyms, and a revision phase that refines outputs through simulated downstream tasks. Experimental results indicate that TACO enhances downstream task performance by up to 32% compared to prior methods, making it a significant advancement for practitioners working with tabular data in NLP applications.

arXiv cs.AI34 d agofound 16 d ago#column-description#llms#nlp

Confident and Wrong: Silent Semantic Failures in Coding Agents

The study introduces the concept of "silent semantic failure" in coding agents, revealing that models like GPT-5 and Llama 4 exhibit high submission rates but low resolution rates for software engineering tasks. Specifically, GPT-5 submits patches 100% of the time but only resolves 44% of tasks, while Llama 4 resolves 18% of tasks despite a 99% submission rate. This highlights a critical gap in evaluating AI models, as current metrics based on submission rates do not accurately reflect their trustworthiness; thus, the authors advocate for new evaluation criteria that prioritize test-verified correctness and the ability to recognize when no action is warranted.

arXiv cs.AI34 d agofound 14 d ago#coding agents#trustworthiness#semantic failures

From Fragments to Paths: Task-Level Context Recovery for Large Industrial Codebases

DeepDiscovery is a new task-level repository-understanding method designed for large industrial codebases, utilizing a two-stage Location-Inference framework to enhance context recovery for complex software engineering tasks. It demonstrates superior performance in file recovery and downstream software engineering tasks, achieving a 78.6% Solve Rate in controlled evaluations and improving Full Recall Rate by up to 9.2 percentage points on large subprojects compared to existing baselines. This advancement is significant for practitioners as it enables more effective coding agents in navigating and understanding extensive codebases without the need for offline preprocessing.

arXiv cs.AI34 d agofound 15 d ago#llm#code#repository#software-engineering

Learning Bug Context for PyTorch-to-JAX Translation with LLMs

The article introduces T2J, a benchmark designed to address translation bugs in converting PyTorch code to JAX using large language models (LLMs). It consists of 20 kernels from the TorchLeet dataset, which were translated by the weak LLM gpt-4o-mini and subsequently debugged by software developers. The T2J benchmark was shown to enhance the translation quality, achieving up to a 20% improvement in the T2J-CodeTrans-Score, highlighting its potential utility for practitioners working on code translation tasks between these frameworks.

arXiv cs.AI34 d agofound 14 d ago#llm#code-translation#pytorch#jax

Large Language Model-Assisted Cleaning of Report-Derived Labels in a Large-Scale Chest CT Dataset

The study evaluates the use of GPT-5.4 for cleaning labels in the CT-RATE chest CT dataset, comprising 24,434 reports and 439,812 label instances across 18 categories. The model achieved a 96.4% agreement rate with existing labels, with a Cohen's kappa of 0.884, indicating high reliability, particularly in identifying discordant labels, which were validated by radiologists. This approach demonstrates the potential of LLMs in enhancing the quality of public imaging datasets, with the cleaned dataset set for public release to aid future research.

arXiv cs.AI34 d agofound 14 d ago#llm#label-cleaning#chest-ct

Reinforcement learning to improve large language model-based automated code compliance systems

The paper introduces P4IR, a two-stage framework designed to enhance the accuracy of large language model (LLM)-based automated code compliance systems. It employs supervised fine-tuning (SFT) to integrate domain knowledge, followed by Group Relative Policy Optimization (GRPO) to refine the generated code skeletons, achieving reductions of up to 23.8% in tree edit distance and 38.6% in token-level Levenshtein distance compared to SFT baselines. This approach demonstrates superior performance over leading LLMs like Claude Opus and GPT-5.2 in zero-shot settings, indicating its potential to improve the reliability of LLMs in generating accurate code representations for compliance tasks.

arXiv cs.AI34 d agofound 14 d ago#reinforcement-learning#code-compliance#llm

SCENIC: Semantic-Conditioned Edge-Aware Neural Framework for Structured IoT Command Generation

The paper introduces SCENIC, a Semantic-Conditioned Edge-Aware Neural Framework designed for structured command generation in edge IoT environments. It utilizes sub-0.2B-scale transformer backbones, achieving a 99.0% exact match rate on the Smart Home Instruct-Bench with a pruned INT8 encoder-decoder model that reduces size by 25.38% while maintaining 91.0% EM@1 accuracy. This framework is significant for practitioners as it enables efficient deployment of language models on resource-constrained edge devices, enhancing smart-home command processing while addressing memory and latency challenges.

arXiv cs.AI34 d agofound 14 d ago#iot#command-generation#neural-networks

Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques

The paper introduces Sarc7, a benchmark for classifying seven types of sarcasm—self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic—using the MUStARD dataset. Classification was tested with various techniques, including zero-shot, few-shot, chain-of-thought, and a novel emotion-based prompting method, with Gemini 2.5 achieving the highest performance at an F1 score of 0.3664. This work is significant for AI practitioners as it enhances sarcasm detection and generation capabilities in large language models, addressing the complexities of human communication.

arXiv cs.AI34 d agofound 14 d ago#sarcasm detection#language models#benchmark

A Dual-Track Framework for Template-Constrained LaTeX Conversion

The article presents a Dual-Track Framework for converting Markdown drafts into LaTeX, addressing limitations of existing deterministic and end-to-end LLM approaches. This framework separates template formatting from document processing, utilizing an offline track for template constraints and an online hybrid pipeline that combines LLMs for reasoning tasks with rule-based engines for deterministic processing. Empirical evaluations show that this method improves structural fidelity and compilation success rates across multiple LaTeX templates, making it a significant advancement for practitioners in document conversion tasks.

arXiv cs.CL34 d agofound 13 d ago#latex#conversion#template

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

SQLConductor introduces a novel step-wise orchestration learning framework for Text-to-SQL, utilizing a policy model to dynamically select actions based on intermediate artifacts and feedback. It employs Search-to-Policy Learning via Monte Carlo Tree Search and Stability-weighted Supervised Fine-tuning, achieving a 73.2% execution accuracy on the BIRD-Dev dataset while coordinating larger frozen action models. This approach enhances adaptability in real-world database interactions, offering significant improvements over traditional fixed pipelines and prior methods, making it valuable for practitioners developing flexible AI-driven database query systems.

arXiv cs.AI34 d agofound 15 d ago#text-to-sql#policy#learning

Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

The study introduces a structural codebase index within a coding agent framework utilizing Claude Opus 4.7, demonstrating significant improvements in localization and resolution without incurring additional costs. The experimental setup compared three configurations: with the index, without it, and against an agentic-grep baseline, confirming that the index yields better performance metrics while maintaining lower costs per solved instance. This advancement is crucial for practitioners as it highlights the potential for enhanced code retrieval efficiency in multi-file change scenarios, optimizing coding agent operations.

arXiv cs.AI34 d agofound 20 d ago#coding agent#indexing#LLM

Video2Code: Generating Interactive Webpages from UI Videos via Action-Aware Revisit

Video2Code is a novel approach for generating interactive webpages from UI videos by addressing the challenge of state-transition misalignment in existing models. It employs action-aware techniques to identify critical action regions in videos, allowing for higher temporal resolution analysis before generating HTML/CSS/JavaScript code. This method significantly enhances functional correctness in multi-step interactions, making it a valuable tool for practitioners focused on UI automation and interactive web development.

arXiv cs.AI34 d agofound 16 d ago#video-to-code#UI#state-transition

AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming

AlgoSimBench is a newly introduced benchmark consisting of 402 multiple-choice questions designed to evaluate the ability of Large Language Models (LLMs) to identify algorithmically similar problems (ASPs) in competitive programming. The benchmark's unique setup pairs each reference problem with one ASP and three distractors, promoting reliance on algorithmic reasoning over superficial cues. The evaluation reveals that LLMs struggle with this task, but the proposed Attempted Solution Matching (ASM) technique, which assesses similarity based on LLM-generated solutions, improves accuracy by 9%, and when combined with BM25, achieves an additional 11.8% gain over existing embedding models. This benchmark is significant for advancing research on LLM capabilities and retrieval methods in algorithmic contexts.

arXiv cs.CL34 d agofound 13 d ago#llm#benchmark#competitive programming

CAOA -- Completion-Assisted Object-CAD Alignment

Completion-Assisted Object-CAD Alignment (CAOA) is introduced as a novel method for aligning CAD models with indoor RGB-D scans, addressing challenges posed by noise and segmentation errors. It combines a point cloud completion module with a symmetry-aware pose estimation algorithm, leveraging a newly developed synthetic data generation strategy tailored for indoor scenes to enhance real-world applicability. CAOA demonstrates a 17% accuracy improvement on the Scan2CAD benchmark and is supported by the release of S2C-Completion, a dataset of over 8,500 annotated object-CAD pairs, setting a new standard for alignment tasks in 3D semantic reconstruction.

arXiv cs.AI34 d agofound 13 d ago#CAD#alignment#3D reconstruction

Automated Semantic Fault Localization in SysML v2: A Human-in-the-Loop Framework Using Knowledge-Graph Augmented LLMs

The paper introduces a human-in-the-loop framework for automated semantic fault localization in SysML v2, integrating a fine-tuned Small Language Model (SLM) with a domain knowledge graph to identify and repair semantic errors that syntactically valid but violate domain rules. Specifically, the framework utilizes two models, Qwen2.5-Coder-1.5B and DeepSeek-Coder-6.7B, achieving a significant improvement in fault repair effectiveness from under 3% to over 91% on 1,184 test samples, while also reducing output token length by over 60%. This approach enhances model-based systems engineering (MBSE) tools by providing AI-assisted verification capabilities that maintain human oversight in the design process.

arXiv cs.AI34 d agofound 15 d ago#semantic fault localization#knowledge graph

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

RigorBench is introduced as the first benchmark specifically designed to assess process discipline in autonomous AI coding agents, focusing on five key pillars: Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, and Atomic Transition Integrity. The benchmark comprises 30 tasks across various categories and demonstrates that structured process discipline enhances process quality scores by an average of 41% and improves downstream outcome correctness by 17%. This release offers critical insights for practitioners, emphasizing the importance of engineering discipline in the development of reliable AI coding agents, alongside traditional outcome correctness metrics.

arXiv cs.AI34 d agofound 15 d ago#benchmark#agents#coding

AutoACSL: Synthesizing ACSL Specifications by Integrating LLMs with CPG-Based Static Analysis

AutoACSL is a new framework that combines large language models (LLMs) with Code Property Graphs (CPGs) to automate the synthesis of ACSL specifications for C programs. It utilizes static analysis to extract semantic features and generates structured prompts for LLMs, leading to a 98% success rate in specification generation and a 96% verification success with Frama-C/WP, significantly improving proof ratios by up to 51.7% compared to traditional code-only methods. This integration enhances the robustness and effectiveness of automated specification generation, which is crucial for practitioners involved in formal verification processes.

arXiv cs.AI34 d agofound 20 d ago#acsl#llm#code generation

AI-Assisted Help-Seeking Trajectories in Programming Education from an SRL-Informed Perspective

This study analyzes AI-assisted help-seeking trajectories in programming education, focusing on 1,290 student prompts linked to 17,190 code submissions from 71 students in introductory Python courses. It reveals that students predominantly use AI for reactive troubleshooting rather than self-regulated problem-solving, with distinct trajectory patterns affecting the number of code submissions but not significantly impacting task scores. This research highlights the importance of understanding how students interact with AI tools to optimize educational outcomes in programming.

arXiv cs.AI34 d agofound 20 d ago#ai#programming#education

Evaluating LLMs for Real-World Web Vulnerability Detection

This study benchmarks six large language models (LLMs) for their ability to detect web vulnerabilities in WordPress plugins, focusing on SQL injection, stored cross-site scripting, path traversal, and remote code execution. Notably, Claude Opus 4.6 achieved the highest detection rate at 63%, while open-weight MiniMax M2.5 performed comparably at 48%, and self-hosted Qwen 3.5 lagged at 35%. The findings highlight the impact of prompt design on detection efficacy and reveal that no model achieved consistent reporting across iterations, underscoring the challenges of using LLMs for real-world vulnerability detection and providing valuable insights for security practitioners.

arXiv cs.AI34 d agofound 16 d ago#llm#vulnerability detection#web security

Teaching LLMs String Matching, Backtracking, and Error Recovery to Deduce Bases and Truth Tables for the Combinatorially Exploding Bit Manipulation Puzzles

This paper introduces a novel approach to solving bit manipulation puzzles for the NVIDIA Nemotron Model Reasoning Challenge, addressing the limitations of Large Language Models (LLMs) in logical rule deduction. Key innovations include reframing logic-gate deduction as a base-selection task using string similarity, implementing backtracking and error recovery mechanisms, and employing bit tokenization with interactive reasoning to enhance model performance. The proposed method achieved over 96% validation accuracy, marking the highest performance in its category and demonstrating significant advancements for practitioners dealing with combinatorial logic challenges in AI.

arXiv cs.AI34 d agofound 20 d ago#llm#bit-manipulation#algorithm

Beyond Simpson's Paradox: A Cascade of Confounders in AI Agent Pull-Request Co-Authorship

The study published in arXiv examines the impact of human co-authorship on the merge rates of pull requests (PRs) across five AI coding agents, revealing a case of Simpson's Paradox. Analyzing 33,596 PRs from the AIDev dataset, it finds that while aggregate merge rates for human co-authored PRs are lower (53.8% vs. 79.8%), stratification by agent identity shows that agents like Copilot and Devin actually benefit from co-authorship, highlighting the importance of controlling for agent composition and PR structure in evaluations. This research underscores the necessity for practitioners to avoid relying on pooled statistics without stratification, as they may lead to misleading conclusions regarding the efficacy of AI agents in collaborative coding environments.

arXiv cs.AI34 d agofound 15 d ago#ai agents#pull requests#simpsons paradox

Judgment-Grounded Expansion for Peer Review Generation

The paper introduces "judgment-grounded expansion," a novel approach for automatic peer review generation that emphasizes human-AI collaboration. This method involves a structured generate-check-refine process where reviewers provide evaluative claims that the AI system expands into review comments. The authors address challenges in scalable evaluation and candidate set curation, demonstrating that conformal prediction effectively balances candidate set size and coverage, thereby laying the groundwork for future collaborative review generation systems.

arXiv cs.CL34 d agofound 13 d ago#review#generation#automation

Automated sign detection across the Electronic Babylonian Library: A large-scale dataset and end-to-end cuneiform OCR pipeline

A large-scale annotated cuneiform sign dataset has been developed alongside an end-to-end cuneiform OCR pipeline utilizing a Deformable Detection Transformer (DETR) model, evaluated with 173 and 106 classes. The system integrates automatic extraction, heuristic line grouping, and n-gram-based textual similarity evaluation, achieving 28-37% improvements on COCO-style detection metrics and processing 87,668 tablet fragments to yield nearly 2.9 million sign detections. This framework enhances the scalability and interpretability of cuneiform analysis, paving the way for future multimodal and linguistic modeling applications.

arXiv cs.CL34 d agofound 12 d ago#cuneiform OCR#computer vision#dataset

Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri

Koshur Pixel is a newly released large-scale synthetic OCR dataset specifically designed for the Kashmiri language, consisting of 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. The dataset features diverse fonts and text granularities, along with over 25 augmentation strategies to simulate real-world document conditions. This resource is crucial for developing OCR systems for low-resource languages, facilitating the digitization of Kashmiri texts and enhancing language technology applications in under-resourced linguistic contexts.

arXiv cs.CL34 d agofound 13 d ago#ocr#dataset#kashmiri

IfcLLM: Natural Language Querying of IFC Models through Complementary Relational and Graph Representations

IfcLLM is a framework designed for natural language querying of Industry Foundation Classes (IFC) models, integrating both relational and graph representations to optimize attribute retrieval and spatial reasoning. The model achieves first-attempt accuracy ranging from 93.3% to 100% across 30 query scenarios and employs an iterative retry-and-refine reasoning process to handle query failures autonomously. This approach allows for local deployment of an open-weight LLM, making it suitable for data-sensitive architecture, engineering, and construction (AEC) environments, enhancing accessibility to complex building information without the need for specialist knowledge.

arXiv cs.CL34 d agofound 12 d ago#querying#ifc-models#language-models

Transcribing Bengali Text with Regional Dialects to IPA using District Guided Tokens

The paper introduces the District Guided Tokens (DGT) technique for transcribing Bengali text with regional dialects into the International Phonetic Alphabet (IPA). By prepending district tokens to input sequences, the approach fine-tunes transformer-based models, notably achieving superior results with the ByT5 model compared to mT5, BanglaT5, and umT5, particularly in handling out-of-vocabulary words. This work emphasizes the necessity of integrating regional dialect information in natural language processing systems to address the phonological diversity in languages like Bengali.

arXiv cs.AI34 d agofound 13 d ago#transcription#bengali#ipa

Improving Engine Sound Analysis in Hot-Test Environments via a RAB-U-Net (Residual Attention Block U-Net) Noise Removal Method

The study introduces a Residual Attention Block U-Net (RAB-U-Net) for enhancing engine sound analysis by effectively removing background noise during hot tests on production lines. This deep learning model improves the accuracy of engine noise detection compared to traditional methods, demonstrating its potential for real-time applications in automotive diagnostics. The advancement is significant for practitioners as it leverages neural network architectures to enhance sound analysis, thereby improving product quality and performance assessments in manufacturing environments.

arXiv cs.AI34 d agofound 16 d ago#noise removal#deep learning#engine analysis

CNnotator: LLM-Guided Memory Safety Annotation Synthesis

CNnotator is a hybrid testing and verification tool that utilizes large language models (LLMs) to automatically synthesize memory safety annotations for legacy C code. The OpenAI o3 reasoning model achieved a 90% success rate on initial attempts and 97% overall in generating CN specifications for small-to-medium C programs, while GPT-4o achieved a 65% success rate on first attempts. This advancement indicates the potential for AI-assisted annotation to enhance memory safety in existing C codebases, facilitating migration to safer programming languages.

arXiv cs.AI34 d agofound 16 d ago#memory-safety#annotations#llm

NL2Scratch: An Executable Benchmark and Evaluation for Block-Based Programming

NL2Scratch is introduced as an executable benchmark for natural-language-to-Scratch code generation, featuring 311,648 parser-valid program pairs derived from actual Scratch projects paired with aligned natural language descriptions. It employs a new metric, Semantic Alignment Consistency (SAC), to evaluate the semantic agreement between descriptions and programs, revealing significant discrepancies between lexical similarity and semantic alignment in instruction-tuned and fine-tuned LLMs. This benchmark is crucial for practitioners as it highlights the limitations of traditional evaluation metrics and provides insights into common failure modes in NL2Code tasks, particularly in handling operational slots.

arXiv cs.AI34 d agofound 16 d ago#nl2code#benchmark#scratch

Formally Verified Code Synthesis for Structured Data Translation in a Medical Internet of Things

This article presents a code synthesis system that leverages a large language model (LLM) for structured data translation in Medical Internet of Things applications, specifically focusing on integrating a pulse oximeter into an existing network. The system incorporates a formal verification step to ensure that the generated code adheres to predefined requirements, enabling reliable translation between the device's JSON schema and the Fast Healthcare Interoperability Resources (FHIR) format. Experimental results indicate that the system consistently produces correct translations at a low cost, highlighting its potential for practitioners needing trustworthy code generation in healthcare settings.

arXiv cs.AI34 d agofound 16 d ago#llm#code#synthesis

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

The article presents CoDe-R, a two-stage framework designed to enhance binary decompilation using Large Language Models (LLMs). It features a 1.3B parameter backbone and introduces two key innovations: Semantic Cognitive Enhancement (SCE) for recovering algorithmic intent and a Dynamic Dual-Path Fallback (DDPF) mechanism for adaptive inference. CoDe-R achieves a new state-of-the-art on the HumanEval-Decompile benchmark, exceeding a 50% average re-executability rate, which is significant for practitioners focusing on improving the accuracy and reliability of decompiled code.

arXiv cs.AI34 d agofound 14 d ago#decompiler#LLM#code refinement

From Empirical Evaluation to Context-Aware Enhancement: Repairing Regression Errors with LLMs

The study presents RegressionBug4APR, a benchmark comprising 200 regression bugs from Java and Python, aimed at evaluating automated program repair (APR) techniques, particularly those utilizing large language models (LLMs). Traditional APR methods failed to address these bugs, while LLM-based approaches demonstrated a 1.6x improvement in repair success when enhanced with context from bug-inducing changes. This research underscores the importance of context-aware strategies in improving the efficacy of LLMs for fixing regression errors, providing valuable insights for practitioners in software debugging and APR development.

arXiv cs.AI34 d agofound 14 d ago#regression bugs#APR#benchmark

Self-Stigma Is Not a Monolith, but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs

A proof-of-concept study presents a persona-aware approach to LLM support for people who use drugs (PWUD), based on a four-persona typology derived from Latent Profile Analysis of self-stigma expressions on Reddit. The study demonstrates that sequential Bayesian and recurrent neural classifiers can effectively identify these personas, achieving a macro-F1 score of 0.74 with only 30 posts, outperforming traditional LLM baselines. The findings highlight a tension between persona-matched responses that drive behavioral change and the preference for generic empathy in evaluations, indicating the need for nuanced assessment rubrics in LLM-based stigma support.

arXiv cs.CL34 d agofound 12 d ago#llm#support#self-stigma

Text2DSL: LLM-Based Code Generation for Domain-Specific Languages

The paper introduces Text2DSL, a novel approach for automatic code generation in domain-specific languages (DSLs) from natural language descriptions, distinct from existing paradigms like Text-to-SQL. It presents the PolkitBench dataset, containing 4,204 validated natural-language-to-Polkit-rule pairs, and evaluates two mixture-of-experts models: GigaChat-10B-A1.8B and Nemotron-3-Nano-30B. Key findings reveal that incorporating structured context significantly enhances syntactic and structural validity, achieving up to 99.4% syntactic validity and a 95% increase in CodeBLEU scores, highlighting the importance of formal specifications in improving LLM performance for DSL code generation.

arXiv cs.AI34 d agofound 20 d ago#code_generation#domain_specific_languages#llm

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

The paper presents a novel framework called Hierarchical Reinforcement Learning with Language Instructions (HRLLI), which enhances sample efficiency in RL by using dynamically selectable natural-language instructions as guidance. HRLLI employs a two-level policy structure within a Select-to-Act paradigm, where a high-level policy selects relevant instruction pieces based on the current state, while a low-level policy executes actions conditioned on this guidance. Experimental results on the RTFM benchmark indicate that HRLLI significantly outperforms existing instruction-conditioned RL methods, highlighting the importance of adaptive instruction selection in complex decision-making environments.

arXiv cs.AI34 d agofound 15 d ago#reinforcement learning#code compliance#llm

Leveraging Large Language Models to Obscure Code Stylometry: A Comparative Study of GPT-3.5 and GPT-4

This study examines the use of Large Language Models (LLMs), specifically GPT-3.5 and GPT-4, to obscure code stylometry for authorship attribution and cybersecurity. It evaluates the models' ability to modify code while preserving functionality, employing various prompt engineering strategies, and demonstrates significant differences in effectiveness between single-shot and multi-shot prompting. The findings underscore the challenges in maintaining code integrity and the implications for authorship detection techniques in the context of advanced AI, which is crucial for practitioners in cybersecurity and software engineering.

arXiv cs.AI34 d agofound 15 d ago#stylometry#llm#gpt

Revelio: Cost-Efficient Agentic Memory Safety Vulnerability Detection For Repository-Scale Codebases

Revelio is an end-to-end framework designed for cost-efficient detection of memory safety vulnerabilities in large codebases, leveraging inexpensive large language models (LLMs) and lightweight static analysis. It generates executable Proof-of-Vulnerability to mitigate hallucination issues, confirming vulnerabilities with a deterministic sanitizer. Evaluated on seven production-quality projects and 100 CyberGym benchmark projects, Revelio identified 19 previously unknown vulnerabilities at a total cost of $300, outperforming existing coding agents on benchmarks, thus providing a scalable solution for practitioners in memory safety detection.

arXiv cs.AI34 d agofound 15 d ago#memory#safety#vulnerability

Codex logging bug may write TBs to local SSDs

A bug in Codex has been identified that may result in excessive logging, potentially writing terabytes of data to local SSDs. This issue could lead to significant storage consumption and performance degradation for users, highlighting the need for careful management of logging practices in AI applications. Practitioners should be aware of this bug to mitigate risks associated with data storage and system performance.

Hacker News35 d agofound 21 d ago#codex#bug#logging

Codex-maxxing for long-running work

Jason Liu demonstrates the use of Codex for maintaining context in long-running tasks, enabling the management of complex projects that require continuity beyond a single prompt. This approach leverages Codex's capabilities to enhance workflow efficiency, which is crucial for practitioners developing applications that necessitate sustained interaction and context retention in AI-driven environments.

OpenAI News35 d agofound 21 d ago#codex#context#project_management

sqlite-utils 4.0rc1 adds migrations and nested transactions

sqlite-utils 4.0rc1 introduces two significant features: support for database migrations and nested transactions. The migration functionality, derived from the sqlite-migrate package, allows users to define and execute schema changes programmatically using a Python API, enhancing the library's capabilities for managing SQLite databases. This release is crucial for practitioners as it simplifies database schema evolution and improves transaction handling, which are essential for building robust applications.

Simon Willison35 d agofound 21 d ago#sqlite#python#database

sqlite-utils 4.0rc1

sqlite-utils 4.0rc1 has been released, introducing support for migrations and nested transactions. This update enhances the functionality of sqlite-utils, allowing developers to manage database schema changes more effectively and handle complex transaction scenarios. These features are crucial for practitioners looking to implement robust data management solutions in their applications.

Simon Willison35 d agofound 21 d ago#sqlite#python#database

Vercel CEO: "Almost shocked" by how good GLM-5.2 is at coding

Guillermo Rauch, CEO of Vercel, expressed strong admiration for the coding capabilities of GLM-5.2, indicating a significant advancement in its performance. While specific technical details such as model size or benchmark results were not provided, the statement underscores the model's potential impact on software development practices. This level of performance may influence practitioners to integrate GLM-5.2 into their workflows for improved coding efficiency and effectiveness.

Reddit r/LocalLLaMA36 d agofound 21 d ago#glm#coding#vercel

Solving Wordle using information theory

The article discusses the application of information theory to optimize strategies for solving Wordle. It analyzes the game's mechanics through the lens of entropy and information gain, proposing methods to select guesses that maximize the reduction of uncertainty about the target word. This approach can enhance the efficiency of algorithms designed for word-guessing games, offering insights for practitioners focused on game theory and natural language processing.

Hacker News36 d agofound 12 d ago#wordle#information-theory

Show HN: Tiny – An interpeted dynamic langauge with inline Go native functions

Tiny is an interpreted dynamic language that allows developers to write inline Go native functions, facilitating seamless integration of Go's performance with dynamic scripting capabilities. This hybrid approach enables practitioners to leverage Go's efficiency while maintaining the flexibility of dynamic languages, potentially enhancing the performance of applications that require both rapid development and execution speed. The ability to call native functions directly may streamline workflows for developers working on performance-critical applications.

Hacker News36 d agofound 22 d ago#dynamic-language#go#show-hn

You can now convert EXL3 quants on Apple Silicon Mac

EXL3 quantization, previously limited to CUDA environments and high-end RTX cards, is now accessible on Apple Silicon Macs, allowing users with 64GB+ memory to run and convert these models. Notably, the MiniCPM5 and Qwen3.6-27B models show competitive performance with mean KLD metrics comparable to those processed on RTX hardware. This development enhances the accessibility of high-fidelity quantization for practitioners, enabling more efficient deployment of models on consumer-grade hardware.

Reddit r/LocalLLaMA36 d agofound 22 d ago#exl3#macos#quantization

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS?

The discussion compares Qwen3.6-27B@BF16, a dense model operating at full precision, with Step3.7@IQ4_XS, a mixture of experts (MoE) model featuring 7x more parameters but suboptimal quantization. The user expresses frustration with Qwen-Coder-Next@Q8's performance in programming tasks, highlighting issues with suboptimal code generation and maintainability, raising questions about the decision-making capabilities of these models in practical applications. This comparison is relevant for practitioners evaluating model trade-offs between precision, memory usage, and practical output quality in AI programming tasks.

Reddit r/LocalLLaMA36 d agofound 22 d ago#qwen#step

Show HN: We post-trained a model that pen tests instead of refusing

The article discusses the development of a model specifically designed for penetration testing, diverging from the typical behavior of refusing to engage in such activities. It highlights the implications of post-training a language model to perform security assessments, which could enhance the capabilities of AI in identifying vulnerabilities. This approach may provide practitioners with a tool that can assist in automated security evaluations, potentially streamlining the penetration testing process.

Hacker News36 d agofound 22 d ago#ai#pen-testing#model

Best Settings for 48GB VRAM + Qwen 3.6 27B

The article discusses optimal settings for running the Qwen 3.6 27B model on a dual-GPU setup comprising an RTX 4090 and RTX 3090, totaling 48GB of VRAM. Key configurations include using Q8_0 quantization, tensor split mode, 250k context length, and enabling speculative decoding with draft MTP, achieving performance metrics of 75-100t/s token generation and 1500 tokens per request. These settings are significant for practitioners as they maximize resource utilization and performance in high-demand AI applications.

Reddit r/LocalLLaMA37 d agofound 22 d ago#qwen#llm#vram#settings

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

The article discusses the successful execution of the Qwen 3.6 27B model with a context length of 131k on a 7900XTX GPU with 24GB of VRAM, utilizing techniques to optimize memory usage. By configuring the system to use integrated graphics for booting and employing kvcache quantization at Q5_0/Q4_0, the implementation achieves approximately 55-60 tokens per second while reducing VRAM usage by 12% compared to Q8. This setup is significant for practitioners as it demonstrates a practical approach to maximizing context length in large language models while managing VRAM constraints effectively.

Reddit r/LocalLLaMA37 d agofound 22 d ago#qwen#llm#context

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

This article presents a vision-based automated scoring system for student-generated scientific drawings, utilizing a Vision Transformer (ViT) with parameter-efficient adaptation. The proposed confidence-aware scoring framework assesses response-level confidence from predictive distributions, allowing for automated scoring of high-confidence responses while deferring uncertain cases for human evaluation. This approach enhances scoring reliability and supports a balance between automation and assessment accuracy, which is crucial for scalable educational applications aligned with the Next Generation Science Standards (NGSS).

arXiv cs.AI38 d agofound 24 d ago#automated assessment#education#vision

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) in power systems, utilizing a common electromagnetic transient dataset with decision windows of 10-50 ms. The top-performing nonlinear models for FC achieve F1 scores exceeding 0.98 at 10 ms, while FL models attain a stable localization error of approximately 10% of normalized line length, highlighting the importance of decision timing and topology in model performance. These results establish a standardized reference for evaluating machine learning approaches in power system protection tasks, crucial for practitioners aiming to enhance reliability in modern, complex power networks.

arXiv cs.AI38 d agofound 23 d ago#fault classification#machine learning#power systems

Denoising Implicit Feedback for Cold-start Recommendation

The paper introduces a model-agnostic denoising method called DIF for improving cold-start recommendations by addressing the noise in implicit feedback. DIF infers pseudo-labels for cold items using content-similar warm items, enhances label accuracy through confidence modeling based on content similarity, and estimates label uncertainty to adaptively correct noisy samples. This approach has demonstrated significant improvements in commercial metrics when deployed in a billion-user short video application, highlighting its practical relevance for practitioners dealing with cold-start scenarios in recommendation systems.

arXiv cs.AI38 d agofound 24 d ago#recommendation#denoising#implicit-feedback

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

The article introduces ZeSTA, a domain-conditioned training framework for zero-shot text-to-speech (ZS-TTS) aimed at enhancing personalized speech synthesis in low-resource settings. By employing a lightweight domain embedding and real-data oversampling, ZeSTA effectively mitigates speaker similarity degradation during fine-tuning without altering the base architecture. Experimental results on LibriTTS and an in-house dataset indicate that this approach improves speaker similarity while maintaining intelligibility and perceptual quality, making it a valuable technique for practitioners in personalized TTS applications.

arXiv cs.AI38 d agofound 22 d ago#tts#data-augmentation#speech-synthesis

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore is a new psychometrically-aware framework for Automated Essay Scoring (AES) that integrates scoring and instructional feedback through a shared latent ability representation. It features a Trait-Adaptive Neural IRT Scorer utilizing the Graded Partial Credit Model (GPCM) for precise student ability estimation, a ZPD-Scaffolded Feedback Generator that adapts feedback based on diagnosed proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy for assessing feedback quality. Experiments on the ASAP++ dataset show that PsyScore not only achieves competitive scoring performance but also offers feedback that is more aligned with pedagogical needs, making it significant for practitioners seeking to enhance the effectiveness of LLM-based educational tools.

arXiv cs.CL38 d agofound 22 d ago#essay scoring#feedback#llm

Source-Grounded Data Generation for Text-to-JSON Learning

The article introduces STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline designed to create structured JSON outputs from unstructured text using large language models (LLMs). Evaluations on the STAGE-Eval benchmark, which includes 851 examples, demonstrate significant performance improvements for the Qwen3-4B model, with exact match rates increasing from 31.37% to 74.27% and value accuracy rising from 45.46% to 90.69%. This advancement is crucial for practitioners as it enhances the reliability and scalability of training data for text-to-JSON tasks, facilitating better integration of unstructured data into automated systems.

arXiv cs.CL38 d agofound 22 d ago#text-to-json#data generation#llm

Multi-View Decompilation for LLM-Based Malware Classification

The article presents a study on multi-view decompilation for enhancing malware classification using large language models (LLMs). It introduces a benchmark of benign and malicious binaries decompiled with both Ghidra and RetDec, demonstrating that utilizing multiple decompiler outputs significantly improves the F1 score for malicious classifications, primarily by increasing recall on malicious samples. This approach suggests that multi-decompiler prompting can serve as an effective, training-free method for practitioners to enhance the accuracy of LLM-based malware analysis.

arXiv cs.AI38 d agofound 23 d ago#malware#decompilation#llm

DataMagic: Transforming Tabular Data into Data Insight Video

DataMagic is an end-to-end interactive system designed to convert raw tabular data and natural language queries into narrative data-insight videos, addressing the limitations of existing visualization tools. It introduces the declarative specification DVSpec to ensure data fidelity by linking visual elements to data fields and employs a Generate-then-Orchestrate multi-agent architecture for efficient scene generation and narrative coherence optimization. This system enhances data consumption by providing three interaction modes and structured provenance-based data Q&A, making it a valuable tool for practitioners aiming to create dynamic data narratives.

arXiv cs.AI38 d agofound 23 d ago#data-visualization#narrative#insights

Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology

This study presents an empirical framework for optimizing large language models (LLMs) in the context of construct identification in psychology through prompt engineering. It evaluates five prompting strategies, revealing that the most effective approach combines codebook-guided empirical prompt selection with automatic prompt engineering for few-shot classification tasks. The findings emphasize the importance of construct definitions and task framing in prompt design, offering a systematic method for enhancing LLM alignment with expert judgments, which is crucial for practitioners in fields requiring precise classification.

arXiv cs.CL38 d agofound 22 d ago#prompt-engineering#llm#psychology#classification

Secure Coding Drift in LLM-Assisted Post-Quantum Cryptography Development: A Gamified Fix

The paper presents a novel socio-technical vulnerability model called Secure Coding Drift in Post-Quantum Cryptography (PQC), which addresses the risks associated with reliance on LLM-generated code that can lead to insecure implementations. It proposes a gamified framework that integrates adversarial evaluation, behavioral feedback, and security scoring into development workflows, transforming LLMs from passive tools into active security co-pilots. This approach is crucial for practitioners as it aims to enhance secure coding practices in the complex landscape of PQC development.

arXiv cs.AI38 d agofound 23 d ago#post-quantum-cryptography#secure-coding#llm

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

The paper introduces a method for repurposing a conventional speech classifier to serve as the backbone for guided diffusion-based speech generation, eliminating the need for a separate classifier and diffusion model. By utilizing a frozen noise-conditioned classifier in log-Mel space and attaching a lightweight subnetwork trained under a Denoising Score Matching objective, the approach achieves high speech quality while significantly reducing memory and computational costs. This advancement is significant for practitioners as it streamlines the model architecture for conditional speech synthesis, enhancing efficiency without compromising performance.

arXiv cs.AI38 d agofound 23 d ago#speech-generation#diffusion#classifier

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

The article introduces JamSet and JamBench, the first project-level code framework dataset and benchmark for professional game engines, derived from over 240,000 open-source projects from Game Jam competitions. Utilizing the Godot engine, the dataset includes 8,133 verified projects, with 300 manually validated for JamBench, which evaluates theme-driven generation and code completion tasks through metrics like Structural Completeness Score (SCS) and Behavioral Alignment Score (BAS). The findings highlight a significant decline in runtime pass rates as project size increases, indicating that architectural design challenges are a key barrier for AI models in game development, making this dataset crucial for advancing research in AI-driven game coding.

arXiv cs.CL38 d agofound 22 d ago#game-development#dataset#benchmark#code-generation

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

The paper presents a framework for hardware generation that integrates large language models (LLMs) with formal methods to ensure correctness in register-transfer level (RTL) design. It introduces a set of transformation rules that guide the LLM in converting design specifications into RTL code while minimizing errors. This approach addresses the concerns of hardware engineers regarding LLM hallucinations, making it a significant advancement for practitioners in chip design and manufacturing who require reliable and interpretable outputs.

arXiv cs.AI38 d agofound 23 d ago#hardware-generation#llm#rtl#formal-methods

Ensembles of Large Language Models for Identifying EQ-5D Studies in PubMed Based on Their Abstracts

This study presents a multi-phase framework utilizing Google's Gemini and Gemma large language models to automate the identification of EQ-5D studies in PubMed based on abstracts. The ensemble of gemini-2.5-pro, gemma-3-12b, and gemma-3-27b achieved a weighted F1-score and accuracy of 0.74, demonstrating improved precision and recall over individual models. This approach highlights the potential for ensemble-based LLMs to enhance efficiency and reliability in systematic literature reviews within biomedical research.

arXiv cs.AI38 d agofound 23 d ago#LLM#PubMed#EQ-5D

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen is a novel framework that enables the synthesis of PCB schematics using a frozen 31B parameter LLM (Gemma-4-31B) by employing a structured verification approach. It integrates a domain schema derived from IC datasheets with a 5-layer continuous-reward verifier for pin-level error localization, achieving an average pass rate of 81.3% on 227 real-IC tasks across multiple circuit domains. This framework's ability to refine LLM outputs in the absence of traditional unit-test oracles presents a significant advancement for practitioners working on code synthesis in specialized areas like PCB design.

arXiv cs.AI38 d agofound 23 d ago#code synthesis#pcb design#llm

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass is a multi-agent framework designed for compiler performance tuning that leverages evidence from both the compiler and runtime to guide optimization decisions made by Large Language Models (LLMs). It allows LLMs to interact with the compiler's internal states and analyze intermediate representations without requiring offline training or fine-tuning, making it adaptable to various benchmarks and platforms. Evaluated on the LLVM compiler, AutoPass achieved geometric-mean speedups of 1.043x on x86-64 and 1.117x on ARM64 compared to LLVM's -O3, demonstrating its effectiveness over traditional auto-tuning methods.

arXiv cs.AI38 d agofound 23 d ago#compiler#llm#performance-tuning

Library-Aware Doubles and Iterative Repair for Large Language Model-Generated Unit Tests in OpenSIL Firmware

The study presents an automated unit test (UT) authoring workflow for the Open-Source Silicon Initialization Library (openSIL) firmware, utilizing a multi-agent pipeline guided by a large language model (LLM). Key technical features include library-aware generation of test scaffolds and an iterative compile-dispatch repair loop, which significantly improved compilation success rates to 96% across 76 functions and achieved mean line coverage of 98.8% under optimal conditions. This approach is crucial for practitioners as it enhances the efficiency and reliability of UT generation in low-level firmware development, addressing common challenges related to build constraints and manual debugging.

arXiv cs.AI38 d agofound 23 d ago#unit tests#firmware#LLM

Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

Multi-LCB is a newly introduced benchmark that extends the LiveCodeBench (LCB) framework to evaluate large language models (LLMs) across twelve programming languages, including Python. It adapts Python tasks from LCB into equivalent tasks in other languages while maintaining contamination controls and evaluation protocols, allowing for systematic assessment of multilingual code generation capabilities. The evaluation of 24 LLMs revealed issues such as Python overfitting and significant performance disparities across languages, highlighting the need for improved generalization in LLMs for real-world software engineering tasks.

arXiv cs.AI38 d agofound 24 d ago#benchmark#code generation#multi-language

Salesforce CodeGen Tutorial: Generate, Validate, and Rerank Python Functions With Unit Tests and Safety Checks

The article outlines a comprehensive workflow for utilizing Salesforce CodeGen, which is available on Hugging Face, to generate and validate Python functions. Key features include function extraction, syntax checking, static safety checks, and unit-test validation, along with a reranking mechanism for best-of-N candidates and multi-turn program synthesis. This tutorial is significant for AI practitioners as it enhances the reliability and safety of code generation tasks, enabling the creation of more robust AI applications.

MarkTechPost38 d agofound 24 d ago#codegen#python#unit-tests

Source: Elastic agrees to buy CRV-backed Deductive AI for up to $85M

Elastic has agreed to acquire Deductive AI, a startup focused on using AI for software bug detection and resolution, for up to $85 million. This acquisition may enhance Elastic's capabilities in automated software testing and debugging, which is crucial for improving software reliability and efficiency in AI-driven applications.

TechCrunch AI38 d agofound 24 d ago#ai#software#bug detection

Anthropic brings Artifacts to Claude Code, letting teams share live pages from coding sessions

Anthropic has introduced a feature called "artifacts" to Claude Code, enabling users to convert coding session results into interactive web pages that can be shared with teams. These artifacts automatically update with changes and maintain a version history, enhancing collaboration by providing real-time access to session context. This feature is significant for practitioners as it streamlines the sharing of code outputs and fosters collaborative development workflows in AI projects.

The Decoder38 d agofound 25 d ago#claude#artifacts#coding-sessions

Multilingual-Multimodal-NLP/LoopCoder-V2 · Hugging Face

LoopCoder-V2 is a 7 billion parameter instruction-tuned code model based on the Parallel Loop Transformer (PLT), designed for efficient test-time computation scaling. It achieves significant performance improvements in code generation and reasoning tasks, with the optimal configuration involving two loops that enhance latent refinement without incurring diminishing returns seen with additional loops. This model, trained on 18 trillion tokens of mixed text and code, is crucial for practitioners focused on developing efficient AI systems in multilingual and multimodal contexts.

Reddit r/LocalLLaMA39 d agofound 29 d ago#multimodal#loopcoder-v2#huggingface

TREX: An AI code reviewer that runs your code

TREX is an AI-powered code reviewer that not only analyzes code but also executes it to provide real-time feedback. This tool aims to enhance the code review process by integrating execution capabilities, allowing for dynamic testing and validation of code snippets. Its practical application can significantly improve the accuracy of code assessments and streamline the development workflow for practitioners working with AI-driven software development tools.

Hacker News39 d agofound 25 d ago#ai#code#reviewer

Show HN: Microcrad – Micrograd Reimplemented in C

Microcrad is a reimplementation of the Micrograd library in C, aimed at providing a lightweight and efficient alternative for automatic differentiation. This version retains the core functionalities of the original Micrograd, allowing for gradient-based optimization in neural networks. The C implementation may offer performance benefits for practitioners needing lower-level control and efficiency in their AI models.

Hacker News39 d agofound 22 d ago#micrograd#show-hn

findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

The article introduces findsylls, a language-agnostic toolkit designed for syllable-level speech tokenization and embedding, which integrates various syllable detection methods under a unified interface. It supports syllable segmentation, embedding extraction, and multi-granular evaluation, facilitating controlled comparisons of algorithms and representations. This toolkit is significant for practitioners as it standardizes syllabification processes across diverse languages, enhancing reproducibility and enabling research in both high-resource and under-resourced linguistic contexts.

arXiv cs.AI40 d agofound 25 d ago#speech tokenization#embedding

DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL

DecoSearch is a new training-free framework for improving text-to-SQL translation, specifically targeting complex queries that require multi-step reasoning. It utilizes a lightweight Schema Selector to identify relevant database components, an LLM Judger to determine the need for query decomposition, and a Directed Acyclic Graph (DAG) to manage atomic sub-questions, achieving 70.53% accuracy on the BIRD dataset and 88.31% on Spider. This model-agnostic approach enhances existing SQL generation models without altering their architecture, significantly reducing token consumption compared to other methods, which is crucial for practitioners aiming for efficient and effective query processing.

arXiv cs.AI40 d agofound 29 d ago#text-to-sql#llm#routing

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

The article introduces Brick-DICL, a dynamic in-context learning framework designed for automated classification of the Brick schema in Building Management Systems (BMS). It features a two-stage architecture comprising metadata-RAG for enhancing LLM domain knowledge and class-RAG for narrowing classification options among 936 Brick classes, along with a multi-LLM filtering mechanism to improve prediction confidence. This approach significantly enhances classification accuracy and reduces manual verification effort, facilitating faster integration of standardized BMS across diverse datasets, which is crucial for interoperability in smart building technologies.

arXiv cs.AI40 d agofound 29 d ago#llm#code reasoning#diagnostics

From Paper to Program: Knowledge Externalization for AI-Assisted Quantum Many-Body Code Generation

The article presents a novel multi-stage workflow for knowledge externalization in AI-assisted code generation for quantum many-body physics, addressing the fragility of paper-to-program translation due to implicit conventions. The workflow was evaluated on two tasks: DMRG and Pfaffian conversion, achieving a significant improvement in validation rates (16/16 for DMRG with specifications versus 6/13 without, and 11/26 for Pfaffian-MPS compared to 0/26). This approach enhances the reliability of generating scientific code from literature, providing a structured protocol that aids practitioners in implementing complex algorithms while identifying points of failure in the externalization process.

arXiv cs.AI40 d agofound 25 d ago#quantum computing#code generation

Querying an astronomical database using large language models: the ALeRCE text-to-SQL system

The ALeRCE text-to-SQL system leverages large language models (LLMs) to facilitate querying the ALeRCE astronomical database through natural language, translating it into executable SQL queries. The system incorporates a four-module framework—schema linking, query classification, prompt decomposition, and self-correction—and was evaluated using a dataset of 110 natural language/SQL pairs. Notably, Claude Opus 4.6 achieved a perfect-match performance of 0.97 and 0.94 for row and column identifiers on simple queries, with performance declining for more complex queries, highlighting the importance of model selection and architectural enhancements in improving text-to-SQL capabilities for practical applications in data querying.

arXiv cs.AI40 d agofound 28 d ago#text-to-sql#llm#database-querying