Coding
Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment
The article presents a novel agentic judging pipeline that uses a strong LLM to improve architectural reasoning in software engineering. It is applied to Qwen3 models (8B, 14B, and 32B). Fine-tuning these models on 3,360 curated instances yielded a resolved rate of 27.2% on SWE-bench Verified, with improvements of up to 540% over the base model. The approach makes architectural evaluation scalable. This matters to practitioners who want to improve code quality and maintainability in real-world development.
llm, software engineering, architectural reasoning