Coding
CodeAlchemy: Synthetic Code Rewriting at Scale
CodeAlchemy is a synthetic data generation framework that improves training data for code-related tasks by rewriting publicly sourced code with five strategies. It yields over 500 billion tokens of synthetic data and 350 billion reasoning tokens. The framework also includes benchmarks such as DevEval and TraceEval. 3B models achieve an 83.5% pass rate on HumanEval and, in certain tasks, outperform larger models such as the 27B Gemma-3 and 32B Granite-4.0 by a factor of ten. For AI practitioners, these results highlight the potential of synthetic data to improve semantic understanding in code generation and execution tasks.
code, synthetic-data, code-rewriting