Training
Spokes: Optimizing for Diverse Pretraining Data Selection
The paper introduces SPOKES, a probabilistic diversification framework that selects pretraining data by directly optimizing the G-Vendi diversity score with exponentiated gradient descent. On a 500k-sample subset, SPOKES raises the G-Vendi score by 489 and improves downstream performance over random sampling by 0.4 points on DCLM and 0.5 points on FineWeb. The results indicate that jointly optimizing for quality and diversity works best, outperforming existing methods such as semantic deduplication and quality filtering. This is relevant for practitioners who want pretrained models that are more robust and generalize better.
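To make the optimization idea concrete, here is a minimal sketch of exponentiated gradient ascent on a Vendi-style diversity score over sampling weights. It is not the paper's implementation. It assumes the score is the exponential of the entropy of the eigenvalues of a weighted covariance matrix built from unit-norm embeddings, and it uses random toy vectors as a stand-in for whatever gradient-based features G-Vendi actually uses. All function names, the step size, and the toy data are hypothetical.

```python
import numpy as np


def weighted_covariance(X, p):
    """Return sum_i p_i * x_i x_i^T for the unit-norm rows x_i of X."""
    return (X * p[:, None]).T @ X


def vendi_score(X, p, eps=1e-12):
    """Exponential of the entropy of the weighted covariance spectrum."""
    lam = np.linalg.eigvalsh(weighted_covariance(X, p))
    lam = lam[lam > eps]  # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(lam * np.log(lam))))


def entropy_gradient(X, p, eps=1e-12):
    """Gradient of the spectral entropy with respect to each weight p_i.

    With C = sum_i p_i x_i x_i^T and H = -tr(C log C), the derivative is
    dH/dp_i = -x_i^T (log C) x_i - 1 for unit-norm x_i.
    """
    lam, V = np.linalg.eigh(weighted_covariance(X, p))
    log_lam = np.log(np.clip(lam, eps, None))
    proj = X @ V  # coordinates of each row in the eigenbasis
    return -(proj ** 2) @ log_lam - 1.0


def exponentiated_gradient_ascent(X, steps=100, lr=0.5):
    """Multiplicative-weights update that keeps p on the probability simplex."""
    n = X.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(steps):
        logits = np.log(p) + lr * entropy_gradient(X, p)
        logits -= logits.max()  # stabilize the exponentials
        p = np.exp(logits)
        p /= p.sum()
    return p


def make_toy_embeddings(seed=0):
    """Assumed toy data: 90 near-duplicate rows plus two small clusters."""
    rng = np.random.default_rng(seed)
    counts = [90, 5, 5]
    rows = [c + 0.05 * rng.standard_normal((k, 3))
            for c, k in zip(np.eye(3), counts)]
    X = np.vstack(rows)
    return X / np.linalg.norm(X, axis=1, keepdims=True)


if __name__ == "__main__":
    X = make_toy_embeddings()
    uniform = np.full(X.shape[0], 1.0 / X.shape[0])
    p = exponentiated_gradient_ascent(X)
    print("score with uniform weights:  ", vendi_score(X, uniform))
    print("score with optimized weights:", vendi_score(X, p))
    # Draw a subset according to the learned sampling distribution.
    rng = np.random.default_rng(1)
    subset = rng.choice(X.shape[0], size=20, replace=False, p=p)
    print("rows drawn from the rare clusters:", int(np.sum(subset >= 90)))
```

In this toy setting the update shifts probability mass away from the large near-duplicate cluster and toward the two rare clusters, which is the intuition behind treating selection as a distribution to optimize rather than a fixed filter. A quality term, which the summary says is optimized jointly with diversity, is not modeled here.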
data-selection optimization