Training
Data Augmentations for Data-Constrained Language Model Pretraining
The paper studies data augmentation as a way to counter overfitting in autoregressive (AR) language model pretraining when training data is limited. It introduces three types of augmentation: token-level noise, sequence permutations, and target offset prediction. The authors show that these methods significantly delay overfitting and reduce validation loss, with random token replacement yielding the best results. For practitioners, the work offers practical strategies for training more efficiently when high-quality data is scarce.
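To make the best-performing augmentation concrete, here is a minimal sketch of random token replacement on a sequence of token ids. It is an illustration under stated assumptions, not the paper's implementation: the function names, the replacement rate `p`, uniform sampling over the vocabulary, and the choice to corrupt only the inputs while leaving next-token targets clean are all hypothetical.

```python
import random


def random_token_replacement(tokens, vocab_size, p=0.1, rng=None):
    """Return a copy of `tokens` in which each position is independently
    replaced, with probability `p`, by an id drawn uniformly from
    [0, vocab_size). The rate and the uniform draw are assumptions."""
    rng = rng or random.Random()
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def make_training_pair(tokens, vocab_size, p=0.1, rng=None):
    """Build (inputs, targets) for next-token prediction.

    Assumption: noise is applied to the inputs only, so the model still
    learns to predict the original, uncorrupted next token."""
    inputs = random_token_replacement(tokens[:-1], vocab_size, p, rng)
    targets = list(tokens[1:])
    return inputs, targets


# Example: a fresh corruption can be drawn every epoch from the same sequence.
sequence = [5, 17, 42, 8, 99, 3]
noisy_inputs, clean_targets = make_training_pair(
    sequence, vocab_size=100, p=0.3, rng=random.Random(0)
)
```

Because a new corruption is sampled each time the function is called, repeated passes over a small dataset no longer present the model with identical inputs, which is the intuition behind using noise to delay overfitting.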
data augmentation, language model, pretraining