Research
CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia
CzechDocs is a newly released multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) in Czech and several minority languages of Czechia, including Ukrainian and English. It is intended to support the evaluation of machine translation systems on how well they preserve document formatting, and it comes with a validation subset and an evaluation toolkit for research use. The dataset is relevant to practitioners working on document-level translation, particularly those aiming to improve formatting preservation in machine translation systems.
machine translation, dataset, czech