ai-digest.dev
Research · arXiv cs.CL · 16 d ago

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

CzechDocs is a newly released multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) in Czech and several minority languages in Czechia, including Ukrainian and English. It is designed for evaluating how well machine translation systems preserve document formatting, and it ships with a validation subset and an evaluation toolkit for research use. The dataset is relevant to practitioners working on document-level translation, particularly those trying to improve formatting preservation in their systems.

Tags: machine translation, dataset, czech · relevance 0.00 · engagement 0.00