Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Morpheus is a morphology-aware neural tokenizer and word embedder designed for Turkish. It addresses a limitation of existing subword tokenizers, which do not preserve morpheme integrity. Using a differentiable Poisson-binomial dynamic programming approach, Morpheus achieves lossless tokenization at 1.425 bits per character and superior morphological alignment (MorphScore macro-F1 of 0.61), while using less memory than traditional 64K-vocabulary tokenizers. For practitioners, Morpheus offers improved lexical retrieval and verification, which makes it useful for tasks involving agglutinative languages, although it may lag behind heavier contextual models on certain context-dependent tasks.
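
To make the phrase "Poisson-binomial dynamic programming" concrete, the sketch below computes a Poisson-binomial distribution: the distribution of the number of successes among independent Bernoulli trials with different success probabilities. This is the generic textbook recurrence, not Morpheus's actual implementation, and the summary above does not say how Morpheus applies it. The function name `poisson_binomial_pmf`, the example word, and the boundary probabilities are illustrative assumptions. Under the assumed reading, each gap between characters carries a predicted probability of being a morpheme boundary, and the recurrence yields the distribution over the number of boundaries in the word.

```python
from typing import List


def poisson_binomial_pmf(probs: List[float]) -> List[float]:
    """Return [P(K = 0), ..., P(K = n)], where K counts the successes among
    n independent Bernoulli trials with success probabilities `probs`."""
    pmf = [1.0]  # with zero trials, K = 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # trial fails: count stays at k
            nxt[k + 1] += mass * p      # trial succeeds: count becomes k + 1
        pmf = nxt
    return pmf


# Hypothetical boundary probabilities for the four gaps between the five
# characters of "evler" (ev + ler, "houses"). These values are made up.
boundary_probs = [0.1, 0.9, 0.1, 0.1]

pmf = poisson_binomial_pmf(boundary_probs)
expected_boundaries = sum(k * mass for k, mass in enumerate(pmf))

for k, mass in enumerate(pmf):
    print(f"P({k} boundaries) = {mass:.4f}")
print(f"expected number of boundaries = {expected_boundaries:.4f}")
```

The recurrence uses only additions and multiplications of the input probabilities, so the same computation written in an automatic-differentiation framework would propagate gradients back to the boundary probabilities. That property is presumably what "differentiable" refers to, though this is an inference from the wording rather than a detail stated in the source.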