TOTEN: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
The article introduces TOTEN, a knowledge-based ontological tokenization framework for Brazilian Portuguese that improves on traditional Byte-Pair Encoding by incorporating a formal ontology of engineering entities. Its architecture combines a classification function with a family of instantiators, which together produce a structured representation of physical quantities and technical notation. In a comparison with eight state-of-the-art baselines, it achieves scores ranging from 0.775 to 0.904 on unit ontological atomicity and numerical reconstruction. For practitioners, the framework improves the semantic handling of technical content, which supports more accurate and robust AI applications in engineering and scientific domains.
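To make the two-stage idea concrete, the following is a minimal sketch of a classification function paired with a family of instantiators. It is not TOTEN's implementation: the tiny `UNITS` set stands in for the formal ontology, and the names `classify`, `INSTANTIATORS`, `Token`, and `tokenize` are hypothetical, chosen only for illustration. The sketch assumes the Brazilian decimal comma (`12,5`) and ignores thousands separators.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for the ontology: a handful of unit symbols.
UNITS = {"kN", "MPa", "mm", "kg", "m", "N"}

# Longest symbols first so that "mm" is tried before "m".
_UNIT_ALT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))
QUANTITY_RE = re.compile(rf"(\d+(?:,\d+)?)\s*({_UNIT_ALT})\b")
# Scanner: prefer a whole quantity span, otherwise any run of non-space characters.
SCANNER_RE = re.compile(rf"{QUANTITY_RE.pattern}|\S+")


@dataclass(frozen=True)
class Token:
    kind: str                      # ontology class assigned by classify()
    text: str                      # surface form, kept for round-tripping
    value: Optional[float] = None  # numeric value, only for quantities
    unit: Optional[str] = None     # unit symbol, kept as one atomic piece


def classify(span: str) -> str:
    """Classification function: map a surface span to an ontology class."""
    return "QUANTITY" if QUANTITY_RE.fullmatch(span) else "WORD"


def instantiate_quantity(span: str) -> Token:
    """Instantiator for quantities: parse the decimal comma, keep the unit whole."""
    match = QUANTITY_RE.fullmatch(span)
    value = float(match.group(1).replace(",", "."))
    return Token("QUANTITY", span, value, match.group(2))


def instantiate_word(span: str) -> Token:
    """Fallback instantiator: pass the span through unchanged."""
    return Token("WORD", span)


# Instantiator family, indexed by the class that classify() returns.
INSTANTIATORS: Dict[str, Callable[[str], Token]] = {
    "QUANTITY": instantiate_quantity,
    "WORD": instantiate_word,
}


def tokenize(text: str) -> List[Token]:
    """Scan the text, classify each span, and dispatch to its instantiator."""
    spans = (m.group(0) for m in SCANNER_RE.finditer(text))
    return [INSTANTIATORS[classify(span)](span) for span in spans]


if __name__ == "__main__":
    for token in tokenize("A viga suporta 12,5 kN e 30 MPa."):
        print(token)
```

In this sketch a unit such as `MPa` is never split into subword pieces, and the numeric value can be recovered exactly from the token, which is the intuition behind the atomicity and reconstruction properties mentioned above. A frequency-based subword tokenizer gives no such guarantee by construction.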