Research
Augmenting Molecular Language Models with Local $n$-gram Memory
The article presents MolGram, an architecture for molecular language models that adds a conditional $n$-gram memory module to address the locality gap left by character-level tokenization of SMILES strings. The module maps local string patterns to learned embeddings and injects that context into the model's hidden states. The article reports improved performance on tasks such as unconditional molecule generation and retrosynthesis, with MolGram outperforming baseline models that have three times as many parameters. For practitioners, this offers a way to incorporate local syntactic information efficiently without changing the existing tokenization scheme.
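To make the mechanism concrete, the following is a minimal sketch of an $n$-gram memory lookup, not MolGram's actual implementation. The summary does not specify how $n$-grams are indexed or how the injection is conditioned, so everything here is an assumption for illustration: the names `ngram_ids` and `NGramMemory`, the CRC32 hash into a fixed-size table, the additive injection, and the scalar `gate` standing in for whatever conditioning the real module uses. Embeddings are randomly initialized rather than learned.

```python
import random
import zlib


def ngram_ids(smiles, n=3, table_size=1024):
    """Map each character position to a bucket id for the n-gram ending there.

    Assumption: n-grams are hashed (CRC32) into a fixed-size table.
    Positions near the start use a shorter, truncated n-gram.
    """
    ids = []
    for i in range(len(smiles)):
        gram = smiles[max(0, i - n + 1): i + 1]
        ids.append(zlib.crc32(gram.encode("utf-8")) % table_size)
    return ids


class NGramMemory:
    """Hypothetical memory table: one embedding vector per hash bucket."""

    def __init__(self, table_size=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.dim = dim
        # Stand-in for learned embeddings: small random values.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(table_size)
        ]

    def inject(self, smiles, hidden, n=3, gate=1.0):
        """Add the gated n-gram embedding to each position's hidden state.

        `hidden` is a list of per-character vectors of length `dim`.
        `gate` is a fixed scalar here; a real model would likely compute it.
        """
        ids = ngram_ids(smiles, n, self.table_size)
        return [
            [h + gate * m for h, m in zip(row, self.table[idx])]
            for row, idx in zip(hidden, ids)
        ]


if __name__ == "__main__":
    smiles = "CCO"  # ethanol
    memory = NGramMemory(table_size=64, dim=4)
    hidden = [[0.0] * 4 for _ in smiles]  # placeholder hidden states
    augmented = memory.inject(smiles, hidden)
    print(len(augmented), len(augmented[0]))
```

Because the lookup is keyed on the raw character window, the base tokenizer is left untouched, which matches the summary's point about not disrupting existing tokenization. Hash collisions between distinct $n$-grams are possible in this sketch and are simply tolerated.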
molecular models, n-gram memory, SMILES