Reading NMR notes with a language model
We’ve developed a compact transformer that reads short NMR notes and returns a credible shortlist of candidate molecules.
In NMR, a sample is placed in a strong magnet and pinged with radio waves; the resulting spectra reveal how atoms are connected. Chemists condense these data into a few lines of text that we use as input.
Our model reads these notes, proposes candidate molecules as SMILES strings, ranks the candidates, and estimates whether the correct answer appears among the top results.
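A minimal sketch of this pipeline, assuming a Hugging Face-style fine-tuned DistilGPT2 checkpoint. The checkpoint path and the NMR note below are hypothetical placeholders (the thesis uses its own input serialization), and the trained confidence estimators are stood in for here by raw beam-search scores as a crude proxy:

```python
# Sketch: beam-search SMILES candidates from a fine-tuned causal LM.
from rdkit import Chem
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/finetuned-distilgpt2"  # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Illustrative NMR note; not the dataset's exact text format.
note = (
    "1H NMR: 7.26 (d, 2H), 6.91 (d, 2H), 3.80 (s, 3H), 2.29 (s, 3H); "
    "13C NMR: 157.5, 133.1, 129.8, 113.9, 55.2, 20.4"
)

inputs = tok(note, return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,   # the top-10 shortlist
    max_new_tokens=96,
    output_scores=True,
    return_dict_in_generate=True,
)

prompt_len = inputs["input_ids"].shape[1]
for seq, score in zip(out.sequences, out.sequences_scores):
    smiles = tok.decode(seq[prompt_len:], skip_special_tokens=True)
    mol = Chem.MolFromSmiles(smiles)   # drop syntactically invalid candidates
    if mol is not None:
        # Canonical form; beam score stands in for a confidence estimate.
        print(f"{score:+.3f}  {Chem.MolToSmiles(mol)}")
```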
Predictions table
PPM explorer
Random SMILES generation
Slides (2025-09-05)
Slides (2025-11-13)
Slides (2026-01-28)
Bibliography (superset)
Implementation
Zenodo
Strychnine FID audio
More…
How to cite
R. Andreev, NMR spectroscopy with language transformers, MSc thesis, ETH Zürich, 2025, https://numpde.github.io/shared/msc/
@thesis{Andreev-2025-NMRTransformers,
author = {Andreev, R.},
title = {NMR spectroscopy with language transformers},
year = {2025},
institution = {ETH Z\"urich},
type = {MSc thesis},
url = {https://numpde.github.io/shared/msc/},
abstract = {NMR spectroscopy has extensive applications in (bio-)chemistry for characterizing molecular structures. We fine-tune the DistilGPT2 language transformer model to infer molecular structure from textual annotations of multimodal NMR spectra on the ~795k-sample synthetic dataset of Alberts et al. (2024). The model is supervised on a multi-task objective that includes a vector embedding and functional group counts via auxiliary heads, as well as the next-token distribution w.r.t. all SMILES that serialize the target molecule. Inference with beam search shows ~80% top-10 accuracy on ~80% of the test set, deteriorating with molecule size. To improve interpretability, we train data-informed confidence estimators that predict with ~90% accuracy whether a match is in the top-10. Selective heteromodal input fine-tuning suggests that the HSQC modality contributes more to the average accuracy than ¹³C NMR, despite being faster to acquire experimentally.},
keywords = {NMR spectroscopy; language models; transformers; molecular structure prediction; multimodal learning; computational chemistry}
}
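The abstract above mentions supervising the next-token distribution with respect to all SMILES that serialize the target molecule (see also the "Random SMILES generation" link). A minimal RDKit sketch of sampling such alternative serializations; the molecule and sample count are purely illustrative:

```python
# Sketch: one molecule admits many valid SMILES serializations.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, for illustration
canonical = Chem.MolToSmiles(mol)

# Randomized atom orderings yield different, equally valid SMILES.
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(100)}
print("canonical:", canonical)
print(f"{len(variants)} distinct serializations sampled, e.g.:")
for s in sorted(variants)[:5]:
    print("  ", s)
```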