Reading NMR notes with a language model
We’ve developed a compact transformer that reads short NMR notes and returns a credible shortlist of candidate molecules.
In NMR, a sample is placed in a strong magnet and pinged with radio waves; the resulting spectra reveal how atoms are connected. Chemists condense these data into a few lines of text that we use as input.
Our model reads these notes, proposes candidate molecules as SMILES strings, ranks the candidates, and estimates whether the correct answer appears among the top results.
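A minimal sketch of this pipeline, assuming a Hugging Face-style fine-tuned DistilGPT2 checkpoint. The checkpoint path and the NMR note below are hypothetical placeholders (the thesis uses its own input serialization), and the trained confidence estimators are stood in for here by raw beam-search scores as a crude proxy:

```python
# Sketch: beam-search SMILES candidates from a fine-tuned causal LM.
from rdkit import Chem
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/finetuned-distilgpt2"  # hypothetical checkpoint path
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Illustrative NMR note; not the dataset's exact text format.
note = (
    "1H NMR: 7.26 (d, 2H), 6.91 (d, 2H), 3.80 (s, 3H), 2.29 (s, 3H); "
    "13C NMR: 157.5, 133.1, 129.8, 113.9, 55.2, 20.4"
)

inputs = tok(note, return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,   # the top-10 shortlist
    max_new_tokens=96,
    output_scores=True,
    return_dict_in_generate=True,
)

prompt_len = inputs["input_ids"].shape[1]
for seq, score in zip(out.sequences, out.sequences_scores):
    smiles = tok.decode(seq[prompt_len:], skip_special_tokens=True)
    mol = Chem.MolFromSmiles(smiles)   # drop syntactically invalid candidates
    if mol is not None:
        # Canonical form; beam score stands in for a confidence estimate.
        print(f"{score:+.3f}  {Chem.MolToSmiles(mol)}")
```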
Predictions table
PPM explorer
Random SMILES generation
Slides (2025-09-05)
Slides (2025-11-13)
Slides (2026-01-28)
Bibliography (superset)
Implementation
Zenodo
Strychnine FID audio
More…
How to cite
R. Andreev, NMR spectroscopy with language transformers, MSc thesis, ETH Zürich, 2025, https://numpde.github.io/shared/msc/
@thesis{Andreev-2025-NMRTransformers,
author = {Andreev, R.},
title = {NMR spectroscopy with language transformers},
year = {2025},
institution = {ETH Z\"urich},
type = {MSc thesis},
url = {https://numpde.github.io/shared/msc/},
abstract = {NMR spectroscopy has extensive applications in (bio-)chemistry for characterizing molecular structures. We fine-tune the DistilGPT2 language transformer model to infer molecular structure from textual annotations of multimodal NMR spectra on the ~795k-sample synthetic dataset of Alberts et al. (2024). The model is supervised on a multi-task objective that includes a vector embedding and functional group counts via auxiliary heads, as well as the next-token distribution w.r.t. all SMILES that serialize the target molecule. Inference with beam search shows ~80% top-10 accuracy on ~80% of the test set, deteriorating with molecule size. To improve interpretability, we train data-informed confidence estimators that predict with ~90% accuracy whether a match is in the top-10. Selective heteromodal input fine-tuning suggests that the HSQC modality contributes more to the average accuracy than ¹³C NMR, despite being faster to acquire experimentally.},
keywords = {NMR spectroscopy; language models; transformers; molecular structure prediction; multimodal learning; computational chemistry}
}
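The abstract above mentions supervising the next-token distribution with respect to all SMILES that serialize the target molecule (see also the "Random SMILES generation" link). A minimal RDKit sketch of sampling such alternative serializations; the molecule and sample count are purely illustrative:

```python
# Sketch: one molecule admits many valid SMILES serializations.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, for illustration
canonical = Chem.MolToSmiles(mol)

# Randomized atom orderings yield different, equally valid SMILES.
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(100)}
print("canonical:", canonical)
print(f"{len(variants)} distinct serializations sampled, e.g.:")
for s in sorted(variants)[:5]:
    print("  ", s)
```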