Spinhance — reading coupling constants out of low-field ¹H NMR

The problem

Cheap spectrometers see overlapping peaks. We recover the real parameters.

A 90 MHz benchtop ¹H NMR spectrum is rarely first-order: chemical-shift differences are comparable to coupling constants, so multiplets overlap and naive peak-picking fails. Spinhance trains a neural network to invert a normalized low-field spectrum back to the underlying spin system — chemical shifts δ (ppm), scalar couplings J (Hz), and proton degeneracies. Once you have those field-independent parameters, the spectrum can be re-simulated exactly at any field strength.
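To make the "rarely first-order" claim concrete, the sketch below converts a shift difference from ppm to Hz at both field strengths and compares it with a coupling constant. The 0.10 ppm gap and the 7 Hz coupling are assumed illustrative values, not numbers from the dataset, and `shift_gap_hz` is a hypothetical helper name.

```python
def shift_gap_hz(delta_ppm: float, field_mhz: float) -> float:
    """Convert a shift difference in ppm to Hz at a given 1H Larmor frequency (MHz)."""
    return delta_ppm * field_mhz


J_HZ = 7.0        # assumed vicinal coupling, for illustration only
DELTA_PPM = 0.10  # assumed shift difference between the two coupled protons

for field_mhz in (90.0, 600.0):
    gap = shift_gap_hz(DELTA_PPM, field_mhz)
    print(f"{field_mhz:.0f} MHz: gap = {gap:.1f} Hz, gap / J = {gap / J_HZ:.2f}")
```

With these assumed values the ratio is about 1.3 at 90 MHz and about 8.6 at 600 MHz. J is fixed in Hz while the shift gap scales with the field, which is why the same spin system looks strongly coupled on a benchtop instrument and nearly first-order at high field.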


The pipeline

Four stages, from molecule to model.

We scope the problem to molecules with exactly eight hard-equivalent (chemically + magnetically equivalent) spin groups, and build a fully synthetic, physically plausible training set end to end.

TASK 1 · GENERATE

Screen molecules

Filter large public databases (ChEMBL / PubChem) with RDKit, assigning chemically & magnetically equivalent proton groups, and keep molecules with exactly 8 spin groups.

TASK 2 · SPIN SYSTEM

Heuristic shift + J

Embed each SMILES in 3D, then estimate ²J/³J/⁴J couplings and shifts with Karplus and tabulated rules — a spin-system "shift+J" graph that could have come from a real molecule.
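As an illustration of the Karplus step, here is a minimal sketch of the generic three-term Karplus relation for a vicinal ³J as a function of the H–C–C–H dihedral angle. The coefficients `a`, `b`, `c` are illustrative placeholders, not the values used in this project, and `karplus_3j` is a hypothetical helper name.

```python
import math


def karplus_3j(dihedral_deg: float, a: float = 7.76, b: float = -1.10, c: float = 1.40) -> float:
    """Generic Karplus form: 3J = a*cos^2(phi) + b*cos(phi) + c, in Hz.

    The default coefficients are placeholders for illustration only.
    """
    phi = math.radians(dihedral_deg)
    return a * math.cos(phi) ** 2 + b * math.cos(phi) + c


for angle in (60.0, 90.0, 180.0):
    print(f"phi = {angle:5.1f} deg -> 3J = {karplus_3j(angle):.2f} Hz")
```

The dihedral angle would come from the embedded 3D conformer. The curve is smallest near 90° and largest for anti-periplanar protons, which is the qualitative behaviour any Karplus parameter set shares.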

TASK 3 · SIMULATE

Exact spin sim

Diagonalize the spin Hamiltonian to produce accurate spectra at 90 and 600 MHz, using a fast pure-Python engine validated against MestReNova. This animation is that engine, live.
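The smallest non-trivial case of this diagonalization is two coupled spin-1/2 nuclei (an AB system). Its 4×4 Hamiltonian splits into Mz blocks, and only the Mz = 0 block is 2×2, so it can be diagonalized in closed form. The sketch below is a textbook illustration under that assumption, not the project's engine; `ab_quartet` and the example shifts are hypothetical.

```python
import math


def ab_quartet(delta_a_ppm, delta_b_ppm, j_hz, field_mhz):
    """Return [(frequency_hz, relative_intensity), ...] for two coupled spin-1/2 nuclei."""
    nu_a = delta_a_ppm * field_mhz
    nu_b = delta_b_ppm * field_mhz
    centre = 0.5 * (nu_a + nu_b)
    j = abs(j_hz)  # AB line positions and intensities do not depend on the sign of J
    # The 2x2 Mz = 0 block has eigenvalues -J/4 +/- D/2 with D = sqrt(dnu^2 + J^2).
    d = math.hypot(nu_a - nu_b, j)
    if d == 0.0:
        raise ValueError("identical shifts and zero coupling: a single line, not a quartet")
    outer = 1.0 - j / d  # outer lines lose intensity as the system gets more strongly coupled
    inner = 1.0 + j / d  # inner lines gain it (the 'roof' effect)
    return [
        (centre + 0.5 * (d + j), outer),
        (centre + 0.5 * (d - j), inner),
        (centre - 0.5 * (d - j), inner),
        (centre - 0.5 * (d + j), outer),
    ]


low = ab_quartet(3.60, 3.70, 7.0, 90.0)    # assumed example spin system
high = ab_quartet(3.60, 3.70, 7.0, 600.0)  # same parameters at high field
for label, lines in (("90 MHz", low), ("600 MHz", high)):
    print(label, [(round(f, 2), round(i, 3)) for f, i in lines])
```

The same δ and J produce a heavily roofed quartet at 90 MHz and two nearly symmetric doublets at 600 MHz, which is the field dependence the simulator reproduces for larger systems.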

TASK 4 · MODEL

Spectrum → matrix

Train a network mapping a normalized 2¹⁴-point spectrum back to the 8×9 shift+J+degeneracy block — invariant to the arbitrary labeling of groups (S₈ permutation).

The screening dataset

The 8-spin library.

Each molecule has exactly eight magnetically distinct ¹H spin groups (A–H). Hover the table or the 3D structure to see which atoms belong to which group. Explore the dataset here →


Spin groups

Group		Equivalence	Protons

SMILES —

The representation

Molecular spin systems are labeled, undirected graphs.

The spin graph is the model's learned intermediate representation: a labeled, undirected graph that sits between the raw spectrum and the output matrix. Nodes carry a chemical shift δ (ppm) and a proton degeneracy n. Edges carry a scalar coupling J (Hz) and a soft-equivalence flag indicating chemically equivalent but magnetically inequivalent protons. The graph is stored as a symmetric 8×8 matrix with shifts on the diagonal, plus an 8×1 degeneracy vector, and is defined only up to permutation of the 8 labels (S₈). The 3D structure and spin graph panels in this section are two views of the same molecule, the one currently shown in the hero animation.
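The matrix layout described above can be sketched in a few lines. This is a minimal illustration using a made-up three-group system instead of eight, so the block is 3×4 rather than 8×9; `build_block` and the example values are assumptions, not the project's code or data.

```python
def build_block(shifts_ppm, degeneracies, couplings_hz):
    """Pack a spin graph into an n x (n + 1) block.

    Diagonal: chemical shifts (ppm). Off-diagonal: symmetric couplings J (Hz).
    Last column: proton degeneracy of each group.
    """
    n = len(shifts_ppm)
    block = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        block[i][i] = shifts_ppm[i]
        block[i][n] = float(degeneracies[i])
    for (i, j), j_hz in couplings_hz.items():
        if i == j:
            raise ValueError("a group has no coupling edge to itself")
        block[i][j] = j_hz
        block[j][i] = j_hz  # undirected edge, so the matrix stays symmetric
    return block


# Hypothetical toy system: three groups, one coupled pair.
block = build_block(
    shifts_ppm=[7.26, 3.60, 1.20],
    degeneracies=[1, 2, 3],
    couplings_hz={(1, 2): 7.0},
)
for row in block:
    print(row)
```

Relabeling the groups permutes rows and columns together without changing the molecule, which is why the representation is only defined up to permutation.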

3D structure

spin graph

Showing the molecule currently in the hero animation. Node labels: chemical shift δ (ppm) / degeneracy n. Edge thickness and labels: coupling J (Hz); only |J| > 0.5 Hz shown.

Where we are

How the model came together — architecture, data, and scale.

Each stage introduced a different innovation — first in architecture and loss, then in the dataset itself, and finally in scaling data and model capacity together. Metrics are on a leakage-controlled held-out test set the model never saw. Explore the learning curves, a full run comparison, and per-molecule test predictions ↓

Note (corrected dataset): the dataset was regenerated to give diastereotopic protons independent shifts (they were previously merged), and the whole recipe ladder is being retrained on this corrected data across all three tiers. The quantitative model metrics below are being regenerated and will update as the retrained checkpoints land — treat current numbers as pending.

Stage 1 of 4

CNN Baseline

Baseline · sessions 015–020

Dense CNN + typed heads

The first trainable model. A ResNet-1D encoder reads the 16,384-point spectrum and global-average-pools to an embedding. Four independent heads decode it: regression to shifts (ppm), regression to coupling magnitudes (Hz), binary logits for coupling presence, and classification over the degeneracy vocabulary.

Architecture: ResNet-1D → global avg pool → 4 typed heads

Permutation: Canonical sort by shift↓, degeneracy↓, |J| row-sum↓, giving a deterministic S₈ resolution

Loss: Stage 1: matrix only (Huber shift + presence BCE + deg CE). Stage 2: adds frozen surrogate spectral consistency (W₁ + cosine, weight ramped 0→0.6 over epochs 20–30)

LR schedule: WSD. Hold peak 3e−4 through ~epoch 63, then cosine decay to a raised floor (1.2e−4), not zero
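The canonical sort named in the Permutation entry can be sketched as follows. This is a minimal pure-Python illustration assuming the three sort keys apply in the order listed (shift, then degeneracy, then |J| row-sum, all descending); `canonicalize` and the toy values are hypothetical.

```python
def canonicalize(shifts, degs, j):
    """Reorder groups by shift, degeneracy, then |J| row-sum, all descending.

    `j` is a symmetric coupling matrix with zeros on the diagonal.
    """
    n = len(shifts)
    row_sum = [sum(abs(j[i][k]) for k in range(n) if k != i) for i in range(n)]
    order = sorted(range(n), key=lambda i: (-shifts[i], -degs[i], -row_sum[i]))
    new_shifts = [shifts[i] for i in order]
    new_degs = [degs[i] for i in order]
    # Rows and columns of J are permuted together so couplings follow their groups.
    new_j = [[j[a][b] for b in order] for a in order]
    return new_shifts, new_degs, new_j


# Hypothetical three-group system given in an arbitrary labeling.
shifts = [1.20, 7.26, 3.60]
degs = [3, 1, 2]
j = [[0.0, 0.0, 7.0],
     [0.0, 0.0, 0.0],
     [7.0, 0.0, 0.0]]
print(canonicalize(shifts, degs, j))
```

Any relabeling of the same system sorts to the same output as long as the keys are not exactly tied, so the training target becomes a single fixed matrix per molecule.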

Key finding: spectral loss alone diverges — the 90 MHz inverse problem is under-determined. The matrix anchor is essential; spectral consistency is a refinement signal, not an identification signal. Keeping the LR high through the spectral-learning phase (WSD schedule) improved rare-class degeneracy accuracy most.
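The two schedules mentioned for this stage, the hold-then-cosine learning rate and the spectral-loss weight ramp, can be sketched as plain functions of the epoch. The peak, floor, hold epoch, ramp window, and 0.6 ceiling come from the text above; the total epoch count, the linear ramp shape, and the absence of a warmup phase are assumptions made only for this illustration.

```python
import math

PEAK_LR = 3e-4
FLOOR_LR = 1.2e-4
HOLD_UNTIL = 63     # hold the peak LR through roughly this epoch
TOTAL_EPOCHS = 100  # assumption: the run length is not stated in the text


def lr_at(epoch: int) -> float:
    """Hold the peak LR, then cosine-decay to a raised floor instead of zero."""
    if epoch <= HOLD_UNTIL:
        return PEAK_LR
    progress = min(1.0, (epoch - HOLD_UNTIL) / (TOTAL_EPOCHS - HOLD_UNTIL))
    return FLOOR_LR + 0.5 * (PEAK_LR - FLOOR_LR) * (1.0 + math.cos(math.pi * progress))


def spectral_weight(epoch: int, start: int = 20, end: int = 30, max_weight: float = 0.6) -> float:
    """Ramp the spectral-consistency weight from 0 to max_weight (linear ramp assumed)."""
    if epoch <= start:
        return 0.0
    if epoch >= end:
        return max_weight
    return max_weight * (epoch - start) / (end - start)


for epoch in (0, 25, 30, 63, 80, 100):
    print(epoch, f"lr={lr_at(epoch):.2e}", f"w_spec={spectral_weight(epoch):.2f}")
```

In this sketch the spectral weight reaches its ceiling at epoch 30 while the learning rate is still at its peak, which matches the finding that keeping the LR high through the spectral-learning phase helped.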

0.279 ppm shift MAE (Hungarian-matched)

1.80 Hz coupling J MAE

0.807 coupling presence F1

0.732 degeneracy balanced accuracy (rare-class problem unsolved)

Infrastructure · Branches 4–5

Differentiable surrogate renderer

To add a spectral-consistency loss, the matrix→spectrum simulator must be inside the autograd graph. We ported the pyspin composite engine to PyTorch: total-spin manifold reduction + Mz block-diagonalization + Lorentzian-broadened FFT, all differentiable.

Core challenge: Exact CH₃/t-Bu degeneracies create eigenvalue gaps of ~1e−13 Hz, so the 1/(λᵢ−λⱼ) VJP term diverges → NaN

Fix: Lorentzian-regularized VJP: F_ij = ΔE/(ΔE²+ε²), ε tied to linewidth (~1 Hz). Verified vs finite differences to ~1e−6

Composite reduction: Avoids the 2^N explicit expansion; largest Mz block ≤ 256 for 91% of molecules, ≤ 1024 for 99%

Role: Frozen teacher for Stage 2 spectral-consistency loss; also used as eval metric and post-hoc probe

Retrained: Field-specialized renderers retrained on PubChem with a pseudo-Voigt lineshape (fit to a real 600 MHz instrument): a 90-MHz surrogate (500k) → cos 0.998 (up from 0.986), and a separate 2¹⁶ 600-MHz renderer (1M) → cos 0.993, for high-field visualization
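The regularized factor from the Fix entry is easy to check numerically. The sketch below evaluates F_ij = ΔE/(ΔE²+ε²) on a toy set of eigenvalues in plain Python; it only illustrates the scalar factor, not the PyTorch backward pass, and the sign convention ΔE = λ_j − λ_i plus the helper names are assumptions.

```python
def regularized_inverse_gap(delta_e_hz: float, eps_hz: float = 1.0) -> float:
    """Lorentzian-regularized stand-in for 1 / delta_e; stays finite at delta_e = 0."""
    return delta_e_hz / (delta_e_hz ** 2 + eps_hz ** 2)


def gap_matrix(eigenvalues_hz, eps_hz: float = 1.0):
    """F[i][j] for every eigenvalue pair, with a zero diagonal."""
    n = len(eigenvalues_hz)
    return [
        [
            0.0 if i == j
            else regularized_inverse_gap(eigenvalues_hz[j] - eigenvalues_hz[i], eps_hz)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two nearly degenerate levels (gap ~1e-13 Hz) plus one well-separated level.
levels = [0.0, 1e-13, 50.0]
F = gap_matrix(levels)
print("near-degenerate pair:", F[0][1], "(unregularized 1/dE would be ~1e13)")
print("separated pair:      ", F[0][2], "vs exact 1/dE =", 1.0 / 50.0)
```

For gaps much larger than ε the factor approaches 1/ΔE, so well-separated levels keep essentially exact gradients, while near-degenerate pairs contribute almost nothing instead of overflowing.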

Key finding: the composite reduction structurally avoids most exact degeneracies (never builds permutation-degenerate states). The regularized backward is cheap insurance. At 90 MHz, essentially the whole 64k dataset is Stage-2 renderable — vs 89% with explicit 2^N expansion.
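A rough way to see why the composite reduction shrinks the problem is to count the largest Mz block with and without it. The sketch below assumes every group is a set of fully equivalent spin-1/2 nuclei and looks only at each group's top total-spin manifold (S = n/2), whose blocks are at least as large as those of the lower-spin manifolds. It is a counting argument, not the pyspin implementation, and the 14-spin grouping is a made-up example.

```python
from math import comb


def explicit_largest_block(n_spins: int) -> int:
    """Largest Mz block when all 2^N product states are built explicitly."""
    return comb(n_spins, n_spins // 2)


def composite_largest_block(group_sizes) -> int:
    """Largest Mz block when each equivalent group is one composite spin S = n/2."""
    # counts[k] = number of composite product states sitting k steps above the minimum Mz
    counts = [1]
    for n in group_sizes:
        levels = n + 1  # a spin-S particle has 2S + 1 = n + 1 Mz levels
        new_counts = [0] * (len(counts) + levels - 1)
        for i, c in enumerate(counts):
            for k in range(levels):
                new_counts[i + k] += c
        counts = new_counts
    return max(counts)


groups = [9, 1, 1, 1, 1, 1]  # hypothetical 14-spin system: one t-Bu plus five single protons
print("explicit: ", explicit_largest_block(sum(groups)))
print("composite:", composite_largest_block(groups))
```

For this assumed grouping the largest block drops from 3432 to 32. With no equivalent groups at all, the two counts coincide, as expected.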

0.9999 forward correlation vs pyspin oracle (incl. 14-spin t-Bu system)

~1e−6 gradient error vs finite differences (on degenerate systems)

0.998 cosine similarity @ 90 MHz (90-MHz surrogate retrained on 500k PubChem, pseudo-Voigt lineshape)

0.993 cosine similarity @ 600 MHz (2¹⁶ 600-MHz renderer, 1M PubChem)

Architecture · the graph-output model

Spin graph decoder

The spin system is an unordered graph of 8 groups — so the model should output a graph. We replaced the dense CNN heads with a Transformer encoder–decoder: 8 learned spin-group queries attend over the spectrum encoding, then a symmetric pairwise edge head outputs couplings for every pair.

Architecture: ResNet-1D stem → ppm-positioned tokens → pre-LN Transformer encoder → 8 learned queries → Transformer decoder → node heads (δ, n) + symmetric edge head (J, presence)

Edge head: edge_ij = MLP([h_i+h_j, |h_i−h_j|]), naturally symmetric, handles all 28 upper-triangle pairs

Loss: Canonical matrix anchor (shift wt 2×) + early-ramp surrogate spectral to 0.6. Hungarian matching tried and rejected: it scrambled J coupling assignments (J 1.35→2.1 Hz, never recovered)

What didn’t work: Support-region tokenization (no gain, 2.5× slower); Hungarian loss; integration-aware aux loss (architecture solved degeneracy imbalance on its own)
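The symmetry of the edge head follows directly from its input features. The sketch below builds the [h_i+h_j, |h_i−h_j|] features in plain Python and enumerates the upper-triangle pairs; the MLP itself is omitted, and the toy embeddings and the `pair_features` name are assumptions for illustration.

```python
from itertools import combinations


def pair_features(h_i, h_j):
    """Order-invariant pair features: elementwise sum, then elementwise |difference|."""
    summed = [a + b for a, b in zip(h_i, h_j)]
    diff = [abs(a - b) for a, b in zip(h_i, h_j)]
    return summed + diff


# Hypothetical 3-dimensional node embeddings for 8 spin-group queries.
nodes = [[0.1 * k, -0.2 * k, 1.0] for k in range(8)]

pairs = list(combinations(range(8), 2))  # upper triangle, i < j
features = {(i, j): pair_features(nodes[i], nodes[j]) for i, j in pairs}
print(len(pairs), "pairs; feature width", len(features[(0, 1)]))
```

Because both feature halves are unchanged when i and j are swapped, any function applied on top of them gives edge_ij = edge_ji without needing a symmetrization step.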

Key finding: the structured decoder is the dominant win — a ~6× shift improvement over the CNN baseline. Canonical ordering is unambiguous for NMR data (distinct chemical shifts make the sort stable), so Hungarian’s extra freedom only hurts by scrambling the relational coupling structure across assignments.

0.047 ppm shift MAE (64k · held-out split; ~6× better than the CNN baseline)

1.49 Hz coupling J MAE

0.89 coupling presence F1 (+0.08 vs the CNN baseline)

0.87 degeneracy balanced accuracy (+0.14 vs the CNN baseline; rare-class problem solved by architecture)

Refinement · standard in every model

Peak channel & soft equivalence

Two targeted innovations addressing the remaining qualitative failure modes: imprecise shift localization and accidentally degenerate groups rendering as spurious doublets instead of single peaks.

Peak channel: A 2nd conv input channel computed inside forward() (local maxima above a per-sample threshold, Gaussian-smoothed) as a shift-localization prior. Zero train/serve skew

Soft equivalence: A 3rd PairwiseEdgeHead logit flags accidentally-degenerate pairs (same δ, different J). BCE vs |δ_i−δ_j|≤tol + consistency penalty pulling flagged shifts together; hard-averaged at decode

Why it matters: Without the flag, degenerate pairs drift to e.g. 3.59/3.61 ppm and render a spurious split doublet, a qualitative failure that mean shift-MAE hides

In production: Both are default-on in every model in the current fleet (64k → 3M) and compose on any checkpoint without changing the output schema
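The peak channel described above can be sketched without any framework. This pure-Python version marks local maxima above a per-sample threshold and smooths them with a Gaussian kernel. Taking the threshold as a fraction of the sample maximum, along with the specific `rel_threshold` and `sigma` values, is an assumption, and the real channel is computed on tensors inside the model's forward().

```python
import math


def peak_channel(spectrum, rel_threshold=0.05, sigma=2.0):
    """Return a smoothed peak-position channel the same length as `spectrum`."""
    n = len(spectrum)
    threshold = rel_threshold * max(spectrum)  # per-sample threshold (assumed form)
    impulses = [0.0] * n
    for i in range(1, n - 1):
        y = spectrum[i]
        if y > threshold and y > spectrum[i - 1] and y >= spectrum[i + 1]:
            impulses[i] = 1.0
    radius = int(3 * sigma)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    out = [0.0] * n
    for i, v in enumerate(impulses):
        if v == 0.0:
            continue
        for k, w in enumerate(kernel):
            pos = i + k - radius
            if 0 <= pos < n:
                out[pos] += v * w
    return out


def lorentzian(i, centre, width):
    return 1.0 / (1.0 + ((i - centre) / width) ** 2)


# Toy spectrum: two Lorentzian lines on a 200-point grid.
spectrum = [lorentzian(i, 60, 3.0) + lorentzian(i, 140, 3.0) for i in range(200)]
channel = peak_channel(spectrum)
print([i for i, v in enumerate(channel) if v > 0.99])
```

Because the channel is a deterministic function of the input spectrum, it is identical at training and inference time, which is what the "zero train/serve skew" remark refers to.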

Why it matters: these refinements target qualitative correctness — precise shift localization and correct peak multiplicity — which a mean shift-MAE can’t capture. The soft-equivalence flag pulls accidentally-degenerate groups onto a single shift so they render as one peak instead of a spurious split doublet.
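The decode-time averaging can be illustrated with a small helper that merges every set of flagged groups onto their mean shift. This is a minimal sketch assuming the flags arrive as index pairs and that flagged pairs are merged transitively; `merge_flagged_shifts` is a hypothetical name, not the project's function.

```python
def merge_flagged_shifts(shifts_ppm, flagged_pairs):
    """Replace the shifts of soft-equivalent groups with their common mean."""
    parent = list(range(len(shifts_ppm)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in flagged_pairs:
        parent[find(i)] = find(j)

    members_by_root = {}
    for i in range(len(shifts_ppm)):
        members_by_root.setdefault(find(i), []).append(i)

    merged = list(shifts_ppm)
    for members in members_by_root.values():
        mean = sum(shifts_ppm[m] for m in members) / len(members)
        for m in members:
            merged[m] = mean
    return merged


# The drifting pair from the text (3.59 / 3.61 ppm) plus one unrelated group.
merged = merge_flagged_shifts([3.59, 3.61, 1.25], flagged_pairs=[(0, 1)])
print([round(s, 3) for s in merged])
```

After merging, the two flagged groups share one shift and would render as a single peak, while unflagged groups are left untouched.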

0.047 ppm shift MAE (localization: peak channel)

0.87 degeneracy balanced accuracy (multiplicity: soft equivalence)

Default-on in every model (64k → 3M fleet)

Data · the 8-spin dataset

Clean data at scale

A model is only as good as its training set. We rebuilt the data end-to-end: every molecule reduced to exactly eight hard-equivalent ¹H groups, shifts and couplings assigned from first principles, and the values sampled rather than reused so the model learns the chemistry, not a lookup table. Explore the dataset →

Shifts: Experimentally-derived additivity constants: aromatic base + ortho/meta/para increments, Shoolery aliphatic rules, alkene gem/cis/trans, per-ring heteroaromatic bases

Couplings: Mechanism-specific J: geminal ²J, vicinal ³J, aromatic ring-position-aware ³/⁴J, olefinic cis/trans/gem, long-range / benzylic ⁴J

Dispersion: Class-aware Gaussian jitter (shift σ floor 0.15 ppm; sign-preserving ±25 Hz clamp) so values sample the real chemical space instead of fixed table constants

Topicity: Diastereotopic protons (e.g. a CH₂ next to a stereocentre) are identified by the 3D deuterium-substitution test, kept inequivalent, and given independent sampled shifts, never merged onto one value

Held-out split: A leakage-controlled global 10% test set: union-find clustering on matrix fingerprint + InChIKey groups near-duplicates, then a single seed-0 shuffle places the clusters and the last ~10% (whole clusters) is held out
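The held-out split described in the last entry can be sketched with a small union-find. The version below links records that share either a fingerprint or an InChIKey, shuffles whole clusters with seed 0, and holds out the trailing clusters. The record layout, the string fingerprints, and the exact cut-off rule are assumptions for illustration, not the project's implementation.

```python
import random


def cluster_split(records, test_fraction=0.10, seed=0):
    """Split (mol_id, fingerprint, inchikey) records into train/test by whole clusters."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Union records that share either a fingerprint or an InChIKey.
    first_seen = {}
    for idx, (_, fingerprint, inchikey) in enumerate(records):
        for tag in (("fp", fingerprint), ("key", inchikey)):
            if tag in first_seen:
                parent[find(idx)] = find(first_seen[tag])
            else:
                first_seen[tag] = idx

    clusters = {}
    for idx, record in enumerate(records):
        clusters.setdefault(find(idx), []).append(record[0])

    ordered = [clusters[root] for root in sorted(clusters)]
    random.Random(seed).shuffle(ordered)  # single seeded shuffle of whole clusters

    train, test = [], []
    train_target = (1.0 - test_fraction) * len(records)
    for members in ordered:
        # Fill train first; every remaining (trailing) cluster goes to test intact.
        (train if len(train) < train_target else test).extend(members)
    return train, test


# Hypothetical records: mol0, mol1, mol2 are chained near-duplicates.
records = [(f"mol{i}", f"fp{i}", f"key{i}") for i in range(40)]
records[1] = ("mol1", "fp0", "key1")  # shares a fingerprint with mol0
records[2] = ("mol2", "fp2", "key1")  # shares an InChIKey with mol1
train_ids, test_ids = cluster_split(records)
print(len(train_ids), "train /", len(test_ids), "test")
```

Because clusters are never divided, a molecule and its near-duplicates always land on the same side of the split, which is the property that keeps the held-out metrics free of leakage.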

Key finding: an earlier dataset reused the same handful of table values over and over, letting the model memorize them. Sampling each molecule’s shifts and couplings — and enforcing a leakage-controlled split — makes the held-out numbers honest rather than inflated by near-duplicates.
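Sampling values instead of reusing table constants can be sketched as below. The reading of "σ floor" as a minimum standard deviation and of the "sign-preserving ±25 Hz clamp" as keeping each coupling on its original side of zero with magnitude at most 25 Hz is an interpretation, and the helper names and base values are hypothetical.

```python
import random


def jitter_shift(base_ppm, sigma_ppm, rng, sigma_floor=0.15):
    """Sample a shift around its table value; sigma never drops below the floor."""
    return rng.gauss(base_ppm, max(sigma_ppm, sigma_floor))


def jitter_coupling(base_hz, sigma_hz, rng, limit_hz=25.0):
    """Sample a coupling, keeping its sign and clamping its magnitude to limit_hz."""
    value = rng.gauss(base_hz, sigma_hz)
    if base_hz >= 0.0:
        return min(max(value, 0.0), limit_hz)
    return max(min(value, 0.0), -limit_hz)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print([round(jitter_shift(3.60, 0.05, rng), 3) for _ in range(3)])
print([round(jitter_coupling(7.0, 1.0, rng), 2) for _ in range(3)])     # vicinal-like, positive
print([round(jitter_coupling(-12.0, 1.5, rng), 2) for _ in range(3)])   # geminal-like, negative
```

Each molecule then sees its own draw of shifts and couplings, so identical substructures in different molecules no longer share identical target values.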

3.13M molecules (PubChem, exactly 8 spin groups)

8 hard-equivalent ¹H groups (chemical + magnetic equivalence)

10% leakage-controlled held-out set (no model ever trains on it)

Scaling · data × capacity

Scaling data & capacity

With clean data and a graph-output architecture, the remaining lever is scale. We train the same model at three data tiers and grow its capacity to match — the guiding rule being to increase model size as long as it isn’t overfitting.

Data tiers: 64k → 500k → 3M training molecules, all evaluated on the same global held-out test set

Model size: Capacity scaled with the data: ~10M params at 64k (light), ~57M at 500k (med), ~137M at 3M (xl); same spingraph_decoder, wider + deeper

Yardstick: A standardized held-out evaluation: every model scored on identical molecules it never saw, so tiers are directly comparable

Where we are: the full 025–030 recipe ladder is being retrained across all three tiers (64k, 500k, and 3M·025/027) on the corrected dataset — the diastereotopic-shift fix changed the training targets, so the prior numbers are superseded. Quantitative results are pending; the curves, held-out metrics, and per-molecule predictions repopulate the viewers as the retrained checkpoints land.

Pending: ppm shift MAE (retraining on corrected data)

10→57→137M params per tier (64k / 500k / 3M)

3 data tiers (all retraining · results pending)

Explore the models

Interactive views into development & testing.

Training curves and the model comparison report validation-split metrics (used for model selection); the held-out test split — never seen in training or selection — is reported separately below and in the molecule explorer. New to the recipes (025–030)? Read how each model works →

Learning curves (validation)

Model comparison (validation)

Held-out test-set evaluation (test): compare across recipe (64k) or model size (026) on the held-out split

Held-out test-molecule explorer (test)

Pick a model to see its predictions on a shared set of held-out PubChem test molecules — the leakage-controlled global 10% split that no model trained on or selected against (the CNN baseline was trained on ChEMBL, so PubChem is also out-of-distribution for it). The rendered trace re-simulates the model’s predicted spin system through the exact quantum simulator — a close overlay means the recovered parameters reproduce the input spectrum. Predicted values > 0.1 ppm / 1.5 Hz off ground truth are shown in red.

The team

Built at the 2026 Scripps Hackathon.

Lucas Abounader

Shenvi Lab · simulation & model

Sam Mansfield

Seiple Lab · generation & model

Yiming Zhang

Shenvi Lab · spin-system generation


RDKit · NumPy / SciPy · PyTorch · spin-Hamiltonian simulation