Spinhance — the models, explained

How each model works.

Every Spinhance model solves the same inverse problem — read a blurry 90 MHz ¹H spectrum and recover the molecule's spin system: each proton group's chemical shift (δ), its scalar couplings (J), and how many equivalent protons it holds. They all share one neural-network backbone and differ only in two things: what extra hints we feed in, and how we shape the training loss. This page explains each, plainly.

The shared backbone — the spin-graph decoder

Every recipe below is the same network. The output isn't a picture or a number — it's a small graph (8 proton-group nodes + the couplings between them), so the network is built to emit exactly that.
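As a rough illustration of that output, the sketch below models the graph as eight node records plus a symmetric 8×8 coupling matrix. The class and field names (SpinGroup, SpinGraph, shift_ppm, n_protons, j_hz) are hypothetical, chosen for this example and not taken from the Spinhance code, and representing unused slots as zero-proton groups is likewise an assumption.

```python
from dataclasses import dataclass, field

N_GROUPS = 8  # the page states eight proton-group nodes


@dataclass
class SpinGroup:
    shift_ppm: float  # chemical shift (delta), in ppm
    n_protons: int    # degeneracy: number of equivalent protons


def _empty_couplings():
    # 8 x 8 matrix of zeros: no coupling between any pair yet
    return [[0.0] * N_GROUPS for _ in range(N_GROUPS)]


@dataclass
class SpinGraph:
    groups: list
    # j_hz[i][j] is the scalar coupling between groups i and j, in Hz
    j_hz: list = field(default_factory=_empty_couplings)

    def set_coupling(self, i, j, value_hz):
        # Couplings are symmetric, so both entries are written together.
        self.j_hz[i][j] = value_hz
        self.j_hz[j][i] = value_hz


# Example: an ethyl fragment (CH3 coupled to CH2); remaining slots left empty.
groups = [SpinGroup(1.2, 3), SpinGroup(3.6, 2)]
groups += [SpinGroup(0.0, 0) for _ in range(N_GROUPS - 2)]
graph = SpinGraph(groups)
graph.set_coupling(0, 1, 7.0)
```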

1 · READ

Convolutional stem

A 1-D CNN scans the 16,384-point spectrum and turns local stretches of it into features — the rough "shapes" present at each chemical shift.

2 · RELATE

Transformer encoder

Self-attention lets every part of the spectrum inform every other part, so a multiplet at 7 ppm can be interpreted in light of one at 2 ppm.

3 · ASK

Eight group queries

Eight learned "spin-group" queries — one per proton group — attend back into the spectrum through a Transformer decoder, each gathering the evidence for its own group.

4 · ANSWER

Node + edge heads

Per-group heads read off the shift and degeneracy; a symmetric pair-wise edge head reads off the coupling between every pair of groups (and, in some recipes, whether two groups are equivalent).
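One common way to make a pair-wise head symmetric is to score each pair in both orders and average the two results. The sketch below shows that idea with a toy bilinear score. The page does not describe the actual head, so pair_score, symmetric_edge_head and the weight matrix w are illustrative assumptions, not the Spinhance implementation.

```python
def pair_score(h_i, h_j, w):
    # Toy bilinear score h_i^T W h_j; W is not required to be symmetric.
    return sum(
        h_i[a] * w[a][b] * h_j[b]
        for a in range(len(h_i))
        for b in range(len(h_j))
    )


def symmetric_edge_head(embeddings, w):
    # Returns an n x n matrix with out[a][b] == out[b][a] and a zero diagonal.
    n = len(embeddings)
    out = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            forward = pair_score(embeddings[a], embeddings[b], w)
            backward = pair_score(embeddings[b], embeddings[a], w)
            out[a][b] = out[b][a] = 0.5 * (forward + backward)
    return out


# Three toy group embeddings and a deliberately asymmetric weight matrix.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[0.0, 2.0], [0.0, 0.0]]
couplings = symmetric_edge_head(embeddings, w)
```

Averaging the two orders guarantees the predicted coupling for (i, j) equals the one for (j, i) regardless of how the underlying score behaves.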

This set-structured design fits the answer (eight unordered groups and a symmetric coupling matrix) far better than a plain CNN with fixed output slots, which is why it beats the earlier CNN baseline several-fold on every metric. Recipes 025–030 all use this exact backbone.

The recipes

Each recipe is the backbone plus a specific, motivated change. Numbers below are held-out test metrics at the 64k scale (for shift and J error, lower is better; for F1 and degeneracy balanced accuracy, higher is better).

025

The baseline

Predict the spin-system matrix straight from the spectrum, with the chemical-shift error weighted 2× because shifts are the hardest and most valuable target. No extra inputs, no special handling — this is the control every other recipe is measured against.
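The 2× shift weighting can be pictured as a weighted sum of per-target loss terms. The sketch below is only that picture: the page states the 2× factor but not which error function each term uses, so the mean absolute error, the unit weight on the J term, and the names mean_abs_error and baseline_loss are assumptions made for illustration.

```python
def mean_abs_error(pred, target):
    # Average absolute difference between predictions and targets.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def baseline_loss(shift_pred, shift_true, j_pred, j_true, shift_weight=2.0):
    # The shift term counts twice as much as the coupling term.
    shift_term = mean_abs_error(shift_pred, shift_true)  # ppm
    j_term = mean_abs_error(j_pred, j_true)              # Hz
    return shift_weight * shift_term + j_term


# Shifts off by 0.5 ppm each, one coupling off by 1 Hz.
loss = baseline_loss([1.0, 2.0], [1.5, 2.5], [7.0], [6.0])
```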

64k held-out: shift 0.047 ppm · J 1.07 Hz · F1 0.911 · deg 0.944

026

A peak map + equivalence

= 025 + two architecture ideas

Peak channel. A second input channel — computed from the spectrum inside the model — that highlights local maxima. A "where are the peaks?" hint, so the CNN doesn't rediscover them from scratch.
Soft-equivalence. Groups that are equivalent by molecular symmetry (e.g. two methoxys on a symmetric ring, or an AA′BB′ pair) must share one shift. The model flags such pairs, is penalized when their shifts drift apart, and averages them on output — so a degenerate pair renders as one clean peak, not a fake split doublet.
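Both ideas can be sketched in a few lines. The version below is a simplified reading of the description above, with assumptions stated plainly: the peak channel is a hard 0/1 mask of strict local maxima, the equivalence flags are given as a fixed list of index pairs instead of soft model outputs, and the penalty is a squared shift difference. The function names are hypothetical.

```python
def peak_channel(spectrum):
    # 1.0 where a point is strictly higher than both neighbours, else 0.0.
    out = [0.0] * len(spectrum)
    for i in range(1, len(spectrum) - 1):
        if spectrum[i] > spectrum[i - 1] and spectrum[i] > spectrum[i + 1]:
            out[i] = 1.0
    return out


def equivalence_penalty(shifts, pairs):
    # Training-time penalty: grows as flagged pairs drift apart in shift.
    return sum((shifts[i] - shifts[j]) ** 2 for i, j in pairs)


def average_equivalent(shifts, pairs):
    # Output-time step: each flagged pair is replaced by its mean shift.
    # Assumes the pairs are disjoint (no group appears in two pairs).
    out = list(shifts)
    for i, j in pairs:
        mean = 0.5 * (out[i] + out[j])
        out[i] = out[j] = mean
    return out


mask = peak_channel([0.0, 1.0, 0.0, 2.0, 3.0, 1.0])
shifts = [3.75, 3.875, 7.25]   # groups 0 and 1 flagged as equivalent
penalty = equivalence_penalty(shifts, [(0, 1)])
merged = average_equivalent(shifts, [(0, 1)])
```

After averaging, the two flagged groups share one shift, which is what lets a degenerate pair render as a single peak.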

64k held-out: shift 0.046 ppm · J 1.06 Hz · F1 0.909 · deg 0.949

027

Focus on the hard cases

= 025 + focal loss

The data is lopsided: most groups are common (CH, CH₃) and most group-pairs aren't coupled, so a model can score well while being lazy on the rare cases. Focal loss down-weights the examples already predicted confidently, redirecting effort toward rare degeneracies (6H, 9H tert-butyls) and the sparse real couplings. Only the loss changes — same inputs as 025.
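For a binary decision such as "is this pair coupled or not", the standard focal loss scales the usual cross-entropy by (1 - p_t)^gamma, where p_t is the probability the model assigned to the true class. The sketch below implements that textbook form. The gamma value of 2.0 is the commonly used default, and whether Spinhance uses it, or applies class weights on top, is not stated on this page.

```python
import math


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    # p: predicted probability of the positive class; y: true label (0 or 1).
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    # Confident, correct predictions (p_t near 1) are scaled toward zero.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(0.9, 1)   # already predicted confidently
hard = focal_loss(0.6, 1)   # still uncertain
plain = focal_loss(0.9, 1, gamma=0.0)  # gamma = 0 recovers cross-entropy
```

With gamma = 2, the confident example keeps only 1% of its cross-entropy loss while the uncertain one keeps 16%, so the gradient budget shifts toward the cases the model has not yet learned.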

64k held-out: shift 0.037 ppm · J 0.97 Hz · F1 0.902 · deg 0.929

028

Count protons by area

= 025 + cumulative-integral channel

A bedrock NMR fact: a peak's area is proportional to the number of protons — exactly the degeneracy we want. But a local convolution sees peak shapes, not areas; it can't integrate. So we feed a second input channel — the spectrum's running integral — letting the model read relative areas directly and untangle the 2H-vs-1H confusion that shape alone can't.
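A running integral is just a cumulative sum of the spectrum, and the height of each step in it is the area of the peak underneath. The sketch below builds such a channel for a toy spectrum. Normalizing by the total area so the channel ends at 1.0 is an assumption made here for readability, and the function name is hypothetical.

```python
def cumulative_integral_channel(spectrum):
    # Running sum of intensities, normalized so the final value is 1.0.
    total = sum(spectrum)
    running = 0.0
    out = []
    for value in spectrum:
        running += value
        out.append(running / total if total else 0.0)
    return out


# Toy spectrum: a 3-unit-area peak followed by a 2-unit-area peak,
# like the CH3 and CH2 of an ethyl group.
spectrum = [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
channel = cumulative_integral_channel(spectrum)

first_step = channel[4]                  # plateau after the first peak
second_step = channel[-1] - channel[4]   # rise across the second peak
```

The two step heights come out in a 3:2 ratio, matching the proton counts, which is the information a purely local convolution over peak shapes cannot read off directly.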

64k held-out: shift 0.048 ppm · J 1.07 Hz · F1 0.907 · deg 0.941

029

Structure + focus

= 026 + 027

The peak channel and soft-equivalence of 026 together with the focal loss of 027, testing whether sharpening the hard cases stacks on top of the structural priors. It posts the best degeneracy balanced accuracy of the five recipes evaluated so far (0.955).

64k held-out: shift 0.039 ppm · J 1.01 Hz · F1 0.889 · deg 0.955

030 · super model

Everything, together

= 026 + 027 + 028

All four ideas at once: the peak channel and soft-equivalence (026), the focal loss (027), and the cumulative-integral channel (028). Each targets a different weakness — peak localization, symmetry, class imbalance, and proton-counting — so combining them should give the strongest model. Currently training at all three sizes.

64k / 500k / 3M — training now

Three sizes, three data scales

Each recipe is trained at a matched model capacity and dataset size, so we can watch accuracy improve as both grow — the same network, just wider and deeper, on more molecules.

tier	model capacity	training molecules
64k	light · ~10M params	64,000 (fast turnaround)
500k	med · ~57M params	500,000 PubChem
3M	xl · ~137M params	full 3.13M PubChem


Every model — every recipe at every size — is scored on the same leakage-controlled held-out test set (a global 10% of PubChem that no model trained on), so the comparisons are honest and directly readable.