Every molecule is reduced to exactly eight magnetically-distinct ¹H spin groups, then turned into a chemical-shift / J-coupling spin graph and a simulated 90 MHz spectrum. Below: how that dataset is built, and the distributions it produces — hover any bar for counts.
computed over a —-molecule sample of the PubChem 8-spin set (≈3.13M total)
RDKit filters ChEMBL / PubChem and assigns hard-equivalent proton groups (chemically and magnetically equivalent, via a deuterium-substitution test). Only molecules with exactly 8 distinct groups are kept.
Each group carries a degeneracy — its number of equivalent ¹H (1 for CH, 2 for CH₂, 3 for CH₃, 6 or 9 for symmetry-equivalent methyls). Out-of-vocabulary degeneracies are filtered out.
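The two filters described so far (exactly eight groups, every degeneracy in the vocabulary) can be sketched as a single predicate. The vocabulary `{1, 2, 3, 6, 9}` is an assumption taken from the examples in the text; the actual vocabulary may differ, and `keep_molecule` is a hypothetical name.

```python
# Assumed degeneracy vocabulary: the text lists 1, 2, 3, 6 and 9 as examples only.
DEGENERACY_VOCAB = frozenset({1, 2, 3, 6, 9})
N_GROUPS = 8  # molecules must have exactly eight distinct 1H groups


def keep_molecule(degeneracies):
    """Return True if a molecule passes both filters.

    `degeneracies` holds one integer per magnetically distinct 1H group.
    """
    if len(degeneracies) != N_GROUPS:
        return False
    return all(d in DEGENERACY_VOCAB for d in degeneracies)


# Made-up candidates, described by their group degeneracies only.
candidates = {
    "eight groups, all in vocabulary": [3, 2, 2, 1, 1, 1, 1, 9],
    "seven groups": [3, 2, 2, 1, 1, 1, 1],
    "eight groups, one 4H group": [3, 2, 4, 1, 1, 1, 1, 1],
}
kept = [name for name, degs in candidates.items() if keep_molecule(degs)]
```

Only the first candidate survives: the second has the wrong group count and the third contains a degeneracy outside the assumed vocabulary.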
Chemical shifts come from experimentally-derived additivity constants [1]: an aromatic base of 7.34 ppm plus ortho/meta/para substituent increments, Shoolery aliphatic rules, alkene shifts (5.25 + gem/cis/trans), and per-ring heteroaromatic bases.
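As a rough illustration of the aromatic additivity rule, the sketch below adds position-dependent increments to the 7.34 ppm base quoted above. The increment values are illustrative placeholders, not the table this dataset uses, and `aromatic_shift` is a hypothetical helper.

```python
AROMATIC_BASE_PPM = 7.34  # base value quoted in the text

# Illustrative placeholder increments in ppm, ordered (ortho, meta, para).
# These are NOT the dataset's constants; substitute the real table.
INCREMENTS = {
    "NO2": (0.95, 0.26, 0.38),
    "OCH3": (-0.48, -0.09, -0.44),
    "CH3": (-0.20, -0.12, -0.22),
}
POSITION_INDEX = {"ortho": 0, "meta": 1, "para": 2}


def aromatic_shift(substituents):
    """Estimate an aromatic 1H shift from (substituent, position) pairs.

    Each position is given relative to the proton being predicted.
    """
    shift = AROMATIC_BASE_PPM
    for name, position in substituents:
        shift += INCREMENTS[name][POSITION_INDEX[position]]
    return round(shift, 2)


# 4-nitroanisole: one proton type sits ortho to NO2 and meta to OCH3,
# the other sits ortho to OCH3 and meta to NO2.
h_near_nitro = aromatic_shift([("NO2", "ortho"), ("OCH3", "meta")])
h_near_methoxy = aromatic_shift([("OCH3", "ortho"), ("NO2", "meta")])
```

With these placeholder increments the proton next to the nitro group is predicted downfield of the one next to the methoxy group, which is the qualitative trend the additivity scheme is meant to capture.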
Scalar couplings are assigned by mechanism: geminal ²J, vicinal ³J, aromatic ring-position-aware ³/⁴J, olefinic cis/trans/geminal, and long-range / benzylic ⁴J — building the spin graph's edges.
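One possible in-memory form of such a spin graph is sketched here: nodes carry a shift and a degeneracy, and undirected edges carry a coupling constant tagged with its mechanism. The class name, field names, and the ethyl-fragment numbers are assumptions for illustration, not the project's actual data structure.

```python
from dataclasses import dataclass, field


@dataclass
class SpinGraph:
    """Nodes are 1H groups; edges are scalar couplings between groups."""
    shifts_ppm: list
    degeneracies: list
    # Maps (i, j) with i < j to (J in Hz, mechanism label).
    couplings_hz: dict = field(default_factory=dict)

    def add_coupling(self, i, j, j_hz, mechanism):
        """Store an undirected edge under a normalised (low, high) key."""
        if i == j:
            raise ValueError("a group does not couple to itself in this sketch")
        key = (min(i, j), max(i, j))
        self.couplings_hz[key] = (j_hz, mechanism)

    def neighbours(self, i):
        """Return the sorted indices of groups coupled to group i."""
        return sorted(b if a == i else a
                      for (a, b) in self.couplings_hz if i in (a, b))


# Illustrative ethyl fragment: CH3 (3H) coupled to CH2 (2H) through three bonds.
graph = SpinGraph(shifts_ppm=[1.2, 4.1], degeneracies=[3, 2])
graph.add_coupling(1, 0, 7.1, "vicinal 3J")
```

Normalising the key to `(low, high)` means `add_coupling(1, 0, ...)` and `add_coupling(0, 1, ...)` address the same edge, so each pair of groups holds one coupling.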
Rather than reusing fixed table constants, each molecule's shifts and couplings get a class-aware Gaussian jitter (shift σ floor 0.15 ppm; sign-preserving ±25 Hz coupling clamp), so values sample the real chemical space and the model can't memorize them.
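A minimal sketch of this jitter, under one reading of the text: the per-class shift σ is never allowed below 0.15 ppm, and a sampled coupling is clipped so that it keeps the sign of its table value and never exceeds 25 Hz in magnitude. The function names and the per-class σ arguments are hypothetical, and the real sampler may apply the clamp differently.

```python
import random

SHIFT_SIGMA_FLOOR_PPM = 0.15  # floor quoted in the text
COUPLING_CLAMP_HZ = 25.0      # clamp magnitude quoted in the text


def jitter_shift(shift_ppm, class_sigma_ppm, rng):
    """Sample a shift around its table value, with a floor on sigma."""
    sigma = max(class_sigma_ppm, SHIFT_SIGMA_FLOOR_PPM)
    return rng.gauss(shift_ppm, sigma)


def jitter_coupling(j_hz, class_sigma_hz, rng):
    """Sample a coupling, then clip it to the sign-preserving range.

    Assumed reading: positive J stays in [0, 25] Hz, negative J in [-25, 0] Hz.
    """
    sampled = rng.gauss(j_hz, class_sigma_hz)
    if j_hz >= 0:
        return min(max(sampled, 0.0), COUPLING_CLAMP_HZ)
    return max(min(sampled, 0.0), -COUPLING_CLAMP_HZ)


rng = random.Random(0)  # seeded so a run is reproducible
vicinal_samples = [jitter_coupling(7.0, 50.0, rng) for _ in range(1000)]
geminal_samples = [jitter_coupling(-12.0, 50.0, rng) for _ in range(1000)]
shift_samples = [jitter_shift(7.3, 0.0, rng) for _ in range(1000)]
```

The deliberately oversized σ of 50 Hz in the usage lines only exercises the clamp. Passing a class σ of zero for the shifts shows the floor at work: the samples still spread around 7.3 ppm.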
The sampled spin graph is simulated into a full second-order 90 MHz ¹H spectrum — spectrum and label stay consistent. That low-field spectrum is the sole input to the inverse model.
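To show what "second-order" means at 90 MHz, here is the textbook closed-form solution for the smallest strongly coupled case, a two-spin AB system. It is not the eight-spin simulator described above, which would have to diagonalise the full spin Hamiltonian; the function name and the example shifts are assumptions.

```python
import math

SPECTROMETER_MHZ = 90.0  # at this field 1 ppm corresponds to 90 Hz


def ab_quartet(shift_a_ppm, shift_b_ppm, j_hz):
    """Return the four (frequency in Hz, relative intensity) lines of an AB system."""
    nu_a = shift_a_ppm * SPECTROMETER_MHZ
    nu_b = shift_b_ppm * SPECTROMETER_MHZ
    centre = 0.5 * (nu_a + nu_b)
    d = math.hypot(nu_a - nu_b, j_hz)  # sqrt(delta_nu^2 + J^2)
    if d == 0.0:
        raise ValueError("identical shifts with zero coupling give a single line")
    half_d = 0.5 * d
    half_j = 0.5 * abs(j_hz)  # the AB pattern does not depend on the sign of J
    outer = 1.0 - abs(j_hz) / d  # outer lines lose intensity
    inner = 1.0 + abs(j_hz) / d  # inner lines gain it (the "roof" effect)
    return [
        (centre - half_d - half_j, outer),
        (centre - half_d + half_j, inner),
        (centre + half_d - half_j, inner),
        (centre + half_d + half_j, outer),
    ]


# Two aromatic protons 0.2 ppm apart (18 Hz at 90 MHz) with J = 8 Hz.
lines = ab_quartet(7.10, 7.30, 8.0)
```

Because the 18 Hz shift difference is comparable to the 8 Hz coupling, the inner lines come out clearly stronger than the outer ones. This kind of distortion is far more pronounced at low field, which is why a first-order multiplet picture would not match a 90 MHz spectrum.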