Spinhance — explore the 8-spin dataset

How the dataset is built

The heuristics behind every molecule: screening, equivalence, shifts, couplings, sampling and simulation.

01 · SCREEN

Exactly eight groups

RDKit screens molecules from ChEMBL / PubChem and assigns hard-equivalent proton groups: protons that are both chemically and magnetically equivalent, as judged by a deuterium-substitution test. Only molecules with exactly 8 distinct groups are kept.

02 · DEGENERACY

Equivalent protons collapse

Each group carries a degeneracy — its number of equivalent ¹H (1 for CH, 2 for CH₂, 3 for CH₃, 6 or 9 for symmetry-equivalent methyls). Out-of-vocabulary degeneracies are filtered out.
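A minimal sketch of the two filters described so far. It assumes the degeneracy vocabulary is {1, 2, 3, 6, 9} (the values named above) and that a molecule with any out-of-vocabulary group is dropped as a whole. The name `passes_screen` and the input layout are hypothetical; in the actual pipeline the groups come from RDKit.

```python
# Assumed vocabulary: the degeneracies named in the text
# (CH, CH2, CH3, and two or three symmetry-equivalent methyls).
DEGENERACY_VOCAB = frozenset({1, 2, 3, 6, 9})
N_GROUPS = 8  # the dataset keeps molecules with exactly eight proton groups


def passes_screen(degeneracies):
    """Return True if a molecule has exactly 8 groups, all in the vocabulary.

    `degeneracies` holds one integer per equivalent-proton group
    (hypothetical input layout).
    """
    if len(degeneracies) != N_GROUPS:
        return False
    return all(d in DEGENERACY_VOCAB for d in degeneracies)
```

For example, `passes_screen([3, 2, 2, 1, 1, 1, 1, 9])` accepts a molecule with eight in-vocabulary groups, while a seven-group list or a group of degeneracy 4 is rejected.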

03 · SHIFTS

Pretsch additivity

Chemical shifts come from experimentally derived additivity constants¹: an aromatic base of 7.34 ppm plus ortho/meta/para substituent increments, Shoolery aliphatic rules, alkene shifts (5.25 + gem/cis/trans), and per-ring heteroaromatic bases.
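The additivity idea can be sketched in a few lines. The two base values (7.34 and 5.25 ppm) are the ones quoted above; the increment table is an illustrative placeholder with typical textbook magnitudes, not the dataset's table, and the function names are hypothetical.

```python
AROMATIC_BASE_PPM = 7.34  # aromatic base quoted in the text
ALKENE_BASE_PPM = 5.25    # alkene base quoted in the text

# Illustrative increments in ppm: placeholders, NOT the dataset's constants.
AROMATIC_INCREMENTS = {
    "NO2": {"ortho": 0.95, "meta": 0.26, "para": 0.38},
    "OCH3": {"ortho": -0.48, "meta": -0.09, "para": -0.44},
}


def aromatic_shift(substituents):
    """Base shift plus one increment per (substituent, position) pair.

    `position` is the substituent's position relative to the proton:
    "ortho", "meta" or "para".
    """
    return AROMATIC_BASE_PPM + sum(
        AROMATIC_INCREMENTS[name][position] for name, position in substituents
    )


def alkene_shift(z_gem, z_cis, z_trans):
    """Alkene proton shift: base plus gem, cis and trans increments (ppm)."""
    return ALKENE_BASE_PPM + z_gem + z_cis + z_trans
```

With these placeholder increments, a proton ortho to NO2 and meta to OCH3 is estimated as 7.34 + 0.95 - 0.09 ppm.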

04 · COUPLINGS

Mechanism-specific J

Scalar couplings are assigned by mechanism: geminal ²J, vicinal ³J, aromatic ring-position-aware ³/⁴J, olefinic cis/trans/geminal, and long-range / benzylic ⁴J — building the spin graph's edges.
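One way to picture the resulting spin graph is a dictionary of edges keyed by group pair. The sketch below assumes each coupled pair has already been labelled with a mechanism; the J values are typical textbook magnitudes used as placeholders, not the dataset's constants, and `build_edges` is a hypothetical helper.

```python
# Illustrative couplings in Hz: placeholders, NOT the dataset's constants.
TYPICAL_J_HZ = {
    "geminal": -12.0,
    "vicinal": 7.0,
    "aromatic_ortho": 8.0,
    "aromatic_meta": 2.0,
    "olefinic_cis": 10.0,
    "olefinic_trans": 16.0,
    "benzylic": 0.7,
}


def build_edges(pairs):
    """Map (group_i, group_j, mechanism) triples to {(i, j): J_hz} with i < j."""
    edges = {}
    for i, j, mechanism in pairs:
        if i == j:
            raise ValueError("a group cannot be coupled to itself in this sketch")
        key = (min(i, j), max(i, j))  # undirected edge, stored once
        edges[key] = TYPICAL_J_HZ[mechanism]
    return edges
```

Storing each pair once with the smaller index first keeps the graph undirected, so `(1, 0)` and `(0, 1)` refer to the same coupling.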

05 · DISPERSION

Sampled, not reused

Rather than reusing fixed table constants, each molecule's shifts and couplings get a class-aware Gaussian jitter (shift σ floor 0.15 ppm; sign-preserving ±25 Hz coupling clamp), so values sample the real chemical space and the model can't memorize them.
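A minimal sketch of the jitter step, under two stated assumptions: the per-class σ is supplied by the caller and raised to the 0.15 ppm floor for shifts, and "sign-preserving ±25 Hz clamp" means a sampled coupling never crosses zero and never exceeds 25 Hz in magnitude. The function names are hypothetical.

```python
import math
import random

SHIFT_SIGMA_FLOOR_PPM = 0.15  # floor quoted in the text
J_CLAMP_HZ = 25.0             # clamp quoted in the text


def jitter_shift(shift_ppm, class_sigma_ppm, rng):
    """Sample a shift around its table value; sigma never drops below the floor."""
    sigma = max(class_sigma_ppm, SHIFT_SIGMA_FLOOR_PPM)
    return rng.gauss(shift_ppm, sigma)


def jitter_coupling(j_hz, class_sigma_hz, rng):
    """Sample a coupling, keeping its original sign and |J| <= 25 Hz.

    Assumed reading of the clamp: a draw that would flip the sign is
    pinned to zero, and a draw beyond 25 Hz is pinned to 25 Hz.
    """
    sign = math.copysign(1.0, j_hz)
    sampled = rng.gauss(j_hz, class_sigma_hz)
    magnitude = min(max(sampled * sign, 0.0), J_CLAMP_HZ)
    return sign * magnitude
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible per molecule.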

06 · SIMULATE

Second-order 90 MHz spectra

The sampled spin graph is simulated into a full second-order 90 MHz ¹H spectrum. Because the spectrum is computed from the same sampled shifts and couplings that form the label, spectrum and label stay consistent. That low-field spectrum is the sole input to the inverse model.
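To show why second-order effects matter at 90 MHz, here is the two-spin (AB) special case, where the second-order solution has a closed form. This is only an illustration: an 8-group system needs a full spin-Hamiltonian diagonalization, which this sketch does not attempt. The function name is hypothetical.

```python
import math

SPECTROMETER_MHZ = 90.0  # at 90 MHz, 1 ppm corresponds to 90 Hz


def ab_quartet(shift_a_ppm, shift_b_ppm, j_hz, mhz=SPECTROMETER_MHZ):
    """Return sorted (frequency_hz, relative_intensity) lines for an AB pair."""
    nu_a = shift_a_ppm * mhz
    nu_b = shift_b_ppm * mhz
    center = 0.5 * (nu_a + nu_b)
    d = math.hypot(nu_a - nu_b, j_hz)  # sqrt(delta_nu**2 + J**2)
    if d == 0.0:
        # Identical shifts and no coupling: all intensity in a single line.
        return [(center, 4.0)]
    ratio = j_hz / d
    lines = [
        (center - 0.5 * (d + j_hz), 1.0 - ratio),
        (center - 0.5 * (d - j_hz), 1.0 + ratio),
        (center + 0.5 * (d - j_hz), 1.0 + ratio),
        (center + 0.5 * (d + j_hz), 1.0 - ratio),
    ]
    return sorted(lines)
```

For shifts of 1.0 and 1.1 ppm with J = 12 Hz, the shift difference is only 9 Hz, so the inner lines (near 93 and 96 Hz) carry nine times the intensity of the outer lines (near 81 and 108 Hz). A first-order treatment would predict four equal lines.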
