Scaffold hopping — replacing a problematic core structure with a new one that maintains binding but improves properties — used to be a manual exercise. A medicinal chemist would look at a crystal structure, identify the pharmacophore features responsible for binding, and mentally enumerate which alternative scaffolds could satisfy those features with different peripheral chemistry. That process was limited by the chemist's mental library and took weeks. Generative models have changed the throughput and the scope. They have not changed the fundamental problem: proposing a scaffold that satisfies pharmacophore geometry is much easier than proposing one that actually binds. This post is about the gap between those two things — what we call the hallucination problem in generative scaffold hopping — and how we've learned to identify which generated proposals are real and which are geometric coincidences.
What Generative Scaffold Hopping Actually Does
The generative approach to scaffold hopping decomposes the problem into two steps. First, a model learns to represent the pharmacophore — the spatial arrangement of hydrogen bond donors, acceptors, hydrophobic regions, and charged groups that characterizes the binding interaction. Second, a generative model (typically a VAE, diffusion model, or fragment-based graph model) proposes novel scaffolds that satisfy the pharmacophore constraint while differing maximally in 2D structural space from the starting scaffold. The pharmacophore constraint is the key claim: the model proposes structures that look different but preserve the 3D features that drive binding. In practice, the constraint is softer than this description implies. Pharmacophore matching in the generative model is based on learned representations, not on explicit docking into the protein structure. The model has internalized patterns from training data — it "knows" that certain pharmacophore patterns correlate with binding — but that knowledge is statistical, not mechanistic. When you ask it to hop to a scaffold far from training data, it proposes structures that pattern-match to binding without verifying that the proposed scaffold can actually adopt the required conformation in the binding site.
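To make the soft-constraint point concrete, here is a minimal sketch of what an explicit pharmacophore check amounts to when reduced to geometry alone. The feature labels, coordinates, and the 1.0 Å tolerance are illustrative assumptions, not values from our pipeline, and the generative model itself uses a learned representation rather than anything this explicit.

```python
import math
from itertools import combinations

# Hypothetical reference pharmacophore: feature label -> (x, y, z) in angstroms.
# Coordinates are illustrative only.
REFERENCE = {
    "donor": (0.0, 0.0, 0.0),
    "acceptor": (4.0, 0.0, 0.0),
    "hydrophobe": (2.0, 3.0, 0.0),
}


def pairwise_distances(features):
    """Return {(label_a, label_b): distance} for every pair of features."""
    return {
        (a, b): math.dist(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }


def matches_pharmacophore(candidate, reference=REFERENCE, tol=1.0):
    """True if every inter-feature distance agrees with the reference within tol.

    This is a purely geometric test in free space. It knows nothing about the
    protein, conformational energy, or tautomer populations.
    """
    if set(candidate) != set(reference):
        return False
    ref_d = pairwise_distances(reference)
    cand_d = pairwise_distances(candidate)
    return all(abs(cand_d[pair] - ref_d[pair]) <= tol for pair in ref_d)


# A rigidly translated copy of the reference still matches.
shifted = {k: (x + 10.0, y - 5.0, z + 2.0) for k, (x, y, z) in REFERENCE.items()}
print(matches_pharmacophore(shifted))
```

A candidate can pass this kind of check and still fail in the pocket, because nothing here asks whether the matching conformation is energetically accessible or whether the rest of the molecule fits.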
The Hallucination Pattern: Geometric Satisfaction Without Physical Realizability
The characteristic failure mode is what we call geometric satisfaction without physical realizability. The generative model proposes a scaffold whose pharmacophore features align correctly in the lowest-energy conformation of the free molecule. But when you attempt to dock the structure into the protein, one of three things goes wrong. First, the conformation required to maintain pharmacophore alignment inside the binding site is not the low-energy conformation of the free molecule, and the conformational energy penalty for adopting the binding conformation is large enough to cancel the binding free energy gain. Second, the new scaffold introduces a steric clash that the original scaffold did not have, because the peripheral atoms sit in different positions even when the pharmacophore features align. Third, the new scaffold implicitly assumes a tautomeric or protonation state that is not the dominant one under physiological conditions, so the pharmacophore features exist only in a minor species. We see all three failure modes regularly. The conformational energy penalty is the most common, especially for generative models that propose bicyclic and tricyclic systems, which frequently satisfy the pharmacophore only in a strained conformation.
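A rough way to see why the conformational penalty matters is to convert strain energy into a loss of affinity. The sketch below assumes the strain is paid in full on binding and that nothing else changes, which is a simplification, and uses the standard relation between free energy and equilibrium constant at 298.15 K. The function name is ours, for illustration only.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def affinity_fold_loss(strain_kcal, temperature_k=298.15):
    """Factor by which the dissociation constant worsens if strain_kcal
    of conformational strain is paid in full on binding."""
    return math.exp(strain_kcal / (R_KCAL * temperature_k))


# Illustrative strain values in kcal/mol.
for strain in (0.5, 1.0, 2.0, 4.0):
    fold = affinity_fold_loss(strain)
    print(f"{strain:.1f} kcal/mol strain -> about {fold:.0f}x weaker binding")
```

Under these assumptions, 2 kcal/mol of strain costs roughly a 30-fold loss in affinity, which is enough to turn a promising hop into an inactive one.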
Detection: The Filter Stack We Apply
We apply a four-stage filter to generative scaffold hop proposals before any synthesis prioritization. Stage one is conformational profile analysis: for each proposed scaffold, we enumerate conformational minima and check whether the pharmacophore-satisfying conformation is within 2 kcal/mol of the global minimum. This is a hard filter: structures that require more than 2 kcal/mol of conformational strain to reach the binding conformation are rejected regardless of pharmacophore satisfaction. Stage two is explicit re-docking: every proposal that passes stage one is docked into the same protein structure used to characterize the original hit, with the pharmacophore constraint removed. The docking pose is evaluated both by score and by pose stability, which we assess by running short MD (10 ns) and measuring RMSD drift. Structures that drift more than 2 Å from the initial docked pose within those 10 ns are flagged as potentially unstable binders. Stage three is matched molecular pair analysis against the training set: if the scaffold hop is of a type that has been attempted experimentally and failed in structurally analogous cases, we flag it. Stage four is synthetic accessibility and tautomer assessment. After these four stages, roughly 15–20% of generative proposals survive, which is still a large absolute number given generation rates. But the precision of the surviving set, meaning the fraction that shows activity in validation, is substantially higher than that of the unfiltered proposals.
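The decision logic of stages one and two can be sketched in a few lines. This is not our production code: the `Conformer` record and the function names are hypothetical, conformer energies and MD frames are assumed to come from whatever tools you already use, frames are assumed to be pre-aligned on the protein, and treating "drift" as the maximum RMSD over the trajectory is our assumption for the sketch. Only the two thresholds come from the text.

```python
import math
from dataclasses import dataclass

STRAIN_CUTOFF_KCAL = 2.0     # stage-one hard filter (from the text)
DRIFT_CUTOFF_ANGSTROM = 2.0  # stage-two flag threshold (from the text)


@dataclass
class Conformer:
    energy_kcal: float             # same energy reference for all conformers
    satisfies_pharmacophore: bool  # result of an upstream geometric check


def strain_to_bind(conformers):
    """Gap between the best pharmacophore-satisfying conformer and the
    global minimum, or None if no conformer satisfies the pharmacophore."""
    matching = [c.energy_kcal for c in conformers if c.satisfies_pharmacophore]
    if not matching:
        return None
    global_min = min(c.energy_kcal for c in conformers)
    return min(matching) - global_min


def passes_strain_filter(conformers, cutoff=STRAIN_CUTOFF_KCAL):
    """Stage one: reject if reaching the binding conformation costs too much."""
    strain = strain_to_bind(conformers)
    return strain is not None and strain <= cutoff


def rmsd(coords_a, coords_b):
    """RMSD between two equally ordered coordinate lists, without refitting."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def is_pose_unstable(trajectory, cutoff=DRIFT_CUTOFF_ANGSTROM):
    """Stage two: flag the pose if any frame drifts past the cutoff relative
    to the first frame, which is taken to be the docked pose."""
    docked = trajectory[0]
    return any(rmsd(docked, frame) > cutoff for frame in trajectory[1:])


# Toy example: the matching conformer sits 1.5 kcal/mol above the minimum.
example = [Conformer(0.0, False), Conformer(1.5, True), Conformer(3.0, True)]
print(passes_strain_filter(example))
```

Keeping each stage as a small predicate makes it easy to log which stage rejected a proposal, which matters when you later want to know where the hallucinations are coming from.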
When It Actually Works: The Right Problem Class
Generative scaffold hopping performs well on a specific and well-defined problem class: bioisosteric replacements of privileged substructures with established alternatives. When a program has a metabolically labile amide and we ask the generative model to propose tetrazole bioisosteres, it does so reliably and the proposals have high experimental validation rates. When a program has a hERG-active basic amine and we ask for alternatives that reduce pKa while maintaining H-bond donor capacity, the model generates structurally diverse but pharmacologically sensible proposals. These cases work because the training data is dense in this region of bioisostere space and the pharmacophore satisfaction problem is relatively constrained. The failure modes concentrate in two situations. The first is when the hop is large, replacing an entire ring system rather than a substituent group. Large scaffold hops change more of the molecule at once, which gives the conformational strain problem more room to appear. The second is when the target protein has unusual binding site geometry, such as a narrow pocket where the pharmacophore features can only be satisfied in one precise spatial arrangement. In these cases, the generative model proposes structures that satisfy the pharmacophore in open space but cannot fit the tight binding site without clashes.
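The basic amine case is easy to quantify with the Henderson-Hasselbalch relation. The sketch below assumes a single basic site and ideal behavior, and the pKa values are illustrative rather than measurements from any program.

```python
def fraction_protonated(pka, ph=7.4):
    """Fraction of a monobasic amine present in its protonated (charged)
    form at the given pH, from the Henderson-Hasselbalch relation."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Illustrative pKa values only.
for pka in (9.4, 8.4, 7.4, 6.4):
    share = fraction_protonated(pka)
    print(f"pKa {pka:.1f}: {share:.1%} protonated at pH 7.4")
```

Lowering pKa from 9.4 to 7.4 takes the charged fraction from about 99% to 50%. The same arithmetic is a quick sanity check on any proposal whose pharmacophore features depend on one particular protonation state.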
Integrating Generative Hopping with Structure-Based Methods
The most effective workflow we've found couples the generative model with explicit structure-based refinement in a tight loop. The generative model proposes a diverse set of hop candidates, the structure-based filter (docking plus short MD) eliminates geometric hallucinations, and then the surviving candidates feed back into the generative model as positive examples to bias the next round of proposals. This loop runs three to four times before proposals stabilize. The practical effect is that the first-round proposals are diverse but noisy, and by round four the model is proposing structures that look chemically sensible to a medicinal chemist rather than structurally exotic but conformationally implausible. We have not solved the hallucination problem — we've built a filter that catches most hallucinations before they reach synthesis. That's not the same thing. The right long-term solution is generative models that incorporate explicit binding pocket geometry as a hard constraint during generation rather than as a post-hoc filter. We're seeing early promising results with pocket-conditioned diffusion approaches, but they are not yet robust enough to replace the filter stack for production use on our program timelines.
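The control flow of the loop is simple enough to sketch. Everything here is a placeholder: `generate` and `passes_filters` stand in for the generative model and the structure-based filter, their signatures are our invention for illustration, and a fixed round count replaces whatever stabilization criterion you prefer.

```python
def refine(generate, passes_filters, rounds=4, batch_size=100):
    """Run a propose, filter, feed-back loop for a fixed number of rounds.

    generate(batch_size, positives) -> list of proposals (hypothetical signature)
    passes_filters(proposal) -> bool (hypothetical signature)
    Returns the accumulated survivors and a per-round summary of
    (round number, proposals generated, proposals surviving).
    """
    positives = []
    history = []
    for round_index in range(1, rounds + 1):
        proposals = generate(batch_size, positives)
        survivors = [p for p in proposals if passes_filters(p)]
        history.append((round_index, len(proposals), len(survivors)))
        # Feed survivors back as positive examples, skipping duplicates.
        for survivor in survivors:
            if survivor not in positives:
                positives.append(survivor)
    return positives, history


# Toy stand-ins so the sketch runs: integers play the role of molecules.
def toy_generate(n, positives):
    start = len(positives)
    return list(range(start, start + n))


def toy_filter(proposal):
    return proposal % 5 == 0


found, summary = refine(toy_generate, toy_filter, rounds=3, batch_size=10)
print(found)
print(summary)
```

Tracking the survival fraction per round is a cheap way to see when proposals have stabilized. One caveat of this design is that the feedback rewards whatever the filter accepts, so any blind spot in the filter gets amplified rather than corrected.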
What This Means for IP Strategy
One underappreciated consequence of high-throughput generative scaffold hopping is its impact on freedom-to-operate analysis. A generative model can enumerate a large fraction of the plausible scaffold space around a pharmacophore in days rather than years. That means competitor IP that would take a traditional medicinal chemistry campaign years to map can be surveyed computationally early in the program. We now run defensive scaffold enumeration as a standard phase at program initiation, not to find the best scaffold, but to identify whether the pharmacophore space is densely patented and whether there is structural room to maneuver. This doesn't change the chemistry, but it changes the prioritization: programs where the pharmacophore space is patent-crowded are triaged differently from ones where multiple hops lead into clear IP space. The hallucination problem affects this use case too. We apply a lighter version of the filter stack for IP scouting, focused on chemical plausibility rather than binding geometry, which lets us run at higher throughput at the cost of some false positives in the IP map.