Why 95% of Virtual Screening Hits Fail When They Reach the Wet Lab

The 95% failure rate for virtual screening hits is cited so often it has become background noise. We cite it ourselves when explaining to collaborators why our workflow departs from standard SBDD practice. But the number deserves more careful dissection, because it obscures the mechanism — and if you don't understand the mechanism, you can't fix it.

This is not a wet lab problem. Biochemists running SPR and thermal shift assays are not the bottleneck. The failure happens upstream, in the scoring function, and understanding exactly where requires walking through the physics those functions approximate — and where those approximations collapse.

What a Docking Score Actually Measures

Classical docking scores estimate the free energy of binding through an empirical sum of terms: van der Waals interactions, hydrogen bond geometry, hydrophobic contact area, entropy penalties for conformational restriction, desolvation. AutoDock Vina's scoring function, for instance, has weights for these components that were tuned against PDBbind data. Glide XP adds explicit solvation terms and a strain energy correction. ChemScore, one of the scoring functions available in GOLD, includes a metal-coordination term.
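To make the "empirical sum of terms" concrete, here is a minimal sketch of a linear scoring function. The term names, units, and weights are all hypothetical and chosen only for illustration; this is not the functional form of Vina, Glide XP, or ChemScore.

```python
from dataclasses import dataclass


@dataclass
class PoseTerms:
    """Raw interaction terms for one docked pose (hypothetical names and units)."""
    vdw: float           # steric / van der Waals contact term
    hbond: float         # hydrogen-bond geometry term
    hydrophobic: float   # hydrophobic contact area term
    desolvation: float   # desolvation penalty term
    n_rot_bonds: int     # rotatable-bond count, a crude conformational-entropy proxy


# Illustrative weights only. A real scoring function fits these by regression
# against complexes with measured affinities; these values are made up.
WEIGHTS = {
    "vdw": -0.04,
    "hbond": -0.60,
    "hydrophobic": -0.03,
    "desolvation": 0.50,
    "n_rot_bonds": 0.06,
}


def empirical_score(terms: PoseTerms) -> float:
    """Weighted sum of terms; more negative means predicted tighter binding."""
    return (
        WEIGHTS["vdw"] * terms.vdw
        + WEIGHTS["hbond"] * terms.hbond
        + WEIGHTS["hydrophobic"] * terms.hydrophobic
        + WEIGHTS["desolvation"] * terms.desolvation
        + WEIGHTS["n_rot_bonds"] * terms.n_rot_bonds
    )
```

Nothing in a sum like this can represent receptor strain, displaced waters, or assay interference unless a term for it is added and fitted, which is the structural limitation the rest of this piece is about.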

The problem is not that these terms are conceptually wrong. It's that they're calibrated on crystal structures of known binders at physiological conditions, then applied to virtual screening libraries under assumptions that frequently don't hold:

The receptor is treated as rigid, or at best allows modest side-chain flexibility via rotamer sampling. Induced fit — the conformational reorganization that happens when a novel chemotype enters a binding site — is minimized or ignored.
Explicit water molecules in the binding site are either removed entirely or treated with generic desolvation penalties, not computed from free energy of displacement.
Conformational entropy of the ligand is estimated from a rotatable-bond count proxy, not from the actual energy landscape.
The scoring function's training distribution is PDBbind, which is heavily biased toward druglike compounds with known binding affinities. Novel chemotypes that fall outside this training manifold receive poorly calibrated scores.
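One rough way to flag compounds that fall outside the training manifold is a nearest-neighbor similarity check against the training ligands. The sketch below assumes each molecule has already been reduced to a set of "on" fingerprint bits by some cheminformatics toolkit; the 0.35 threshold is an arbitrary illustrative value, not a recommended cutoff.

```python
from typing import FrozenSet, Iterable


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_similarity_to_training(
    query: FrozenSet[int], training: Iterable[FrozenSet[int]]
) -> float:
    """Similarity of the query to its nearest neighbor in the training set."""
    return max((tanimoto(query, t) for t in training), default=0.0)


def outside_training_domain(
    query: FrozenSet[int],
    training: Iterable[FrozenSet[int]],
    threshold: float = 0.35,  # hypothetical cutoff, tune per fingerprint type
) -> bool:
    """True if no training ligand is at least `threshold` similar to the query."""
    return max_similarity_to_training(query, training) < threshold
```

A compound flagged this way is not necessarily a non-binder; the flag only says its score comes from extrapolation rather than interpolation.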

A compound can achieve a favorable docking score by placing a hydrogen bond donor in a geometrically plausible position, but if reaching that position requires a receptor helix to shift by 1.5 Å, a rigid-receptor score never charges the energetic cost of that shift, and the compound may not bind at all. The score cannot tell the difference.

The PDBbind Coverage Problem

PDBbind 2020 contains approximately 19,000 protein-ligand complexes with measured binding affinities. On its face, that sounds like a rich training set. In practice, the coverage per target family is uneven enough that scoring functions trained on it generalize poorly across chemical space.

Kinase inhibitors dominate the dataset — roughly 35% of entries are from the kinase family. GPCRs, ion channels, and protein-protein interaction targets are dramatically underrepresented. A scoring function trained on PDBbind will, by construction, perform best on kinase-like binding modes (deep hydrophobic pocket, two hinge hydrogen bonds, solvent-exposed tail) and worst on the targets that are hardest to drug — precisely the targets where virtual screening adds the most potential value.

We audited our own historical docking campaigns against targets where we later obtained experimental data. The Pearson r between Glide XP docking score and measured K_d was 0.31 across 84 compounds on a bromodomain target. On a Hsp90 N-terminal domain campaign it was 0.44. These numbers are consistent with what the literature reports: docking scores are weak predictors of absolute affinity, with typical Pearson r in the 0.30-0.50 range against experimental data, compared to Pearson r of 0.75-0.85 for well-trained machine-learning scoring functions on held-out test sets.
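For readers who want to run the same kind of audit, a minimal sketch of the correlation calculation follows. It assumes affinities are compared on a log scale (pK_d) and that the docking score is more negative for better poses, so a useful score gives a negative r whose magnitude is what gets reported; both conventions are assumptions here, not a description of how the numbers above were computed.

```python
import math
from typing import Sequence


def kd_to_pkd(kd_molar: float) -> float:
    """Convert a dissociation constant in mol/L to pK_d = -log10(K_d)."""
    if kd_molar <= 0:
        raise ValueError("K_d must be positive")
    return -math.log10(kd_molar)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Usage would look like `abs(pearson_r(docking_scores, [kd_to_pkd(k) for k in kds]))`, where `docking_scores` and `kds` are your own per-compound lists.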

Three Specific Failure Modes

1. False Positives From Promiscuous Scaffolds

PAINS (pan-assay interference compounds) are well documented, but docking pipelines rarely filter them aggressively, and their structural features can produce excellent docking scores precisely because they interact with many protein environments nonspecifically. Rhodanines, catechols, quinones, and Michael acceptors score well in docking because their electrophilic or chelating character satisfies scoring terms that proxy for specific interactions.

A PAINS filter applied post-docking catches the obvious cases, but the underlying problem is that the scoring function has no term for assay interference chemistry. It's architecturally blind to whether the compound is binding specifically in the pose or reacting with the protein or aggregating in solution.

2. Protonation State Sensitivity

Binding affinity can shift by 2-3 log units depending on the protonation state of a titratable group at the active site. Docking workflows typically assign a single protonation state per group during structure preparation, choosing the most probable state at pH 7.4 with Epik for ligands and PROPKA or an equivalent for receptor residues. But the microenvironment of a buried binding site is not bulk solvent at pH 7.4. Histidine tautomer and protonation states, buried aspartate residues, and active-site lysines can all have pKa values shifted well away from their solution values, and empirical predictors estimate those shifts imperfectly.
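The sensitivity to pKa shifts follows directly from the Henderson-Hasselbalch relation. The sketch below computes the protonated fraction of a single titratable group at a given pH; it treats the group as an isolated one-site acid, which is a simplification for coupled sites in a real binding pocket.

```python
def fraction_protonated(pka: float, ph: float = 7.4) -> float:
    """Fraction of a single titratable site that is protonated at the given pH.

    Uses Henderson-Hasselbalch: f = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))
```

For a carboxylate with a solution-like pKa near 4, `fraction_protonated(4.0)` is well under 0.1%, so the charged form is the obvious choice. If burial shifts the pKa to a hypothetical 7.0, `fraction_protonated(7.0)` is close to 28%, and picking one state for scoring discards a substantial population.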

This is not a small effect. In our kinase work, we routinely find that for 5-10% of top-ranked compounds, the pose that drove the favorable score depends on a tautomer or protonation state that does not dominate in the binding site. When the correct state is used, the pose destabilizes by 3-5 kcal/mol.
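The log-unit and kcal/mol figures can be related through the standard expression ΔΔG = RT ln(10) × Δlog K. A small sketch of the conversion, assuming 298.15 K:

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def log_units_to_kcal(delta_log_k: float, temperature_k: float = 298.15) -> float:
    """Free-energy difference (kcal/mol) for a change of delta_log_k log10 units in K."""
    return R_KCAL_PER_MOL_K * temperature_k * math.log(10.0) * delta_log_k
```

At room temperature one log unit is about 1.36 kcal/mol, so a 2-3 log unit shift in affinity corresponds to roughly 2.7-4.1 kcal/mol, the same order as the pose destabilization quoted above.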

3. Receptor Flexibility at the Scoring Step

Induced-fit docking (IFD) partially addresses receptor rigidity by allowing side-chain and loop repacking. But the IFD protocol requires a starting structure that is close to the true binding mode — it refines, it doesn't predict large conformational changes from scratch. When a novel scaffold induces a DFG-flip in a kinase, or opens a cryptic allosteric pocket, IFD will miss the relevant binding mode because it's not sampling that conformational state.

For GPCRs and other highly flexible targets, this problem compounds: the experimentally determined apo structure may differ from the holo structure by several angstroms of backbone displacement, and the holo structure itself varies by compound class. Docking into the apo structure effectively asks the scoring function to score against the wrong receptor geometry for any ligand that isn't closely similar to the co-crystallized compound.

Why Triage Matters More Than Ranking

One implication of poor Pearson r between docking score and K_d is that absolute ranking is unreliable. A compound ranked 1st vs. 50th in a virtual screen of 10,000 molecules is not meaningfully better in binding affinity terms — within a rank-ordered list, the score variance from noise frequently exceeds the signal.

What docking scores are actually useful for is coarse triage: removing the bottom 90% of a library that is clearly geometrically incompatible with the binding site. The top 10% of a docked library has an enriched probability of including true binders, even if the exact rank order within that 10% is noisy. Treating the top-ranked compound as the most promising lead, rather than treating the top decile as a pool for further analysis, is the operational mistake that inflates failure rates.
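Triage quality is usually summarized by an enrichment factor rather than a correlation coefficient. The sketch below selects the best-scoring fraction of a library and computes how much more often known actives appear in that pool than in the library as a whole. It assumes lower scores are better and that a set of experimentally confirmed actives is available for a retrospective check; the function names are illustrative.

```python
import math
from typing import Dict, List, Set


def top_fraction(scores: Dict[str, float], fraction: float = 0.10) -> List[str]:
    """Return compound IDs in the best-scoring fraction (lower score = better)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, key=scores.get)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]


def enrichment_factor(selected: List[str], actives: Set[str], library_size: int) -> float:
    """Hit rate in the selected pool divided by the hit rate in the whole library."""
    if not selected or not actives or library_size <= 0:
        raise ValueError("need a non-empty pool, at least one active, and a library size")
    hit_rate_pool = sum(1 for c in selected if c in actives) / len(selected)
    hit_rate_library = len(actives) / library_size
    return hit_rate_pool / hit_rate_library
```

An enrichment factor well above 1 for the top decile says the pool is worth carrying forward, and it says nothing about whether rank 1 beats rank 50 inside that pool.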

This has direct consequences for how we structure our campaigns. We run primary docking to identify the top-decile pool, then apply a second-pass GNN-based rescoring — trained on our own experimental data rather than PDBbind — to re-rank within that pool. The GNN scorer uses 3D molecular fingerprints from the docked pose, not just 2D SMILES, which means it captures the binding geometry as an input feature. That combination reduces our false positive rate substantially compared to using docking score alone as the primary selection criterion.
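The control flow of that two-stage scheme can be sketched as follows. The `rescorer` callable is a stand-in for any second-pass model (it is hypothetical here and is not our GNN); the sketch assumes it returns a predicted pK_d where higher is better, while docking scores follow the lower-is-better convention.

```python
from typing import Callable, Dict, List


def two_stage_rank(
    docking_scores: Dict[str, float],
    rescorer: Callable[[str], float],
    pool_fraction: float = 0.10,
) -> List[str]:
    """Triage by docking score, then re-rank only the surviving pool."""
    if not 0.0 < pool_fraction <= 1.0:
        raise ValueError("pool_fraction must be in (0, 1]")
    # Stage 1: coarse triage. Keep the best-scoring fraction (lower = better).
    ranked = sorted(docking_scores, key=docking_scores.get)
    pool = ranked[: max(1, int(len(ranked) * pool_fraction))]
    # Stage 2: re-rank within the pool by the second-pass prediction
    # (higher predicted pK_d = better). Compounds outside the pool are never rescored.
    return sorted(pool, key=rescorer, reverse=True)
```

The design choice worth noting is that the docking score is used only as a gate: once a compound is in the pool, its docking rank no longer influences its final position.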

The Role of Experimental Feedback

No computational method — docking, GNN rescoring, FEP — improves in isolation. The practical reality of our workflows is that wet lab feedback on even 20-30 compounds from a campaign provides enough signal to recalibrate the scoring model for that target class. We're not saying wet lab synthesis is unnecessary: we're saying the minimum synthesis required before a scoring model becomes trustworthy for a given target is smaller than most discovery organizations assume.

Specifically, we find that 25-30 confirmed binders with measured K_d values on a target, covering three or more chemotypes, are sufficient to fine-tune our GNN rescoring model to achieve Pearson r of 0.68-0.72 on held-out compounds for that target. That's not FEP-level accuracy, but it's accurate enough to distinguish lead-quality hits from noise — which is the actual job of a virtual screening triage step.
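How the held-out set is constructed matters when a dataset spans only a few chemotypes. One conservative option, shown here as an assumption rather than a description of our exact protocol, is to hold out one chemotype at a time so that test compounds never share a scaffold class with training compounds. Chemotype labels are assumed to be assigned beforehand.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def leave_one_chemotype_out(
    chemotype_by_compound: Dict[str, str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_chemotype, train_ids, test_ids) for each chemotype in turn."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for compound_id, chemotype in chemotype_by_compound.items():
        groups[chemotype].append(compound_id)
    if len(groups) < 2:
        raise ValueError("need at least two chemotypes to hold one out")
    for held_out in sorted(groups):
        test_ids = sorted(groups[held_out])
        train_ids = sorted(
            cid for chemotype, members in groups.items()
            if chemotype != held_out
            for cid in members
        )
        yield held_out, train_ids, test_ids
```

Correlations measured this way are typically lower than those from a random split, because the model cannot rely on close analogs of the test compounds.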

The 95% failure rate is not an inherent property of virtual screening. It is the consequence of using a scoring function that is reliable only for coarse triage as if it could rank compounds precisely, applying it to chemical space that sits outside its training distribution, and running it on targets that require receptor flexibility it cannot model. Fixing the failure rate means understanding these specific limitations and designing workflows that work around them. Abandoning computational triage would be an overcorrection in the other direction.