Selectivity failures account for roughly 22% of Phase II clinical trial discontinuations, by most analyses of the PhRMA/Tufts-era dataset. That number has been stable for nearly two decades despite improvements in target validation. The reason isn't that medicinal chemists don't care about selectivity (they do, intensely) but that standard in vitro selectivity profiling happens late, after analog series have already committed to a scaffold. By the time a counterscreen panel comes back with a hERG hit or a kinase promiscuity flag, you've optimized yourself into a chemical corner.
We run pan-proteome off-target docking at the hit identification stage. The current structural coverage is 2,312 human proteins with high-confidence crystallographic or cryo-EM structures, supplemented by AlphaFold2 models for an additional 1,800 targets where the mean predicted pLDDT across the binding site residues exceeds 0.85 (expressed as a fraction; 85 on AlphaFold's native 0-100 scale). This post is an honest accounting of what that screen finds, what it misses, and why the latter category matters as much as the former.
Why Standard Selectivity Panels Are Structurally Blind
Most early drug discovery programs rely on SafetyScreen71-type panels — a curated set of 70-odd targets known to produce adverse events when inhibited off-target: hERG, various CYPs, sigma receptors, adrenergic receptors, monoamine transporters. These panels were assembled retrospectively from approved drug safety data. They are useful. But they encode historical failure modes, not prospective structural liability.
The structural liability problem is different. A scaffold that fits snugly into the EGFR ATP pocket has geometric affinity for many other kinases that adopt a comparable DFG-out conformation. A compound optimized against a GPCR's hydrophobic binding cleft shares that shape with dozens of orphan receptors that SafetyScreen71 never tests. Panel-based counterscreening is chemotype-blind: it tests a fixed list of targets, not the structural neighborhood your scaffold actually occupies.
Computational pan-proteome docking inverts this. Instead of asking "does this molecule hit our liability list," we ask "which structures in the solved proteome does this molecule fit into, and why." The answer is often surprising in both directions: expected liabilities that don't materialize, and unexpected structural relatives that would never appear on a standard panel.
Building the Structural Library: Coverage and Quality Thresholds
The 2,312-structure crystallographic library was assembled from PDB entries filtered at resolution ≤ 2.5 Å, with manual curation to remove structures with poorly resolved binding site loops and structures where the co-crystallized ligand displaces more than one contiguous secondary structure element relative to the apo form. That last filter removes roughly 15% of otherwise qualifying structures — these are cases where crystal packing or ligand-induced conformational change makes the deposited structure a poor model of the native binding geometry.
Binding site definition is done with fpocket, followed by validation against the co-crystallized ligand position for structures that have one. Sites where the fpocket prediction misses the crystallographic ligand centroid by more than 3 Å are flagged for manual review. About 8% of the library requires manual site box adjustment — usually for allosteric sites or for proteins where the largest druggable cavity is not where known ligands bind.
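The centroid check described above can be sketched in a few lines. This is a minimal illustration, not our production code: the function names are hypothetical, the pocket center is assumed to have already been extracted from the fpocket output, and the ligand is represented as a plain list of heavy-atom coordinates in Å.

```python
import math

# Cutoff taken from the text: flag sites whose predicted pocket center
# lies more than 3 Å from the co-crystallized ligand centroid.
CENTROID_CUTOFF_A = 3.0

def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) coordinates in Å."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def needs_manual_review(pocket_center, ligand_coords, cutoff=CENTROID_CUTOFF_A):
    """True when the pocket center misses the ligand centroid by more than cutoff."""
    return math.dist(pocket_center, centroid(ligand_coords)) > cutoff

# Toy ligand whose centroid is (11, 13, 9).
ligand = [(10.0, 12.0, 8.0), (11.0, 13.0, 9.0), (12.0, 14.0, 10.0)]
print(needs_manual_review((11.5, 13.0, 9.0), ligand))  # pocket center 0.5 Å away
print(needs_manual_review((16.0, 13.0, 9.0), ligand))  # pocket center 5.0 Å away
```

A site that sits exactly at the cutoff is not flagged here, since the text describes a miss of "more than 3 Å"; that boundary choice is an assumption.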
For AlphaFold-supplemented structures, we apply a strict pLDDT threshold (mean per-residue pLDDT ≥ 0.85 across the putative binding site, i.e. 85 on the 0-100 scale) and only include targets whose structural class has crystallographic homologs that can serve as fold validation. We don't use AlphaFold models for novel fold classes where there's no crystallographic reference, because the uncertainty there is too high to score docking poses reliably.
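As a rough sketch of that inclusion rule: the snippet below averages per-residue pLDDT over the binding-site residues and requires a crystallographic homolog. The function name, the dictionary layout, and the boolean homolog flag are illustrative assumptions. pLDDT is taken on the 0 to 100 scale that AlphaFold model files use, so the cutoff is 85 (equivalently 0.85 as a fraction).

```python
from statistics import mean

# Assumption: pLDDT values are on the 0-100 scale; 85 corresponds to 0.85.
PLDDT_MIN = 85.0

def passes_alphafold_filter(residue_plddt, site_residues, has_xtal_homolog,
                            threshold=PLDDT_MIN):
    """Accept a predicted model only if its fold class has a crystallographic
    homolog and the mean binding-site pLDDT meets the threshold."""
    if not has_xtal_homolog or not site_residues:
        return False
    site_scores = [residue_plddt[r] for r in site_residues]
    return mean(site_scores) >= threshold

# Toy per-residue confidences keyed by residue number.
plddt = {101: 92.0, 102: 88.5, 103: 79.0, 104: 90.5}
print(passes_alphafold_filter(plddt, [101, 102, 104], True))   # confident site
print(passes_alphafold_filter(plddt, [102, 103], True))        # mean below 85
print(passes_alphafold_filter(plddt, [101, 102, 104], False))  # no homolog
```

Averaging can hide a single poorly predicted residue in an otherwise confident site; a per-residue minimum would be a stricter variant, but the text only specifies the average.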
What the Screen Consistently Surfaces
Across 14 programs we've run pan-proteome screens on, a few patterns recur with enough regularity to be worth naming.
Kinase cross-reactivity is more pervasive than expected
If your primary target is a kinase, docking-predicted selectivity indices against the broader kinome are almost always worse than they look from on-target optimization alone. The DFG-in conformation is particularly problematic: scaffolds that achieve sub-nanomolar potency against a DFG-in/αC-in kinase by filling the adenine-mimicking pocket tend to dock favorably against 40-60 other kinases without any selectivity optimization. The physicochemical requirements for kinase potency and kinase selectivity are genuinely in tension: you need polarity for hinge contacts, but that polarity also places binding energy in conserved regions of the ATP site.
We've found that incorporating a kinase tree selectivity map early — scoring against a 480-kinase subset using our GNN scoring function rather than physics-based docking for speed — shifts scaffold selection in ways that measurably reduce kinome promiscuity before any wet-lab work begins.
GPCR off-target signals cluster by hydrophobic volume
For non-kinase primary targets, the most common off-target docking signal we see involves class A GPCRs. This isn't because the program is targeting a GPCR — it's because many CNS-penetrant scaffolds have the molecular weight and hydrophobic volume to fit the transmembrane bundle. When we screen a 300 Da compound with cLogP around 3.5 against GPCR structures, we reliably see top-10 docking poses in 6-12 class A GPCR structures. Whether this predicts binding is a separate question (discussed below), but it flags chemical matter worth monitoring.
Nuclear receptor hits are high-confidence and high-consequence
Nuclear receptor off-target docking signals (particularly androgen receptor, estrogen receptor alpha, and glucocorticoid receptor) tend to be both structurally specific and experimentally validated when we eventually close the loop. The LBD pockets are large, well-defined, and geometrically discriminating. Docking scores in these sites that are more favorable (more negative) than our threshold of −9.5 kcal/mol (Vina units, calibrated against the crystallographic ligand) have translated to measurable receptor activity in reporter assays at rates we estimate around 40-50% in our limited experimental validation set. That's not a hard number, because our validation set is small, but it's high enough that we treat nuclear receptor docking hits as de facto liabilities requiring explicit counterscreening, not advisory notes.
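Because docking scores are negative, the direction of the threshold is easy to get backwards. A minimal sketch of the flagging rule follows, assuming scores are stored per target as Vina-style kcal/mol values keyed by gene symbol (AR, ESR1 for estrogen receptor alpha, NR3C1 for glucocorticoid receptor). The function name and data layout are hypothetical.

```python
# Threshold from the text, in Vina-style kcal/mol. More negative = more favorable.
NR_THRESHOLD = -9.5

# Gene symbols for the three receptors named in the text (illustrative set).
NUCLEAR_RECEPTORS = {"AR", "ESR1", "NR3C1"}

def nuclear_receptor_liabilities(scores, threshold=NR_THRESHOLD):
    """Return nuclear receptor targets whose docking score beats the threshold.

    Assumption: "beats" means strictly more negative than the threshold.
    """
    return sorted(
        target for target, score in scores.items()
        if target in NUCLEAR_RECEPTORS and score < threshold
    )

compound_scores = {"AR": -10.2, "ESR1": -8.9, "NR3C1": -9.5, "CDK2": -11.0}
print(nuclear_receptor_liabilities(compound_scores))
```

In this toy case only AR is flagged: ESR1 is weaker than the threshold, NR3C1 sits exactly on it, and CDK2 is not a nuclear receptor regardless of its score.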
What the Screen Misses — and Why This Matters
Pan-proteome docking is not a substitute for experimental counterscreening. It would be intellectually dishonest to present it as one. The failure modes are structural.
The biggest gap is induced fit. Standard rigid-receptor docking misses binding sites that don't pre-exist in the deposited structure. hERG is the canonical case: drug access to the inner cavity depends on channel gating, and a single experimentally determined structure (cryo-EM, in hERG's case) captures only one conformational state of the pore. Compounds that cause QT prolongation by blocking the channel in a state the structure does not represent are systematically underscored in rigid hERG docking. We use a flexible-receptor protocol for hERG specifically, but this example illustrates a broader problem: any target where the relevant binding-competent conformation is not the deposited structure will be a coverage blind spot.
Covalent off-targets are entirely outside this screen's scope. If your compound has an electrophilic warhead, pan-proteome docking won't capture cysteine-reactive off-targets unless you run covalent docking protocols specifically. We do this for programs where the primary mechanism is covalent — but it's a separate workflow, and the structural coverage of cysteine-proximate binding sites is much lower than for non-covalent targets.
Finally, metabolic intermediates. The screen docks the parent compound. If CYP3A4 generates a reactive metabolite with a different geometry, that metabolite's off-target profile is invisible unless you model it explicitly. For compounds with significant first-pass metabolism or CYP-mediated bioactivation, parent-only docking gives you a partial picture.
Selectivity Index as a First-Pass Filter
The output of a pan-proteome run is a ranked list of off-target docking scores. To make this actionable, we use a selectivity index (SI) defined as the ratio of on-target predicted binding affinity to the predicted affinity of the nth-ranked off-target, where n is set to 5, i.e., the 5th-best off-target hit. An SI above 100-fold at this threshold is our pass criterion for advancement to wet-lab synthesis, and we flag anything below 30-fold for team review regardless of the on-target score.
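One way to turn that definition into a number is sketched below. It assumes scores are free-energy-like values in kcal/mol and converts the score gap into a fold ratio with the standard relation ratio = exp(ΔΔG / RT) at 298.15 K. That conversion, the function names, and the label for the 30 to 100-fold band (which the text does not define) are all assumptions made for illustration.

```python
import math

R_KCAL = 0.0019872   # gas constant in kcal/(mol*K)
TEMP_K = 298.15      # assumed temperature

def selectivity_index(on_target_dg, off_target_dgs, rank=5):
    """Fold selectivity of the on-target score over the rank-th best off-target.

    Scores are kcal/mol, more negative = tighter predicted binding.
    """
    if len(off_target_dgs) < rank:
        raise ValueError("need at least `rank` off-target scores")
    reference = sorted(off_target_dgs)[rank - 1]  # rank-th most negative
    return math.exp((reference - on_target_dg) / (R_KCAL * TEMP_K))

def triage(si, pass_above=100.0, review_below=30.0):
    """Map an SI value onto the two decisions named in the text."""
    if si > pass_above:
        return "advance to synthesis"
    if si < review_below:
        return "flag for team review"
    return "borderline (not specified in the text)"

off_targets = [-9.8, -9.1, -8.7, -8.4, -8.0, -7.5]
for on_target in (-11.0, -9.0):
    si = selectivity_index(on_target, off_targets)
    print(f"on-target {on_target}: SI ~ {si:.0f}-fold -> {triage(si)}")
```

Under this conversion a 100-fold SI corresponds to a gap of roughly 2.7 kcal/mol between the on-target score and the 5th-best off-target score, which is within the noise of many docking scoring functions. That is one reason to treat the index as a triage filter and not a prediction.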
In a recent kinase inhibitor program targeting CDK7, we ran this screen on 84 computational hit candidates before selecting 12 for synthesis. The SI filter eliminated 31 candidates that had on-target GNN scores equivalent to candidates that passed — their scores against CDK2, CDK9, and CDK12 were simply too close to the CDK7 score to justify synthesis. Two of the 12 synthesized compounds returned wet-lab IC50s against CDK7 below 15 nM with greater than 40-fold selectivity against CDK2 in biochemical assay — a hit rate we consider reasonable for a first synthesis round on a competitive kinase target.
We're not claiming the screen predicted those specific IC50 numbers. What it did was filter out the structurally promiscuous candidates before any wet-lab investment, which is the actual job it needs to do.
Scaling the Screen Without Scaling Compute
Docking 10,000 compounds against 2,300 structures would be computationally prohibitive at physics-based docking speed. Our current workflow uses the GNN scoring function for the first pass — scoring all compound-target pairs in a reduced fingerprint representation to generate a rough selectivity landscape — and then runs full physics-based docking only for compound-target pairs that clear a preliminary threshold. This two-stage approach runs a 10,000-compound library against the full structural panel in roughly 18 hours on 4 V100 GPUs. That's still slow by cheminformatics standards, but fast enough to be a standard component of hit triage rather than a special-case analysis.
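The control flow of the two-stage cascade is simple enough to sketch. Here `fast_score` stands in for the GNN scorer and `slow_score` for physics-based docking; both are hypothetical callables supplied by the caller, and the cutoff value and toy score table are invented for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

Scorer = Callable[[str, str], float]

def two_stage_screen(compounds: Sequence[str], targets: Sequence[str],
                     fast_score: Scorer, slow_score: Scorer,
                     prefilter_cutoff: float) -> Dict[Tuple[str, str], float]:
    """Run the cheap scorer on every pair; run the expensive scorer only on
    pairs whose cheap score is at or below the cutoff (lower = more favorable)."""
    refined = {}
    for compound in compounds:
        for target in targets:
            if fast_score(compound, target) <= prefilter_cutoff:
                refined[(compound, target)] = slow_score(compound, target)
    return refined

# Toy stand-ins for the two scorers (illustrative values only).
FAST = {("cpd1", "CDK7"): -9.0, ("cpd1", "CDK2"): -6.0,
        ("cpd2", "CDK7"): -8.5, ("cpd2", "CDK2"): -8.2}
slow_calls = []

def toy_fast(compound, target):
    return FAST[(compound, target)]

def toy_slow(compound, target):
    slow_calls.append((compound, target))  # record which pairs cost real compute
    return FAST[(compound, target)] - 0.5

refined = two_stage_screen(["cpd1", "cpd2"], ["CDK7", "CDK2"],
                           toy_fast, toy_slow, prefilter_cutoff=-8.0)
print(len(slow_calls), "of", len(FAST), "pairs reached the expensive stage")
```

The savings depend entirely on how many pairs the first stage rejects, and any pair the fast scorer wrongly rejects is never seen by the slow one, so the cutoff trades compute against false negatives.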
The limiting factor is no longer compute — it's interpretability. When 2,300 structures return docking scores for 10,000 compounds, the output is a 23-million-entry matrix. Deciding which off-target signals are worth acting on requires domain judgment that doesn't reduce to a score threshold. We've been building visualization tooling to make the structural clustering of off-target hits interpretable at a glance, but this remains the most labor-intensive part of the workflow.
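To make the scale concrete, and to show the kind of coarse summary that helps before any visualization, the sketch below checks the matrix size and collapses one compound's off-target hits into counts per annotated target family. Grouping by family annotation is a crude stand-in for structural clustering, not a description of our tooling; the function name, cutoff, and annotations are hypothetical.

```python
from collections import Counter

# Size of the score matrix described in the text.
n_structures, n_compounds = 2_300, 10_000
n_entries = n_structures * n_compounds
print(n_entries, "entries,", n_entries * 4 / 1e6, "MB as float32")

def hits_by_family(scores, family_of, cutoff):
    """Count targets per family whose score is at or below the cutoff
    (lower = more favorable). Targets without annotation are grouped together."""
    counts = Counter()
    for target, score in scores.items():
        if score <= cutoff:
            counts[family_of.get(target, "unannotated")] += 1
    return counts

# Toy scores for a single compound, with illustrative family annotations.
scores = {"CDK2": -9.1, "CDK9": -8.8, "ADRB2": -8.2, "ESR1": -6.0, "Q9XXXX": -8.4}
family_of = {"CDK2": "kinase", "CDK9": "kinase",
             "ADRB2": "class A GPCR", "ESR1": "nuclear receptor"}
print(hits_by_family(scores, family_of, cutoff=-8.0).most_common())
```

Storage is not the problem at this size. The per-family counts only say where hits concentrate; deciding whether a cluster of kinase hits matters more than a single nuclear receptor hit is still the judgment call described above.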