
Research Notes

Physics-Based Scoring vs. Machine Learning: Not a Competition

Siddhartha Mukherjee
Figure: Physics-based versus ML scoring comparison, showing complementary failure modes on the same target set

There is a recurring argument in computational drug discovery circles about whether physics-based or machine learning scoring is the right approach for binding prediction. The argument usually takes the form of a benchmark comparison on CASF-2016 or the D3R Grand Challenges, with each camp claiming superior performance on the selected metric. This argument is, in our experience, mostly unproductive — because the two method classes fail on different compounds, in different ways, for different reasons, and combining them is almost always better than choosing one.

We've now run parallel physics-based and GNN-based scoring campaigns on seven targets in our pipeline over the past eighteen months. The patterns of agreement and disagreement between the two scoring methods are themselves informative, and the cases where they diverge most sharply are often the most instructive about which method to trust.

What Physics-Based Methods Get Right

By "physics-based scoring" we mean the full range: classical empirical force-field scoring (AutoDock Vina, Glide SP/XP), knowledge-based potentials derived from statistical analysis of crystal structure contacts (DSX, HYDE), and semi-empirical quantum mechanical scoring (xTB-docked geometries rescored with GFN2-xTB). Each operates from different physical assumptions, but all share one property: their predictions do not depend on having seen similar compounds during training.

This training-independence is the core advantage of physics-based methods. For a compound that is genuinely novel — a new scaffold with no close analogs in PDBbind or ChEMBL — the GNN scorer has no choice but to extrapolate from its training distribution, and that extrapolation carries unknown error. The force-field scorer has no training distribution to extrapolate from: it evaluates the van der Waals overlap, hydrogen bond geometry, and electrostatic complementarity of the actual atoms in the actual pose. It may get the absolute affinity wrong (empirical force fields are parameterized for specific atom types and don't capture all quantum mechanical effects), but it's wrong in a way that is predictable from the physics.

A practical manifestation: on our bromodomain-2 (BRD2) campaign, we screened a library of 180,000 compounds that included a ~20,000-compound subset of non-standard macrocyclic and bicyclic scaffolds, chemotypes that are underrepresented in public training data. For this subset, Glide XP and GNN scoring showed only 34% overlap in their top-500 ranked compounds. The GNN tended to rank macrocycles more favorably based on molecular fingerprint similarity to known BRD2 binders; Glide ranked them based on direct geometric scoring of the pose in the BRD2 BD2 domain. When we subsequently tested 40 of these macrocycles biochemically, the Glide scores tracked the experimental results better than the GNN scores did (Pearson r 0.48 vs. 0.31 on the macrocyclic subset specifically), even though the GNN performed better on the overall library benchmark.
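The top-500 overlap figure is simple to compute once both methods have scored the same compounds. The following is a minimal sketch, not our production code: the function name `top_k_overlap` and the toy scores are illustrative, and it assumes docking scores where more negative is better and GNN outputs where higher is better.

```python
def top_k_overlap(scores_a, scores_b, k,
                  lower_is_better_a=True, lower_is_better_b=False):
    """Fraction of the top-k list that two scoring methods have in common.

    scores_a, scores_b: dicts mapping compound ID -> score.
    Assumption: method A is docking-like (more negative = better) and
    method B is a predicted affinity (higher = better) unless flags say otherwise.
    """
    def top_k(scores, lower_is_better):
        ranked = sorted(scores, key=scores.get, reverse=not lower_is_better)
        return set(ranked[:k])

    shared = top_k(scores_a, lower_is_better_a) & top_k(scores_b, lower_is_better_b)
    return len(shared) / k


# Toy data (hypothetical compound IDs and scores).
glide = {"c1": -9.1, "c2": -8.4, "c3": -7.9, "c4": -6.0}
gnn = {"c1": 7.2, "c2": 5.1, "c3": 6.8, "c4": 6.9}
print(top_k_overlap(glide, gnn, k=2))  # 0.5: only c1 is in both top-2 lists
```

Running the same function on a chemotype subset rather than the whole library is what exposes subset-specific disagreement like the macrocycle case above.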

What Machine Learning Scoring Gets Right

For congeneric series — compounds sharing the same core scaffold with systematic substituent changes — GNN scoring consistently outperforms classical docking-based scoring. The reason is intuitive: within a congeneric series, the binding mode is largely conserved, and the affinity differences reflect subtle changes in interaction geometry, hydration, and electronic effects that empirical force fields parameterize crudely. A GNN that has learned from hundreds of measurements on related scaffolds captures these subtle contributions in its node embeddings without requiring explicit physical parameterization of each term.

This is exactly where free energy perturbation (FEP) should also outperform classical scoring, and it does, but at substantially higher computational cost. For a congeneric series of 30-50 compounds where you need reliable rank ordering to guide synthesis, FEP (if you can afford the GPU time) is the reference method. GNN scoring is a practical approximation that achieves 70-80% of FEP's rank-ordering accuracy at 1-2% of the compute. That tradeoff is sensible for mid-stage lead optimization where you're making decisions about dozens of analogs per week.

GNN scoring also handles systematic scaffold hopping better than physics-based methods for in-distribution chemotypes. If you want to identify compounds from a new scaffold class that recapitulate the binding interactions of a known series, the GNN can recognize interaction patterns in the encoded 3D pose without requiring explicit geometric matching of the original pharmacophore. This is useful for diversity-oriented virtual screening where you're not just looking for more of what you already have.

The Divergence Cases Are Diagnostic

We've made it a practice to specifically examine compounds where our physics-based and GNN scoring disagree strongly — compounds ranked in the top decile by one method and bottom decile by the other. These divergence cases fall into predictable categories:

Category 1: GNN High, Glide Low — Novel Scaffold With Fingerprint Similarity to Known Binders

When a compound's 2D fingerprint looks like a known binder but the 3D pose doesn't place it optimally in the binding site, the GNN score may be driven partly by chemical similarity (which is encoded in the node features) rather than by the actual binding geometry. This inflates GNN rankings for compounds that resemble known binders but dock poorly. We treat these as "fingerprint artifacts" and apply a pose quality filter before promoting them.

Category 2: Glide High, GNN Low — Unusual Chemistry in the Training Distribution Tails

Compounds with unusual atom types (e.g., boron, fluorinated heterocycles with unusual electronics, sulfur in rare oxidation states) can produce favorable force-field scores through genuine geometric complementarity but receive low GNN scores because the GNN has seen very few training examples of these atom types and their embeddings are poorly constrained. We cross-check these against explicit quantum mechanical calculations (xTB single-point energies on the docked pose) to decide whether the Glide score is trustworthy.

Category 3: Both High — Strong Candidates

When physics-based and ML scoring agree on a high-ranked compound, the combined signal is more reliable than either alone. In our retrospective analysis, compounds in the top decile of both scores showed approximately 2.5x higher confirmation rate in biochemical assays compared to compounds in the top decile of one method and the middle two quartiles of the other.
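The three categories above can be assigned automatically from the two score lists. This is a sketch under stated assumptions: the decile cutoffs follow the text, but the function names, label strings, and score conventions (Glide more negative is better, GNN higher is better) are illustrative choices rather than a description of our pipeline.

```python
def rank_fractions(scores, higher_is_better=True):
    """Map compound ID -> rank fraction in [0, 1), where 0.0 is the best score."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    n = len(ordered)
    return {cid: i / n for i, cid in enumerate(ordered)}


def classify_divergence(glide, gnn, top=0.10, bottom=0.90):
    """Label compounds in the top decile of both methods, or in the top
    decile of one method and the bottom decile of the other.

    Compounds that fit none of the three categories are left unlabeled.
    """
    glide_rank = rank_fractions(glide, higher_is_better=False)
    gnn_rank = rank_fractions(gnn, higher_is_better=True)
    labels = {}
    for cid in glide_rank.keys() & gnn_rank.keys():
        g, m = glide_rank[cid], gnn_rank[cid]
        if g < top and m < top:
            labels[cid] = "both_high"            # Category 3
        elif m < top and g >= bottom:
            labels[cid] = "gnn_high_glide_low"   # Category 1
        elif g < top and m >= bottom:
            labels[cid] = "glide_high_gnn_low"   # Category 2
    return labels
```

The labels only say where to look. Deciding whether a Category 1 compound is a fingerprint artifact still requires the pose quality check, and a Category 2 compound still needs the quantum mechanical cross-check.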

Ensemble Scoring in Practice

Our current scoring pipeline produces a consensus score for each compound that weights Glide XP and GNN scores differently depending on (a) the structural novelty of the compound relative to the GNN training distribution, and (b) the target class. For targets with sparse PDBbind representation, we up-weight Glide; for targets with dense PDBbind coverage and when screening congeneric series, we up-weight GNN. The weights are not fixed — they're recalibrated when we obtain experimental data from a campaign and can measure each method's actual performance on the specific target.
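As a rough illustration of the weighting idea, the sketch below puts both scores on a common scale with z-scores and mixes them using a single Glide weight. The z-score normalization, the function names, and the use of one target-level weight are assumptions made for the example; the real weights vary per compound and per target as described above.

```python
from statistics import mean, pstdev


def zscores(scores, higher_is_better=True):
    """Standardize scores so that larger z always means 'better'."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    sign = 1.0 if higher_is_better else -1.0
    if sigma == 0:
        return {cid: 0.0 for cid in scores}
    return {cid: sign * (v - mu) / sigma for cid, v in scores.items()}


def consensus(glide, gnn, w_glide):
    """Weighted consensus score; w_glide in [0, 1], GNN gets 1 - w_glide.

    Assumes Glide scores are 'more negative = better' and GNN scores are
    'higher = better'. Only compounds scored by both methods are returned.
    """
    z_glide = zscores(glide, higher_is_better=False)
    z_gnn = zscores(gnn, higher_is_better=True)
    return {
        cid: w_glide * z_glide[cid] + (1.0 - w_glide) * z_gnn[cid]
        for cid in z_glide.keys() & z_gnn.keys()
    }
```

Recalibration then amounts to choosing `w_glide` again once assay data show how each method actually performed on the target.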

The structural novelty assessment uses Tanimoto similarity against the GNN training set as a proxy for applicability domain: compounds with maximum Tanimoto coefficient against any training compound above 0.6 are considered in-distribution; those below 0.4 are flagged for increased Glide weight. Between 0.4 and 0.6 is a grey zone where we use both scores with equal weight and flag the result as higher uncertainty.
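The threshold logic can be sketched in a few lines. Fingerprints are represented here as plain sets of on-bit indices so the example runs without a cheminformatics toolkit. The 0.6 and 0.4 cutoffs and the equal weighting in the grey zone come from the text, while the other two weights are placeholders chosen only for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novelty_zone(fp, training_fps, in_dist=0.6, novel=0.4):
    """Classify a compound by its maximum similarity to the GNN training set."""
    max_sim = max((tanimoto(fp, t) for t in training_fps), default=0.0)
    if max_sim > in_dist:
        return "in_distribution", max_sim
    if max_sim < novel:
        return "novel", max_sim
    return "grey_zone", max_sim


# Glide weight per zone. Only the 0.5 grey-zone value is stated in the text;
# 0.3 and 0.7 are hypothetical placeholders.
GLIDE_WEIGHT = {"in_distribution": 0.3, "grey_zone": 0.5, "novel": 0.7}

training = [{1, 2, 3, 4}, {10, 11, 12}]
zone, sim = novelty_zone({1, 2, 20, 21}, training)
print(zone, round(sim, 2), GLIDE_WEIGHT[zone])
```

Results in the grey zone should also carry an uncertainty flag downstream, which the returned zone label makes straightforward.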

There's an important practical constraint here: building a reliable consensus scoring model requires enough experimental data to validate the weighting scheme on a per-target basis. We can't claim the consensus score is better than either component on a completely new target where we have no experimental feedback yet. In those cases, we default to Glide as the primary filter (because it has no training distribution to overfit to) and use GNN scoring as a secondary selector within the Glide-selected pool — which is approximately the cascade structure we described in the binding affinity models piece.
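The default cascade for a target with no experimental feedback can be written as a two-stage selection. This is a minimal sketch: the function name, the pool sizes, and the score conventions (Glide more negative is better, GNN higher is better) are assumptions for the example.

```python
def cascade_select(glide, gnn, n_primary, n_final):
    """Two-stage cascade: Glide filters first, then the GNN picks within that pool.

    glide, gnn: dicts mapping compound ID -> score.
    Returns up to n_final compound IDs, ordered best-first by GNN score.
    """
    # Stage 1: keep the n_primary best compounds by Glide score.
    primary_pool = sorted(glide, key=glide.get)[:n_primary]
    # Stage 2: rank only that pool by GNN score (skipping unscored compounds).
    scored = [cid for cid in primary_pool if cid in gnn]
    return sorted(scored, key=gnn.get, reverse=True)[:n_final]


# Toy data: "d" has the best GNN score but never passes the Glide filter.
glide = {"a": -9.0, "b": -8.0, "c": -7.0, "d": -3.0}
gnn = {"a": 5.0, "b": 7.5, "c": 6.0, "d": 9.0}
print(cascade_select(glide, gnn, n_primary=3, n_final=2))  # ['b', 'c']
```

The ordering of the stages is the design choice that matters: the GNN can only reorder compounds the physics-based filter has already accepted, so an out-of-distribution GNN error cannot promote a compound on its own.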

The Interpretability Difference

One underweighted advantage of physics-based scoring is interpretability. When Glide gives a compound a poor score, it's possible to decompose that score into its contributing terms and understand which specific geometric issue is causing the problem: a steric clash with Tyr93, a hydrogen bond that's 0.4 Å too long, a charged group buried without a counterpart. This decomposition directly informs medicinal chemistry: it tells you which substitution would fix the problem and how large a change you'd expect.

GNN score decomposition is possible in principle (via attention weights, gradient-based attribution, or message-passing attribution methods), but the connection between "this attention weight on edge X was high" and "make this structural change to improve binding" is not as direct. The GNN score carries more information about likely binding affinity than the force-field score, but less actionable information about what to do when the score is unfavorable.

Both methods are tools with complementary strengths. The question for any given campaign is not "which scoring function should we use" but "which scoring function's failure mode is more dangerous for this particular problem at this particular stage of the program." Asking that question before setting up the pipeline — rather than defaulting to the method that performed best on the last CASF benchmark — is the practical discipline that makes the difference between a productive computational campaign and a collection of compounds that don't confirm.
