Every few months a new benchmarking paper appears comparing docking methods on CASF-2016 or PDBbind, and every few months the conclusions subtly contradict each other. The disagreements are predictable once you understand that binding affinity prediction is not a single problem — it's at least three problems conflated under one label: pose reproduction, ranking within a congeneric series, and absolute affinity prediction. Methods that excel at one often perform poorly at another, and RMSE on a curated benchmark dataset tells you almost nothing about campaign-level hit rates on novel targets.
This is a practitioner's perspective, not a benchmark paper. I'll describe what each method class does well, where it fails in actual campaigns, and why the training data coverage problem doesn't get enough attention in most comparisons.
Classical Docking: AutoDock Vina and Glide
AutoDock Vina remains the most widely used open-source docking engine, and for good reason: it's fast (seconds per compound on modern hardware), well-documented, and its performance on pose reproduction benchmarks is competitive with commercial alternatives for rigid targets with well-defined binding pockets. Vina's scoring function uses a Gaussian steric term, hydrogen bond term, hydrophobic term, and a torsional entropy penalty. Default Vina achieves median RMSD of ~1.8 Å on CASF-2016 pose reproduction benchmarks, which is adequate for most virtual screening applications where you need the top-ranked pose to be roughly correct.
Glide XP adds several features that matter in practice: an explicit solvation model based on OPLS force fields, a strain energy correction that penalizes energetically strained ligand conformations, and a more granular hydrogen bond geometry term. In our experience, Glide XP reduces false positives from strain-induced favorable poses — compounds that score well in Vina because the strain penalty is underestimated, but which are actually unfavorable when the conformational cost is properly accounted for. The tradeoff is throughput: Glide XP is roughly 20-50x slower than Vina per compound, which matters when screening multi-million compound libraries.
Neither method handles receptor flexibility well. Both assume the receptor remains essentially rigid during docking, with at most rotamer-level side-chain sampling (via GLIDE's induced-fit protocol or Vina's --flex option for specified residues). This is adequate for rigid binding pockets like the ATP-binding site of most kinases, and inadequate for flexible loops, allosteric sites, or targets with large apo-to-holo conformational changes.
Free Energy Perturbation: FEP+ and OpenFE
Relative FEP (FEP+, Schrodinger; or the open-source OpenFE platform) computes the free energy difference between two closely related compounds using alchemical perturbation paths through enhanced sampling MD. The accuracy achievable — typically RMSE of 0.8-1.2 kcal/mol on congeneric series benchmarks — is substantially better than docking scores for compounds within the applicability domain of the perturbation (roughly: compounds sharing the same core scaffold, differing at one or two substituents).
FEP is not a virtual screening method. It requires a high-quality starting pose, an accurate protein structure, and significant compute (GPU-hours per pair). We use FEP for late-stage hit-to-lead decisions, comparing ~10-20 close analogs of a confirmed binder to guide synthesis priorities. Applying FEP to rank-order compounds from a diverse virtual library is computationally intractable and methodologically inappropriate — the perturbation assumptions break down when the compounds are not structurally similar.
The more important limitation is that FEP accuracy degrades significantly for compounds that induce receptor conformational changes, for binding events that displace structural waters non-trivially, or for charged species where the long-range electrostatics are sensitive to force field parameters. RMSE of 0.8 kcal/mol on a curated benchmark dataset of drug-like compounds at well-characterized targets is not the same as RMSE of 0.8 kcal/mol on your actual lead series at your actual target.
GNN-Based Scoring Models
Graph neural network scoring models have become a productive area since KGCNN and the original work on SchNet and DimeNet established that 3D molecular graph representations could be learned end-to-end for property prediction. For binding affinity, the key architectural decision is whether the GNN operates on the ligand graph alone (using the docked pose coordinates as inputs but treating the receptor implicitly), or on the full protein-ligand interaction graph with message passing across residue-ligand edges.
Full protein-ligand GNNs — like the PotentialNet family, or the IGN architecture — can in principle capture interaction geometry that classical scoring functions miss: non-canonical hydrogen bond geometries, pi-stacking arrangements, and water-mediated contacts. In practice, their performance is strongly dependent on training data distribution. A model trained exclusively on PDBbind with standard train/test splits will show high Pearson r (0.80+) on CASF-2016 — but much of that performance reflects the model having learned kinase-like binding modes that dominate the training set, not generalizable binding physics.
We train our GNN scorer on a hybrid dataset: PDBbind as the base, augmented with our own experimental data and with carefully curated ChEMBL entries where a crystal structure can be reliably associated with a measured binding affinity. The augmentation matters most for the target families we actually work on — bromodomains, FBDD fragment hits, and allosteric sites — which are underrepresented in PDBbind. Without augmentation, we found our GNN model performed comparably to Glide XP on those targets; with augmentation, we measured a consistent improvement of 0.12-0.18 Pearson r on held-out data.
Message Passing Depth and Overfitting
One underappreciated issue in GNN scoring is the relationship between message passing depth and overfitting on the binding pose. With 3-4 message passing layers, the model aggregates information from atoms within approximately 4 bond hops of each node — sufficient to capture local interaction geometry. With deeper message passing (6+ layers), the model starts aggregating global scaffold information, which can improve correlation with affinity on in-distribution data but also means the model is partially doing molecular property prediction from 2D scaffold features, not 3D binding geometry. Disentangling these contributions is important for understanding whether a GNN improvement reflects better binding geometry capture or better ligand property encoding.
The Training Data Coverage Problem
Every method comparison that uses CASF or PDBbind as the test set is subject to training contamination in ways that are harder to control than standard random splits. Crystal structures and affinity data for a given target family tend to cluster in both structural space (same protein family, similar binding pocket geometry) and chemical space (same scaffold families developed by the same medicinal chemistry programs). When you split randomly and compute Pearson r, you're measuring performance on test compounds that are chemically similar to training compounds from the same protein family — not generalization to new targets or new chemical series.
A more useful benchmark for our purposes is the TargetDiff protocol: hold out entire protein families from training, test on novel families. Under that protocol, GNN models show significantly degraded performance (Pearson r typically drops by 0.15-0.25) compared to standard random-split benchmarks. Vina's performance degrades proportionally less, which is somewhat expected since Vina is not trained on affinity data at all — it can't overfit to the training distribution because it has no training distribution for affinities.
This is the nuance most benchmark papers miss: methods that score well on PDBbind random splits may be doing sophisticated interpolation within a covered chemical/structural space, not learning the underlying physics of binding. For virtual screening on novel target families or with novel chemotype libraries, the actual performance gap between methods is smaller than benchmarks suggest — and the failure mode is more symmetric.
How We Actually Combine These Methods
Our current campaign workflow for a new target uses Vina for primary docking of a multi-million compound library (sub-library selected by 2D fingerprint clustering to ensure chemical diversity), Glide XP rescoring of the top 5% by Vina score, then our GNN rescorer on the top 0.5% from Glide. The output of this cascade is a pool of 200-500 compounds, which we then cluster by scaffold and apply SAScore filtering (synthetic accessibility score, Ertl & Schuffenhauer) to remove synthetically impractical structures before ranking.
The GNN rescoring step does the most work for scaffold hopping: when the Glide-top compounds cluster heavily around one known scaffold type, the GNN can surface chemically distinct compounds that score favorably at the interaction geometry level but were ranked below them by Glide's physicochemical terms. This happens because Glide's desolvation and strain terms are calibrated against the training distribution of drug-like compounds, which can disadvantage non-standard scaffolds even when their binding geometry is sound.
We're not claiming this cascade is optimal — it's the current operational state after iterating on three target programs. The key principle is that each method in the cascade is being asked to do the job it was designed for: Vina for rapid geometric filtering, Glide for physicochemical plausibility, GNN for binding geometry rescoring. Using any single method to do all three leads to the false positive rates that make virtual screening campaigns disappointing.