AlphaFold Changed Our Inputs. It Didn't Change the Scoring Problem.

When AlphaFold2 was released in 2021 and its successor models followed in 2022-2023, the structural biology community's initial response was appropriately calibrated: this solves a specific and important problem — predicting the folded structure of a protein from sequence — and it solves it well enough to be practically useful. The drug discovery community's response was less calibrated. A significant portion of commentary implied that AlphaFold had essentially solved the SBDD problem at the input stage, that we now had structures for all tractable drug targets, and that the bottleneck had shifted elsewhere.

The structural coverage claim is largely correct. The bottleneck claim requires more careful examination.

What AlphaFold solved was the structure prediction problem for the apo protein fold. What it did not solve — and what remains unsolved — is the scoring problem: given a protein structure and a proposed ligand binding pose, how accurately can we predict the binding affinity? Those are different problems. The first is now largely tractable; the second remains genuinely hard. The confusion between them has led some groups to set up docking pipelines on AlphaFold structures with the same protocols used for high-resolution crystal structures, and to be surprised when the hit confirmation rates are disappointing.

What AlphaFold Structures Are and Are Not

AlphaFold2 predicts the ground-state folded structure of a monomeric protein with remarkable accuracy on global fold metrics: TM-scores against experimental structures in the range of 0.92-0.96 for well-folded proteins, and backbone RMSD of 1-2 Å on average. The pLDDT confidence metric correlates well with local structural accuracy. Regions with pLDDT > 90 are generally reliable to within ~1 Å backbone RMSD. Regions with pLDDT < 70 are low-confidence: they are often disordered in solution (particularly below 50), and the predicted conformation may not reflect the physiological structure.

The critical limitation for drug discovery applications is that AlphaFold2 predicts a single static conformation with no ligand present, which is effectively an apo model. It was trained on structures deposited in the PDB, a mix of apo entries and entries co-crystallized with small molecules, so the prediction carries no information about which state a given drug-like ligand would select. The holo conformation, the structure the protein adopts when a drug-like ligand is bound, can differ substantially from the apo structure. For GPCR-family targets, the transmembrane helix rearrangements between apo and ligand-bound states can be 3-5 Å at key residues. For kinases, DFG-flip conformational changes shift the binding pocket geometry by several angstroms. AlphaFold doesn't predict these ligand-induced changes because they weren't the problem it was trained to solve.

The consequence for docking is straightforward: docking into an AlphaFold apo structure for a target that undergoes significant induced-fit upon ligand binding will produce poses that are systematically wrong for compounds that engage the holo conformation. The scoring function will evaluate a pose against the wrong receptor geometry and produce scores that don't reflect actual binding.

How We Assess AlphaFold Structure Quality for Docking

Before starting any campaign on an AlphaFold-derived structure, we run a structured quality assessment with three components:

Binding Site pLDDT Distribution

We extract all residues within 6 Å of the predicted binding site centroid (identified from SiteMap) and compute the mean pLDDT and fraction of residues with pLDDT > 85 in that subset. For binding sites where the mean pLDDT is above 88 and >80% of site-defining residues have pLDDT > 85, we consider the site reliable for standard docking. For sites where the pLDDT distribution has a significant low-confidence tail, we flag the campaign as high-uncertainty and adjust our confidence thresholds accordingly.
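A minimal sketch of this check, under stated assumptions: each residue is represented by a single coordinate (for example the C-alpha position, which is our choice for illustration, not a stated part of the protocol), and the pLDDT values and site centroid have already been parsed from the model file and the site-detection output. The function name and input layout are hypothetical.

```python
import math
from statistics import mean


def site_plddt_summary(residues, centroid, radius=6.0, high_cutoff=85.0):
    """Summarise pLDDT for residues near a binding-site centroid.

    residues: iterable of (residue_id, (x, y, z), plddt) tuples, with one
              representative coordinate per residue (assumed: C-alpha).
    centroid: (x, y, z) of the predicted binding-site centroid.
    """
    # Keep the pLDDT of every residue whose coordinate lies within the radius.
    site = [plddt for _, xyz, plddt in residues
            if math.dist(xyz, centroid) <= radius]
    if not site:
        raise ValueError("no residues within radius of the centroid")

    mean_plddt = mean(site)
    frac_high = sum(p > high_cutoff for p in site) / len(site)
    return {
        "n_residues": len(site),
        "mean_plddt": mean_plddt,
        "frac_high": frac_high,
        # Criteria from the text: mean above 88 and >80% of residues above 85.
        "reliable": mean_plddt > 88.0 and frac_high > 0.80,
    }
```

A site that fails the `reliable` test is not discarded; it is the trigger for treating the campaign as high-uncertainty.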

Structural Homolog Comparison

We BLAST the target sequence against the PDB and retrieve the top 5-10 homologs by sequence identity that have experimental crystal structures. For each homolog, we align the binding site and compute the backbone RMSD between the AlphaFold prediction and the experimental homolog at the binding site residues. High binding-site RMSD against experimental homologs (typically > 2 Å at key interaction residues) is a red flag indicating the AlphaFold prediction may be in a conformation dissimilar to what a ligand would induce.
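The RMSD comparison can be sketched as follows. This assumes the hard parts are already done: the homolog has been superposed onto the AlphaFold model, and the binding-site backbone atoms have been mapped one-to-one between the two structures, so both inputs are equal-length coordinate lists in the same order. The function names and the dictionary of homologs are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two pre-superposed, equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def flag_homologs(model_site, homolog_sites, threshold=2.0):
    """Compare the model's binding site against each experimental homolog.

    model_site:    backbone coordinates of the AlphaFold binding-site residues.
    homolog_sites: dict of homolog id -> matching coordinates, already
                   superposed onto the model (assumption of this sketch).
    Returns (per-homolog RMSD dict, sorted ids exceeding the threshold).
    """
    values = {hid: rmsd(model_site, coords)
              for hid, coords in homolog_sites.items()}
    flagged = sorted(hid for hid, value in values.items() if value > threshold)
    return values, flagged
```

The 2 Å default mirrors the red-flag threshold above; a single flagged homolog is a prompt to inspect which residues diverge, not an automatic rejection.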

MD-Based Ensemble Generation for High-Uncertainty Sites

For targets where the AlphaFold structure quality is uncertain at the binding site, we run a short MD relaxation (50-100 ns, explicit solvent) starting from the AlphaFold structure to generate a conformational ensemble. The ensemble is then clustered and the top 3-5 cluster centroids are used as docking receptors. This ensemble docking approach partially compensates for the static AlphaFold conformation by sampling nearby conformational space — not by predicting the true holo conformation, but by reducing sensitivity to any single conformation.

This approach adds substantially to campaign setup time. For a target where we have a high-resolution co-crystal structure, we skip the AlphaFold quality assessment entirely and use the experimental structure. We reserve the AlphaFold route for targets that are genuinely structurally uncharacterized, and for those targets we treat the MD relaxation step as essential rather than optional.
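The clustering step of the ensemble workflow can be sketched as below. The text does not name a clustering algorithm, so this uses a GROMOS-style neighbor-count scheme as a stand-in: repeatedly take the frame with the most neighbors within an RMSD cutoff as a cluster center, remove it and its neighbors, and continue. The pairwise binding-site RMSD matrix is assumed to be precomputed from the trajectory; the function names and the cutoff value are hypothetical.

```python
def neighbor_count_clusters(dist, cutoff):
    """GROMOS-style clustering on a symmetric pairwise RMSD matrix.

    dist:   square list-of-lists, dist[i][j] = RMSD between frames i and j.
    cutoff: frames within this RMSD of a center join its cluster.
    Returns a list of (center_frame, member_frames), largest cluster first.
    """
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        # Neighbors of each remaining frame (each frame counts itself).
        neighbors = {i: [j for j in remaining if dist[i][j] <= cutoff]
                     for i in remaining}
        # Center = frame with the most neighbors; ties go to the lowest index.
        center = max(sorted(remaining), key=lambda i: len(neighbors[i]))
        members = sorted(neighbors[center])
        clusters.append((center, members))
        remaining -= set(members)
    return clusters


def pick_receptor_frames(dist, cutoff, n_receptors=5):
    """Frame indices of the centers of the n most populated clusters."""
    return [center for center, _ in
            neighbor_count_clusters(dist, cutoff)[:n_receptors]]
```

Because each new cluster is drawn from what is left after the previous one, clusters come out in non-increasing size order, so taking the first 3-5 centers gives the most populated conformational states.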

The Scoring Problem Remains the Scoring Problem

Improving the receptor structure inputs to docking reduces one source of error. It does not reduce the inherent limitations of the scoring function. Even with a perfect protein structure, such as a high-resolution experimental co-crystal, the Pearson r between docking score and measured K_d across a diverse compound library is typically 0.3-0.5 for classical empirical scoring functions, and 0.6-0.75 for well-trained GNN-based rescorers. These numbers don't change because the receptor input improved.
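To make the metric concrete, here is a minimal sketch of how such a correlation could be computed. Two conventions in it are our assumptions rather than anything fixed by the numbers above: affinities are converted to pK_d (the negative log10 of K_d in molar units), and docking scores are negated so that higher means better on both axes.

```python
import math


def pkd(kd_molar):
    """Convert a dissociation constant in molar units to pKd."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_molar)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def score_affinity_correlation(docking_scores, kd_values_molar):
    """Pearson r between negated docking scores and pKd (assumed convention)."""
    return pearson_r([-s for s in docking_scores],
                     [pkd(kd) for kd in kd_values_molar])
```

For scale: the square of r is the fraction of variance explained by a linear fit, so r = 0.5 accounts for only 25% of the variance in measured affinity, and r = 0.75 for about 56%.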

The confusion between the structure prediction problem (now largely solved) and the scoring problem (not solved) matters because it affects what we think the field still needs. A group that believes AlphaFold has solved the SBDD input problem may focus resources on better structure prediction (marginal gains) when the larger opportunity is in better scoring functions and better uncertainty quantification around scores. The input quality ceiling is now higher; we haven't reached the scoring function ceiling.

From our structural biology perspective, the AlphaFold models that are most useful are not the ones we use to replace experimental structures — we still obtain crystal structures and cryo-EM data on all our programs where the structural data will influence lead optimization decisions. The AlphaFold models are most useful for two specific applications: (1) early target assessment when we're evaluating whether a target is likely tractable before committing to a full structural program, and (2) homology modeling of closely related family members when we need to assess selectivity against off-targets for which no experimental structure exists.

AlphaFold-Multimer and the Protein-Protein Interface Problem

AlphaFold-Multimer extends structure prediction to protein complexes, which has direct relevance for PPI (protein-protein interaction) drug discovery. Predicting the interface geometry of a protein complex reasonably well is a prerequisite for identifying small molecules that can disrupt or stabilize that interface. Before AlphaFold-Multimer, PPI drug discovery at the computational stage required either an experimental complex structure or extensive homology modeling from partial structural data.

The accuracy of AlphaFold-Multimer on interfaces is lower than on monomeric folds, and for interface residues the ipTM score (interface predicted TM-score) is a better quality indicator than pLDDT. For well-defined obligate complexes with a strong coevolutionary signal, the interface predictions are useful. For transient or low-affinity complexes, which are often the more interesting PPI drug targets, the interface prediction accuracy degrades significantly because the training data (experimental complex structures) is biased toward stable, well-crystallized complexes.
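For reference, the model confidence that AlphaFold-Multimer uses to rank its own predictions weights the interface term heavily: 0.8 × ipTM + 0.2 × pTM. A minimal sketch of ranking candidate complex models by that score follows. It assumes ipTM and pTM have already been parsed from the prediction output; the dictionary layout and function names are hypothetical, and no acceptance threshold is implied.

```python
def ranking_confidence(iptm, ptm):
    """AlphaFold-Multimer style model confidence: 0.8 * ipTM + 0.2 * pTM."""
    for name, value in (("iptm", iptm), ("ptm", ptm)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return 0.8 * iptm + 0.2 * ptm


def rank_models(models):
    """Sort model records (dicts with 'iptm' and 'ptm') best-first."""
    return sorted(
        models,
        key=lambda m: ranking_confidence(m["iptm"], m["ptm"]),
        reverse=True,
    )
```

A model with a well-predicted fold but a poor interface (high pTM, low ipTM) ranks below one with a confident interface, which is the behavior we want when the interface is the docking site.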

This is an active area where we expect the methods to improve substantially as more diverse complex structures are added to training sets and as physics-based refinement is integrated into the prediction pipeline. For our current programs, we use AlphaFold-Multimer interface predictions as hypothesis generators — they identify candidate interaction geometries that we then validate with orthogonal approaches (HDX-MS, cross-linking MS, mutagenesis data) before treating the interface model as a reliable docking receptor for small molecule virtual screening.

What Cryo-EM Added That AlphaFold Doesn't Replace

Single-particle cryo-EM has become the dominant structural method for membrane proteins and large complexes, and its resolution has reached 1.5-2 Å for favorable specimens, which is sufficient to place ligands and water molecules confidently. AlphaFold predictions for the same targets are competitive on backbone accuracy but miss two things that cryo-EM provides and that matter for drug discovery: the actual conformational state the protein adopts under ligand-bound conditions, and the experimental density map that confirms the ligand binding mode directly.

We solve cryo-EM structures of drug-target complexes at critical decision points in our programs — typically after identifying a lead compound class, to confirm that the computationally predicted binding mode is correct and to reveal any induced-fit changes that should update the pharmacophore model. The computational predictions are the hypothesis; the cryo-EM or X-ray structure is the test. That relationship doesn't change because AlphaFold made it cheaper to generate an initial structural hypothesis.

AlphaFold made drug discovery better by expanding the scope of targets accessible to structure-based methods. It is a genuine advance. What it didn't change is the fundamental challenge of the field: predicting whether a molecule will bind, how tightly, and with what selectivity profile. That remains hard, and the tools that address it — better scoring functions, better uncertainty quantification, better models of receptor flexibility — are where the remaining work lies.