The standard ADMET panel used in early-stage computational screening — CYP1A2, 2C9, 2C19, 2D6, 3A4 inhibition; aqueous solubility; Caco-2 permeability; hERG inhibition — covers the failure modes that were most prevalent in drug development programs in the late 1990s and early 2000s. Lipinski's rule-of-five was codified in 1997 on a dataset of orally bioavailable compounds from that era. The ADMET tooling that crystallized around those filters was essentially solving a problem that was already being recognized and actively avoided by experienced medicinal chemists.
The result is that computational ADMET triage in 2024 is very good at filtering out compounds that a skilled medicinal chemist would have flagged anyway — and less good at predicting the endpoints that actually drive Phase I attrition in modern programs, which have systematically moved away from the worst Ro5 violators and classic CYP liabilities.
What the Phase I Failure Data Actually Shows
Multiple retrospective analyses of Phase I failures over the past decade consistently attribute roughly 30-40% of failures to safety/toxicology issues that were not predicted computationally, and another 20-25% to PK failures that went beyond the standard ADMET panel. The specific endpoints that appear repeatedly in these attributions are not CYP inhibition or aqueous solubility — those are caught. They are:
- Reactive metabolite formation: Electrophilic intermediates (quinones, epoxides, aldehydes, nitroso species) generated by oxidative metabolism that covalently modify proteins. CYP-mediated bioactivation to reactive metabolites is mechanistically distinct from CYP inhibition, and most standard ADMET tools predict the latter but not the former with any reliability.
- Mitochondrial toxicity: Uncoupling of oxidative phosphorylation, inhibition of respiratory Complexes I-V, or disruption of the mitochondrial membrane potential. These endpoints have become more prominent as the candidate pool has shifted toward more lipophilic, mitochondria-penetrating compounds. Standard ADMET panels do not include a mitochondrial liability endpoint.
- Transporter interactions beyond P-gp: P-glycoprotein efflux is included in many ADMET panels. Organic anion-transporting polypeptides (OATP1B1, OATP1B3), organic cation transporters (OCT1, OCT2), and BCRP are underrepresented in training data and not routinely predicted. These transporters affect hepatic uptake, renal elimination, and CNS penetration in ways that can dramatically alter the actual PK of an otherwise viable compound.
- Plasma protein binding at high binding fraction: fu (unbound fraction) prediction is included in some tools (e.g., pkCSM) but with poor accuracy at the extremes. Predictions of fu < 1% are likely to carry large relative errors because the training data at those binding fractions is sparse. Yet it's precisely at high protein binding that small absolute errors in fu become large fold-errors in free drug exposure, and PK becomes unpredictable from standard clearance estimates.
- CYP time-dependent inhibition (TDI): Mechanism-based inhibition that builds up irreversibly over time is distinct from reversible CYP inhibition. TDI risk is substantially harder to predict because it requires modeling the reactive intermediate chemistry, not just the inhibitory interaction. Most models trained on reversible IC50 data do not capture TDI.
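The sensitivity of the high-binding regime noted in the plasma protein binding item above can be illustrated with simple arithmetic. The sketch below is illustrative only: the function name and the fu values are hypothetical, chosen to show how the same absolute error in fu translates into very different fold-errors in unbound fraction.

```python
def fold_error_in_free_fraction(fu_true: float, fu_pred: float) -> float:
    """Fold-error (always >= 1) between predicted and true unbound fraction."""
    if fu_true <= 0 or fu_pred <= 0:
        raise ValueError("unbound fractions must be positive")
    ratio = fu_pred / fu_true
    return max(ratio, 1.0 / ratio)


# Hypothetical example: the same absolute error (0.5 percentage points)
# applied at two different binding levels.
abs_error = 0.005
moderate = fold_error_in_free_fraction(0.20, 0.20 + abs_error)    # 80% bound
extreme = fold_error_in_free_fraction(0.005, 0.005 + abs_error)   # 99.5% bound

print(f"80% bound:   {moderate:.3f}-fold error in fu")
print(f"99.5% bound: {extreme:.3f}-fold error in fu")
```

With these made-up inputs, the error is a few percent for the moderately bound compound and a factor of two for the highly bound one, even though the absolute error in fu is identical.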
Why Standard Tool Coverage Stopped Here
The practical reason ADMET tools converged on the standard eight-or-so endpoints is data availability. CYP inhibition, solubility, Caco-2 permeability, and hERG are routinely measured in high-throughput assays at pharmaceutical companies and have generated large public and proprietary datasets. The ChEMBL bioactivity database contains hundreds of thousands of entries for CYP1A2, 2C9, and 3A4 inhibition. It contains far fewer entries for OATP inhibition, reactive metabolite trapping, or mitochondrial toxicity assays.
Machine learning models require training data. Endpoints with sparse, heterogeneous, or assay-specific data generate models with wide confidence intervals and narrow applicability domains. Publishing a model for a sparse endpoint is a liability: it will be used beyond its applicability domain, produce confident-looking wrong predictions, and reduce trust in the tool overall. The rational choice, given the incentives of a commercial ADMET tool company, is to focus on endpoints where the models work well.
We're not saying this is wrong in principle. We're saying it means the ADMET tooling landscape in 2024 has a systematic blind spot in exactly the endpoints that have become the dominant failure modes as the field has improved on the classic liabilities.
The Reactive Metabolite Problem in Detail
Reactive metabolite formation deserves particular attention because it is (a) mechanistically understandable, (b) predictable in principle from structural features, but (c) poorly served by current tools.
The core problem is that predicting reactive metabolite risk requires predicting not just CYP metabolism (which site is oxidized) but the chemical reactivity of the metabolite (whether that product is electrophilic). This is a two-step prediction, and the error compounds. Existing models for site of metabolism (SoM) prediction, such as SMARTCyp, FAME, and other P450-metabolism tools, have acceptable accuracy for predicting which aromatic ring or aliphatic position is preferentially oxidized. What they don't predict well is whether the oxidized product is chemically reactive: a quinone methide, an epoxide, or an N-hydroxylated aniline that can oxidize further to a nitroso species.
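To make the compounding concrete, here is a minimal sketch under the simplifying assumption that the two steps err independently. The accuracy figures are hypothetical placeholders, not measured performance of any tool named above, and `chained_accuracy` is an illustrative helper rather than part of any library.

```python
def chained_accuracy(p_som_correct: float, p_reactivity_correct: float) -> float:
    """Probability that both steps are right, assuming independent errors."""
    for p in (p_som_correct, p_reactivity_correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_som_correct * p_reactivity_correct


# Hypothetical numbers, for illustration only.
som_correct = 0.80         # assumed chance the predicted site of metabolism is right
reactivity_correct = 0.70  # assumed chance the electrophilicity call is right, given the site

combined = chained_accuracy(som_correct, reactivity_correct)
print(f"combined accuracy: {combined:.2f}")
```

Under these assumed values, two individually respectable steps yield a combined accuracy close to a coin flip. Real errors are unlikely to be fully independent, so this should be read as an intuition aid rather than an estimate.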
Structural alerts for reactive metabolites (e.g., the Derek Nexus alert library, or the structural alerts in ADMET Predictor) catch the obvious cases — anilines, furans, thiophenes, methylenedioxyphenyl groups. These alerts are necessary filters. They're not sufficient filters. In our own work, we've seen reactive metabolite signals on structurally clean-looking compounds that triggered none of the standard alerts, where the reactive intermediate was traced to an unusual oxidation sequence that generated a Michael acceptor in the metabolite. The key message: structural alerts for reactive metabolites are a lower bound, not a prediction.
Transporter Coverage: The Practical Gap
Hepatic OATP inhibition is now recognized as a significant DDI liability after the FDA issued updated guidance on transporter DDI studies in 2020 and 2022. Computational models for OATP1B1 and OATP1B3 inhibition are available — notably from the Molecular Descriptors group at Uppsala and from several commercial vendors — but the training data is heterogeneous (different assay formats, different transfected cell lines, different substrate probes) and the models show AUROC in the 0.72-0.78 range, which is adequate for rough triage but not for confident lead selection decisions.
The practical consequence: OATP liability compounds that pass early computational triage get synthesized, characterized, and sometimes advance to animal PK studies before the liability is identified. At that point, the medicinal chemistry options for removing OATP liability without affecting binding affinity are limited and often require scaffold redesign. If OATP prediction had been integrated at the virtual screening stage — even with a noisy model — scaffold choices that minimized OATP risk could have been prioritized earlier.
What a More Complete ADMET Panel Looks Like
The panel we use in our integrated pipeline covers 22 endpoints. Beyond the standard eight, we include:
- Reactive metabolite structural alert score (rule-based, not ML) + SoM-weighted electrophilicity estimate
- Mitochondrial membrane potential disruption (predicted from QSAR models trained on MitoTracker assay data from public compound sets)
- OATP1B1 and OATP1B3 inhibition (classification, threshold IC50 > 10 µM = low risk)
- OCT2 inhibition (relevant for renal clearance prediction)
- BCRP efflux (complements P-gp for CNS and oral bioavailability assessment)
- CYP3A4 time-dependent inhibition flag (structural alert based on the mechanism-based inhibitor literature, not ML)
- Plasma protein binding at high fraction (fu prediction using a Gaussian process model with explicit uncertainty quantification)
- Hepatocyte intrinsic clearance (two-compartment estimate combining microsomal data prediction and passive permeability)
- Phospholipidosis risk (amphiphilic compound detection using the Ploemen structural alert plus ClogP/pKa)
Several of these are rule-based rather than ML because the training data doesn't support a reliable ML model. We are explicit about that in the output: each endpoint carries a confidence indicator that reflects whether it's coming from a well-trained model (e.g., CYP2C9 inhibition) or from a structural alert with known limitations (e.g., reactive metabolite, TDI flag). Treating a structural alert output and an ML model output identically in a scoring function would give false confidence in the less reliable endpoints.
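One way to keep the two kinds of output apart in a scoring function is sketched below. This is not our pipeline's code: the `Source` enum, the `EndpointResult` type, the `summarize` function, and all risk values are hypothetical names and numbers chosen for illustration. The idea shown is simply that ML risks are averaged into a score while structural alerts are reported as separate flags instead of being averaged in.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    ML_MODEL = "ml_model"        # trained model with validated performance
    STRUCTURAL_ALERT = "alert"   # rule-based flag with known limitations


@dataclass(frozen=True)
class EndpointResult:
    endpoint: str
    risk: float      # 0.0 (low) to 1.0 (high); for alerts, 1.0 means the alert fired
    source: Source


def summarize(results: list[EndpointResult]) -> dict:
    """Average ML risks into one score; list fired alerts separately."""
    ml = [r for r in results if r.source is Source.ML_MODEL]
    alerts = [r.endpoint for r in results
              if r.source is Source.STRUCTURAL_ALERT and r.risk > 0.0]
    ml_score = sum(r.risk for r in ml) / len(ml) if ml else None
    return {"ml_risk_score": ml_score, "alert_flags": alerts}


# Hypothetical panel output for a single compound.
panel = [
    EndpointResult("CYP2C9 inhibition", 0.2, Source.ML_MODEL),
    EndpointResult("hERG inhibition", 0.4, Source.ML_MODEL),
    EndpointResult("Reactive metabolite alert", 1.0, Source.STRUCTURAL_ALERT),
    EndpointResult("CYP3A4 TDI flag", 0.0, Source.STRUCTURAL_ALERT),
]
summary = summarize(panel)
print(summary)
```

Keeping alerts out of the numeric average means a fired alert cannot be diluted by several benign ML predictions, and an alert that did not fire is never read as positive evidence of safety.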
Uncertainty Quantification Is Not Optional
The single most under-discussed aspect of ADMET prediction in drug discovery is uncertainty quantification. A point prediction of CLint = 12 µL/min/mg is nearly useless without a confidence interval. A compound with predicted CLint of 12 ± 3 µL/min/mg (tight model, good applicability domain) should be treated very differently from one with predicted CLint of 12 ± 25 µL/min/mg (wide uncertainty, compound at the edge of the applicability domain).
Gaussian processes and ensemble methods can provide calibrated uncertainties on ADMET predictions, but most commercial tools report only point predictions. This creates a systematic bias in how results are interpreted: users treat all predictions as equally reliable, so they cannot tell in advance which ones are most likely to be wrong. When our pipeline flags high uncertainty on a specific compound-endpoint pair, that's actionable information: it means the compound sits in a region of chemical space where the model's training data is sparse, and the prediction should be experimentally verified before synthesis decisions are made on that endpoint.
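As a rough illustration of the ensemble route, the sketch below summarizes the predictions of several ensemble members as a mean and a spread, and flags the compound-endpoint pair for experimental follow-up when the spread is large relative to the mean. The function name, the 0.5 relative-spread threshold, and the example values are all assumptions made for this example. Raw ensemble spread is also not guaranteed to be calibrated, so in practice it would need to be checked against held-out data.

```python
from statistics import mean, stdev


def ensemble_summary(predictions: list[float], max_relative_sd: float = 0.5):
    """Return (mean, sample SD, needs_assay) for one compound-endpoint pair."""
    if len(predictions) < 2:
        raise ValueError("need at least two ensemble members")
    mu = mean(predictions)
    sd = stdev(predictions)
    # Flag when the spread is large relative to the mean (or the mean is zero).
    needs_assay = mu == 0 or sd / abs(mu) > max_relative_sd
    return mu, sd, needs_assay


# Hypothetical CLint predictions (in the model's output units) from three members.
tight = ensemble_summary([10.0, 12.0, 14.0])
wide = ensemble_summary([1.0, 3.0, 32.0])

for label, (mu, sd, flag) in (("tight", tight), ("wide", wide)):
    print(f"{label}: {mu:.1f} ± {sd:.1f}, verify experimentally: {flag}")
```

Both hypothetical compounds have the same mean prediction, but only the second is flagged. A point-prediction-only tool would hide exactly that distinction.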
The field has made genuine progress on ADMET prediction accuracy for the canonical endpoints. The work remaining is not to improve CYP inhibition prediction — which is already good enough for practical triage. The work remaining is to build reliable, uncertainty-aware models for the endpoints that are currently causing Phase I failures, using whatever data is available — even when that data is sparse, heterogeneous, and assay-format-sensitive. That's a harder problem than improving an existing CYP3A4 model from AUROC 0.88 to 0.91.