In Silico ADME-Tox Prediction is a Regression Task
Numerical output gives discovery scientists more flexibility and utility
ADME Talks is a newsletter democratizing the cutting edge in virtual ADME-Tox, pun intended.
Introduction
In the November 2018 paper “In silico assessment of ADME properties: Advances in Caco-2 Cell Permeability Modeling,” Pham-The et al. review the progress made in the in silico Caco-2 permeability prediction literature over the previous 20+ years.
In their review, the authors observe the recent trend toward modeling Caco-2 permeability as a classification problem that outputs a High/Moderate/Low permeability category for input compounds, as opposed to a regression problem that outputs a numerical predicted permeability value. Speaking to the trend, they offer an argument in favor of predicting Caco-2 permeability as classes:
Generally, classification models offer certain advantages relative to the regression ones, especially the ability to capture the structure-permeability relationship of a wider variety of data than regression and the possibility to model data recollected from different sources. As Caco-2 permeability data show a great variability, regression models developed based on a database assembled from different laboratory become unreliable. Classification modeling is indeed more feasible in this sense as the endpoints are class labels that do not rely too much on the degree of lab-to-lab variability. Consequently, training set can be composed of thousand compounds experimented by different laboratories, and thus QSPR model’s applicability domain significantly expands. Additionally, classification models normally show better performance compared to regression ones, then these models appear to be more suitable for the high-throughput screening tasks in early stages of drug research and development. [sic]
Pham-The et al.’s argument boils down to two points:
1. The classification setting makes inherently noisy in vitro data viable for machine learning. Class labels can weather both inter-lab and intra-lab sources of noise from in vitro measurements, providing more consistency across replicate compound measurements. Consequently, by virtue of being able to incorporate more train/test data, classification models have a larger applicability domain.
2. Classification models have better performance and thus are more suitable for high-throughput screening in early discovery.
Addressing Point 1
There is merit to the principle behind the first argument. Wet lab data, especially aggregated from diverse public sources, can be messy and subject to a high degree of experimental variability. While treating ADME-Tox prediction as a classification problem doesn’t remedy the fundamental noise issue, it can make it easier to weather that noise and incorporate low-quality data points, as long as noisy entries don’t fall near the chosen class thresholds. If an assayed value is anywhere near the chosen classification threshold, however, experimental noise could result in incorrect class labeling, obviating much of this apparent advantage of the classification setting.
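To make the threshold-proximity problem concrete, here is a minimal sketch that estimates how often a noisy measurement would land on the wrong side of a cutoff. It assumes measurement noise is Gaussian in log10(Papp); the function name, the 8x10^-6 cm/s cutoff, and the 0.3 log-unit noise level are illustrative assumptions, not values taken from Pham-The et al.

```python
import math


def mislabel_probability(true_papp, threshold, sigma_log10):
    """Probability that a noisy measurement falls on the wrong side of the threshold.

    Assumes noise is Gaussian in log10(Papp) with standard deviation
    sigma_log10 (an illustrative assumption, not a measured value).
    """
    distance = abs(math.log10(true_papp) - math.log10(threshold))
    z = distance / sigma_log10
    # Standard normal CDF evaluated at -z
    return 0.5 * (1.0 + math.erf(-z / math.sqrt(2.0)))


threshold = 8e-6  # cm/s, hypothetical High/Low cutoff
for papp in (1e-6, 4e-6, 7e-6, 8e-6):
    p = mislabel_probability(papp, threshold, sigma_log10=0.3)
    print(f"true Papp = {papp:.1e} cm/s -> P(wrong label) = {p:.2f}")
```

Under these assumptions, a compound sitting exactly on the threshold is mislabeled half the time, and the risk only becomes negligible for compounds far from the cutoff.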
This first argument largely disregards the existence of data curation methods that can make large, noisy in vitro regression datasets suitable for machine learning. The ATOM consortium, for example, promotes a robust statistical method for handling replicate measurements in the curate_data.py module of their ATOM Modeling Pipeline for Drug Discovery (AMPL). In that module, they “compute a maximum likelihood estimate of the true mean value underlying the distribution of replicate assay measurements for a single compound.” This approach enables the incorporation of all available assay data into in silico regression models for ADME-Tox and, in turn, renders moot the argument for the classification setting’s superior applicability domain. Likewise, Sorkun et al. (2019) take principled steps that provide a working model for how to curate regression-ready data from a broad range of public sources. OpenBench’s opnbnchmark project offers free open-source tools for sanitizing regression datasets that implement best practices from published approaches.
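As a rough illustration of the replicate-handling idea, the sketch below collapses replicate measurements into one value per compound. It does not reproduce AMPL's curate_data.py; it covers only the simplest case, assuming Gaussian noise in log10 space and no censored values, where the sample mean is the maximum likelihood estimate of the true mean. The function name and the data are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_replicates(records):
    """Collapse replicate measurements into one value per compound.

    records: iterable of (compound_id, papp_cm_per_s) tuples.
    Returns {compound_id: (mean_log10_papp, n_replicates)}.
    Under Gaussian noise in log10 space with no censored values, the
    sample mean is the maximum likelihood estimate of the true mean.
    """
    grouped = defaultdict(list)
    for compound_id, papp in records:
        grouped[compound_id].append(math.log10(papp))
    return {
        cid: (sum(vals) / len(vals), len(vals))
        for cid, vals in grouped.items()
    }


# Hypothetical replicate data pooled from different labs
records = [
    ("cmpd_A", 1.0e-5),
    ("cmpd_A", 4.0e-5),
    ("cmpd_A", 2.0e-5),
    ("cmpd_B", 3.0e-7),
]
for cid, (mean_log, n) in aggregate_replicates(records).items():
    print(f"{cid}: log10(Papp) = {mean_log:.2f} from {n} replicate(s)")
```

Keeping the replicate count alongside the estimate lets a downstream step weight or filter compounds by how well supported each value is.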
Addressing Point 2
We firmly agree with the principle behind Pham-The et al.’s second argument: the best available in silico models should be used in early discovery decision making. However, we disagree that classification models are necessarily superior from a performance or utility perspective.
Indeed, at OpenBench, we believe there are several pertinent methodological issues lurking beneath the treatment of ADME-Tox prediction as a classification task that make it inferior to regression modeling:
Classification enforces sensitivity around inconsistent and potentially arbitrary thresholds
There is no apparent consensus on what threshold to use for each task
Classification output does not properly model in vitro assay output
Classification output limits chemists’ ability to reason with a predicted value in the context of a broader preclinical gestalt
Classification output does not give chemists the flexibility to design their own rules for high-throughput in silico screening
Classification enforces sensitivity around inconsistent and potentially arbitrary thresholds
Treating ADME-Tox prediction as a classification task enforces undue sensitivity around potentially arbitrary thresholds that demarcate classes. As an extreme example with Caco-2, if one picks a threshold of 8x10^-6 cm/s to separate High and Low classes, predicted permeability values of 7.9x10^-6 and 8.1x10^-6, while practically equivalent to a medicinal chemist, would be reported as Low and High respectively with no transparency into the underlying minor difference. In this case, the lack of nuance may motivate a course of action that is not in the best interest of the discovery program simply because of how the threshold is defined.
There is no apparent consensus on what threshold to use
Aggravating the problem of arbitrary enforcement, the research community has historically reached little consensus on what the appropriate classification threshold should be for ADME-Tox tasks. Take Caco-2 permeability as an example. Pham-The et al.'s compilation of classification models in the figure below highlights the inconsistency of threshold selection. 16x10^-6, 7.08x10^-6, 8x10^-6, 20x10^-6, and 1.1x10^-6 were variously used as thresholds between High and Moderate/Low permeability in Caco-2 classifiers developed between 2013 and 2017. The lack of consistency exacerbates the danger of predicting properties and reporting performance metrics against a single selected threshold.
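A short sketch shows what this inconsistency means in practice: the same numerical value receives different labels depending on whose threshold is applied. The five cutoffs are the ones quoted above, taken to be in cm/s like the earlier 8x10^-6 cm/s example; the predicted value and the "at or above the cutoff counts as High" convention are assumptions made for illustration.

```python
# High vs. Moderate/Low cutoffs quoted in the text, assumed to be in cm/s
THRESHOLDS = [16e-6, 7.08e-6, 8e-6, 20e-6, 1.1e-6]


def label(papp, threshold):
    """Binary label for a permeability value against one cutoff."""
    # Treating values at or above the cutoff as High is an assumed convention.
    return "High" if papp >= threshold else "Moderate/Low"


def labels_across_thresholds(papp):
    """Label the same value against every cutoff in THRESHOLDS."""
    return {t: label(papp, t) for t in THRESHOLDS}


predicted = 10e-6  # hypothetical predicted Papp in cm/s
for t, lab in labels_across_thresholds(predicted).items():
    print(f"threshold {t:.2e} cm/s -> {lab}")
```

For this hypothetical compound, three of the five cutoffs report High and two report Moderate/Low, even though the underlying prediction never changes.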
Classification output does not properly model in vitro assay output
A model that predicts permeability as High or Low against an arbitrary threshold is not an effective substitute for an in vitro ADME-Tox measurement. When a discovery chemist receives in vitro Caco-2 data from the lab, for instance, it is reported as a numerical permeability estimate with a standard deviation that can be reasoned with in the context of other ADME-Tox data points, measured physicochemical properties, and efficacy results. Offering predictions consistent with chemists’ experience of receiving numerical permeability data is especially pertinent given our next argument.
Classification output limits chemists’ ability to reason with a predicted value in the context of a broader preclinical gestalt
Borrowing from physicians’ concept of ‘clinical gestalt’—a nice definition of which is given by Cook (2009)—preclinical gestalt captures the idea that medicinal chemists globally incorporate new assay data into an organized and coherent assessment of a drug hypothesis within seconds. The concept communicates seasoned medicinal chemists’ ability to recognize patterns learned over decades of experience and to apply those patterns to discovery situations in real time.
Discovery chemists do a good job dealing with incomplete information. They rarely see the complete picture as they develop SAR hypotheses; rather, they rely on experience and what limited data points they have to guide them towards defining a new invention. This is particularly impressive given the multi-objective nature of the drug discovery problem.
A numerical prediction for ADME-Tox endpoints contributes to the preclinical gestalt in a familiar way and takes full advantage of the decision-making apparatus that medicinal chemists train over the course of their careers. There is no reason to simplify a prediction to High/Moderate/Low categories. Medicinal chemists can incorporate and take full advantage of the nuance of a numerical prediction; what’s more, they’ve built their profession on incorporating noisy numerical data in the hunt to define new drug inventions. Delivering permeability as a mere High or Low does not take into account chemists’ professional experience and competence.
Classification output does not give chemists the flexibility to design their own rules for high-throughput in silico screening
But what about cases in which discovery teams want to use in silico tools in an automated fashion? Perhaps they want to screen large compound libraries without giving manual attention to each prediction (and thus without applying any preclinical gestalt). Given that seamless automation and scale are among the stated advantages of in silico methods, wouldn’t a High or Low / yes or no class be a more readily actionable ‘decision’ for high-throughput screens than a numerical value?
This line of reasoning only holds if you’re willing to depend upon the inconsistent thresholds determined by cheminformatics and machine learning researchers to decide yes/no criteria for your screening program. What most discovery teams prefer is to design the screening rules and thresholds themselves in a way that makes sense for their program and budget. The classification setting does not allow for such flexibility in application of high-throughput in silico models; only numerical regression output can provide this advantage.
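The flexibility described above might look like the following sketch, in which a team applies its own cutoff and budget to regression output. The `Prediction` type, the `screen` function, and every number here are hypothetical; the example only shows that one set of numerical predictions can serve different program-specific rules without retraining a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    compound_id: str
    papp: float  # predicted Caco-2 permeability in cm/s


def screen(predictions, min_papp, max_compounds=None):
    """Keep compounds at or above a program-specific cutoff, ranked best-first.

    max_compounds optionally caps the result to fit a follow-up budget.
    """
    passing = sorted(
        (p for p in predictions if p.papp >= min_papp),
        key=lambda p: p.papp,
        reverse=True,
    )
    return passing if max_compounds is None else passing[:max_compounds]


# Hypothetical regression output for a small library
library = [
    Prediction("cmpd_1", 2.5e-5),
    Prediction("cmpd_2", 7.9e-6),
    Prediction("cmpd_3", 8.1e-6),
    Prediction("cmpd_4", 4.0e-7),
]

permissive = screen(library, min_papp=5e-6)                 # early, wide funnel
strict = screen(library, min_papp=1e-5, max_compounds=10)   # later, tighter rule
print([p.compound_id for p in permissive])
print([p.compound_id for p in strict])
```

Because the ranking is preserved, near-equivalent compounds such as the 7.9x10^-6 and 8.1x10^-6 predictions stay adjacent instead of being split into opposite classes.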
Conclusion
There are any number of arguments that support the treatment of in silico ADME-Tox prediction as a regression task, and ultimately they all point to the same conclusion: numerical output provides more utility and flexibility to working chemists who use in silico ADME-Tox methods to support preclinical discovery. In cases where numerical data is readily available for model training, it should be put to use.