The Road Behind, The Road Ahead

Reflections on OpenBench's first year and a Re-dedication of ADME Talks

Feb 02, 2021

ADME Talks is a newsletter democratizing the cutting edge in virtual screening, with a focus on ADME-Tox (pun intended.) In this edition, we pause our regularly scheduled programming for a word from OpenBench CEO Jim Thompson.

Were you forwarded this post? Please subscribe:

Looking Back

In late 2019, James and I started OpenBench to accelerate the adoption of machine learning (ML) to support drug discovery. We had met prior success building actionable molecular property predictions on a consulting basis and were inspired to package and scale our approach through a cost-effective software platform.

It was apparent to us then, as it is now, that in silico characterization of ADME-Tox properties—when done well—could improve the efficiency of discovery efforts and create value for small molecule programs. That said, merely understanding what it meant to do ADME-Tox prediction well was itself a challenge. A paucity of standard benchmarks for ADME tasks muddled progress. Those benchmarks that did exist were prone to bias, and success on said benchmarks did not necessarily translate to good predictions in novel chemical space.

To address these challenges, we began a no-risk trial program in the second half of last year that allowed ML-curious biotech companies to build and validate one bespoke ADME endpoint in their chemical space of interest. In the absence of meaningful benchmarks, we helped clients assess the performance of models on the compounds that mattered most: their own.

The trial program was and is a great success. Among our target market segment (Seed Series to Series B small molecule discovery companies) 100% of trial clients converted to long-term engagements.

As we started running trials and growing our client base, new insights and challenges emerged. Our belief that open ADME-Tox benchmarks would be crucial to usher in a prediction paradise grew ever more profound. In response, we founded the still nascent Small Molecules of the Month (SMOTM) benchmark series, a companion to Dennis Hu’s wonderful highlights from the medicinal chemistry literature. We’re confident that under the watchful guidance of Zach Sheldon, our Head of Chemistry, the SMOTM benchmark will blossom into a consistent and substantial contribution to in silico ADME-Tox benchmarking.

Engaging with more and more small molecule biotechs, we further realized the technical challenges facing drug discovery groups trying to implement their own virtual screening tech in-house. Even for our team of software engineers, ML engineers, and data scientists, building the tools and computational infrastructure necessary to deploy ML solutions at scale for the users of our Lab software is challenging. Among biotechs for whom software engineering isn’t a core competency, it’s nigh impossible. The difficulties we’ve identified are three-fold:

Dataset development and curation from literature is tricky.
Engineering and maintaining infrastructure to train and predict at any scale is onerous.
Discovery scientists need a graphical interface to get the most out of their models.

Dataset Development

We have found that many organizations—particularly startups—lack the experimental data needed to train quality models on the properties most relevant to their discovery efforts. Our solution is to leverage publicly available data to develop actionable models that require as few as a dozen in-house data points from the client.

As public databases like ChEMBL and PubChem continue to grow, the data resources available therein remain underutilized, especially for ADME-Tox modeling. We’ve been encouraged by our success in scraping together training datasets that are of sufficient quality to build working models for a variety of molecular property prediction tasks, ranging from in vitro ADME-Tox to in vivo animal PK.

To automate data aggregation and curation, we built a “scraping wizard” that we use internally to construct new datasets for our users. It is a basic command line tool that surfaces relevant assay data for a given search query, lets the user indicate which assays contain model-quality data, and handles unit and structure standardization. Our goal is to make it easier to unlock the value stored in the world’s vast collections of scientific literature data. Instead of combing through stacks of research papers, we envision a world where scientists can easily curate large, high-quality experimental datasets, and let ML models do the work of deriving scientific insights.

ML at Scale

Even with a quality dataset in hand, training and deploying ML models is a non-trivial endeavor. For datasets containing any more than a few thousand molecules, it is impractical to train and optimize state-of-the-art ML architectures like D-MPNNs and GCNs on personal computers. Cloud-based, accelerated-computing resources make this process feasible. Many of the predictive offerings on the market today lack native integration of cloud compute resources. For those without programming experience or who are not familiar with navigating the Linux operating system, going it alone and training a model in the cloud presents a steep learning curve.

After a model has been trained, some of the most significant technical challenges still remain, particularly for chemists who wish to screen large virtual compound libraries. Today, billions of molecules are available to order from chemical vendors like Enamine–REAL Space boasts 15.5B make-on-demand compounds. While the size of these libraries has grown by orders of magnitude in just the last five years, computational methods for triaging these libraries have not kept pace. Comprehensive virtual screening of billions of molecules is unaffordable for most drug discovery teams. Without technological innovation putting the proper set of tools in the hands of drug hunters, novel therapeutically viable chemical space will remain unvetted and unexploited.

We’re determined to make ML training and inference available to discovery scientists at the click of a button, no matter the scale that their work demands. We bake purpose-built cloud compute infrastructure into our Lab application so that our users can focus on chemistry and biology, rather than worrying about server orchestration and cloud engineering. Recently, we’ve expanded the scope of these efforts to include virtual screening services for hit identification in early stage drug discovery. By making it possible to comprehensively screen the largest virtual libraries accurately, quickly, and affordably, we hope to help medicinal chemists leverage the entirety of the chemical space that is now at their fingertips.

Scientist-Friendly Interface

The technical complexities of ML, especially those brought about by the last two years of advances in purpose-built deep learning architectures for molecular property prediction, create a barrier to entry for drug discovery scientists who want to use top-notch in silico tools but don’t have the requisite programming skills. At a minimum, using the cutting-edge open source solutions requires a facility with shell commands and a willingness to debug sometimes brittle code.

To get medicinal chemistry and DMPK decision-makers into the loop, it’s vital to develop a graphical user interface that abstracts away these technical barriers. We’ve found that even a simple interface can be powerful, and that a ChemDraw style canvas interface is key for scientists hoping to quickly iterate on SAR hypotheses in silico.

Moving Forward

As we move ahead into 2021, our specific focus on ADME-Tox property prediction will wane in favor of a more general mission: to handle all aspects of virtual screening so you can focus on discovering drugs. We will continue building the tools drug hunters need to unlock the full value of literature data resources, run virtual screening campaigns at any scale, and get top notch predictions in an easy-to-use interface.

As for the ADME Talks blog, we will keep up our SMOTM series to benchmark ADME-Tox modeling approaches on the latest medchem highlights. We will also continue to develop and post our own case studies. As usual, all supporting resources will be made available in our opnbnchmark Github repository to encourage others to reproduce our work and validate their own approaches on some of the most challenging molecular property prediction tasks. We know this open science effort will continue to mature in the new year under the sure leadership of our very own Colton Swingle, who is extending his Stanford education by one year to complete a joint BS/MS in Bioengineering and Computer Science.

We plan to invite guest authors to contribute to ADME Talks on a periodic basis. As our readership continues to grow, we want to diversify the perspectives that we offer, and lend a platform for people who are doing impactful work in the field of computational chemistry. If you are interested in contributing, please send us an email at founders@opnbnch.com—we would love to hear from you!

As always, thank you for your attention and support. We look forward to continuing to publish what we hope will be interesting and thought-provoking content in the year ahead as we strive to improve the quality and accessibility of virtual screening technology in small molecule discovery!

To try our virtual screening solution for yourself, sign up for a trial consultation at our website:

Success-Driven Molecular Discovery

Discussion about this post