Machine Learning & AI in Drug Discovery
Machine learning and artificial intelligence (AI) are the latest "new new thing" capturing interest in the pharmaceutical industry. Predicting patient responses from large data sets, whether gathered by wearable devices or collected from image-based clinical tests such as MRI scans and pathology slides, is a problem ripe for automated methods (see here, here and here). Algorithmic innovations, hardware advances providing increased compute power (GPUs and TPUs), and access to sufficiently large data sets have made such problems tractable.
To no one’s surprise, this enthusiasm is inspiring interest in applications for drug discovery. It is attractive to believe that combining AI tools with the large amounts of bioactivity data generated from high throughput screening, ‘omics technologies (genomics, metabolomics, proteomics, etc.) and other profiling methods will help solve the crisis in pharmaceutical productivity. With only 10% of clinical drug candidates making it to regulatory approval and launch, the inability to predict drug efficacy and safety is an expensive problem in search of solutions.
Applying AI to drug discovery, however, is significantly more complicated than classifying images or predicting customer behavior. Human biology is a complex system (see this classic paper by Marie Csete and John Doyle of Caltech): a layered architecture with many levels of organization and points of control, a modular design with redundant components, and subsystems that are highly dynamic, wired together by complicated feedback and feedforward loops.
Given the complexity of biological systems, developing useful predictive models of drug efficacy and safety requires incorporating external domain knowledge to provide the relevant context. We should avoid the trap of magical thinking (h/t Paul Clemons), believing that a better algorithm or a large enough data set will magically produce the right answer. Context is everything. An efficacy target in one indication is a toxicity target in another. Regulation of the NF-κB pathway in the immune system differs from that in the gut. In drug discovery, these differences can mean life or death.
There are many approaches to incorporating domain knowledge and encoding context for data mining through feature engineering and the development of feature representations. Indeed, new methods for incorporating external domain knowledge into algorithm development will be an important and growing source of future innovation. In bioinformatics, Gene Ontology (GO) annotations have been used to assign genes to pathways and to generate pathway signatures that are subsequently used for analysis. For cheminformatics problems, chemical structures are represented by molecular descriptors derived from chemical properties, such as lipophilicity and the number of rotatable bonds, which are then used to build predictive models (e.g. QSAR models); see recent reviews here and here.
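To make the cheminformatics case concrete, here is a minimal sketch of that descriptor-based QSAR workflow using RDKit and scikit-learn; the SMILES strings and activity labels are invented placeholders, not real screening data.

```python
# Minimal descriptor-based QSAR sketch (RDKit + scikit-learn).
# SMILES strings and activity labels are invented placeholders.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    """Encode a molecule as simple physicochemical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    return [
        Descriptors.MolLogP(mol),            # lipophilicity
        Descriptors.NumRotatableBonds(mol),  # flexibility
        Descriptors.MolWt(mol),              # size
        Descriptors.TPSA(mol),               # polar surface area
    ]

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
labels = [0, 1, 1, 0]  # toy active/inactive flags

X = [featurize(s) for s in smiles]
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(model.predict([featurize("CCOC(=O)c1ccccc1")]))
```

Because each descriptor maps to a chemical property a medicinal chemist recognizes, domain knowledge is baked into the representation itself rather than left for the algorithm to rediscover.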
Efforts in other fields, such as behavioral analysis, also illustrate the use of external domain knowledge to build feature representations. The Datta lab at Harvard Medical School is developing methods to represent mouse behavior (walking, running, turning, etc.) captured on video during behavioral studies as short motifs (or "syllables") that combine in sequence over time. Reducing the feature space in this way not only makes the data more computationally tractable but also more intuitive to interpret.
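As a toy illustration (not the Datta lab's actual pipeline), the sketch below collapses a hypothetical frame-by-frame behavior sequence into syllable counts and syllable-to-syllable transitions, the kind of compact, interpretable feature vector described above.

```python
# Toy behavioral-syllable featurization (illustrative only; not the
# Datta lab's actual pipeline). Per-frame labels are hypothetical.
from collections import Counter
from itertools import groupby

frames = ["walk", "walk", "walk", "turn", "run", "run", "walk", "turn", "turn"]

# Collapse consecutive repeats into syllables: walk, turn, run, walk, turn
syllables = [label for label, _ in groupby(frames)]

# Feature 1: how often each syllable occurs.
syllable_counts = Counter(syllables)

# Feature 2: which syllable follows which (bigram transitions).
transitions = Counter(zip(syllables, syllables[1:]))

print(syllable_counts)  # Counter({'walk': 2, 'turn': 2, 'run': 1})
print(transitions)      # ('walk', 'turn') occurs twice, etc.
```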
Interpretability of predictive models and algorithms is particularly important in pharmaceutical research. In drug discovery, data types are diverse and the data available for validation may be insufficient or of problematic quality (remember the reproducibility crisis?). Also, many high value problems of interest (e.g. predicting drug induced liver injury) lack gold standards (ground truth). For these reasons, predictive models and results that build understanding of the underlying biological mechanisms are preferred over those that provide only good performance metrics.
In our own work with human primary cell-based phenotypic profiling data, we have used external drug information to develop signature motifs associated with particular molecular mechanisms or clinical outcomes. Interpretability of predictive models built with these data is aided by the design of the assays themselves, which incorporate external domain knowledge in the selection of cell types, activation conditions and endpoints. If a particular feature or combination of features proves highly important in a predictive algorithm, connecting it directly to clinical knowledge aids interpretation and mechanistic understanding.
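As a sketch of that last step, the snippet below ranks the feature importances of a random-forest model whose inputs carry biologically meaningful names; the endpoint names and data are invented for illustration, not drawn from our assays.

```python
# Sketch: mapping model feature importances back to named assay
# endpoints (endpoint names and data are invented placeholders).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["IL-8 (endothelial)", "E-selectin", "VCAM-1", "PAI-1"]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))  # placeholder readouts
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)   # toy outcome

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank endpoints by importance so each can be checked against
# clinical and mechanistic knowledge.
for name, imp in sorted(zip(feature_names, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```

The point is not the model itself but that every important feature has a name a biologist can reason about.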
Although deep learning models for classifying images (cats, etc.) have succeeded without any domain knowledge, using domain-agnostic, data-driven methods, emerging efforts to bring in external knowledge (e.g. the geometric constraints incorporated in the capsule networks of Google's Geoff Hinton) are showing interesting performance gains. In digital pathology, current efforts have aimed at augmenting human-based analyses, such as assigning tumor grade from the number and location of cancer cells in a biopsy; feature vectors built from more sophisticated motifs, such as the density of CD8+ T cells at the tumor periphery, are now being created and tested to support a broader understanding of disease mechanisms.
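A toy version of such a motif feature follows: given hypothetical cell coordinates and a tumor idealized as a disk, it counts CD8+ cells within a fixed-width band straddling the boundary and normalizes by band area. The geometry and numbers are invented placeholders.

```python
# Toy "CD8+ density at the tumor periphery" feature. Invented geometry:
# the tumor is idealized as a disk of radius R centered at the origin.
import numpy as np

R = 100.0    # tumor radius (placeholder units, e.g. microns)
BAND = 20.0  # width of the peripheral band straddling the boundary

rng = np.random.default_rng(1)
cd8_xy = rng.uniform(-150, 150, size=(500, 2))  # hypothetical CD8+ centroids

# A cell is "peripheral" if its distance from the tumor center falls
# within the band [R - BAND/2, R + BAND/2] around the boundary.
dist = np.linalg.norm(cd8_xy, axis=1)
in_band = (dist >= R - BAND / 2) & (dist <= R + BAND / 2)

band_area = np.pi * ((R + BAND / 2) ** 2 - (R - BAND / 2) ** 2)
density = in_band.sum() / band_area  # cells per unit area

print(f"{in_band.sum()} peripheral CD8+ cells, density = {density:.4f}")
```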
Given the complexity of human biology and the challenges of predicting drug action in people, advancing AI and machine learning in drug discovery will require effective incorporation of domain knowledge and biological context into the design of truly useful algorithms. Getting to this level is best accomplished by embedding AI engineers and data analysts directly within discovery science teams; success will come only through their close collaboration with drug discovery scientists. This is where the magic happens.
To keep up in this area, check out Data Science Central, KDnuggets, Kaggle, the Google Research Blog and a16z, and follow @EricTopol and @AndrewYNg on Twitter (see Andrew's recent presentation on The State of Artificial Intelligence here). For deeper learning (pun intended), there are a number of interesting courses available on Coursera, Udacity and Udemy.