Deep Learning in Chemistry

Deep learning (DL) is all the rage these days and this approach to predictive modeling is being applied to a wide variety of problems, including many in computational drug discovery. As a dilettante in the area of deep learning, I’ve been following papers that have used DL for cheminformatics problems, and thought I’d mention a few that seemed interesting.

An obvious outcome of a DL model is more accurate predictions, and as a result most applications of DL in drug discovery have focused on the use of DL models as more accurate regression or classification models. Examples include Lusci et al [2013], Xu et al [2015] and Ma et al [2015]. It’s interesting to note that in these papers, while DL models show better performance, it’s not consistent and the actual increase in performance is not necessarily very large (for the effort required). Eakins [2016] has reviewed the use of DL models in QSAR settings and more recently Winkler & Le [2016] have also briefly reviewed this area.

However, simply replacing one regression method with another is not particularly interesting. Indeed, as pointed by several workers (e.g., Shao et al [2013]) input descriptors, rather than modeling method, have greater effect on predictive accuracy. And so it’s the topic of representation learning that I think DL methods become interesting and useful in the area of cheminformatics.

Several groups have published work on using DL methods to learn a representation of the molecular structure, directly from the graph representation. Duvenaud et al [2016] and Kearnes et al [2016] both have described these approaches and the nice thing is that this alleviates the need to choose and select features a priori. The downside is that the learned features are optimal in the context of the training data (thus necessitating large training sets to allow for learned features that are generalizable). Interestingly, on reading Kearnes et al [2016], the features that are learned by the DL model are conceptually similar to circular fingerprints. More interestingly, when they built predictive neural network models using the learned representation, the RMSE was not significantly different from a random forest model using circular fingerprints. Of course, the learned representation is driven by the architecture of the DL model, which was designed to look at atom neighborhoods, so it’s probably not too surprising that the optimal representations was essentially equivalent to a circular fingerprint. But one can expect that tweaking the DL architecture and going beyond the molecular graph could lead to more useful representations. Also, this paper very clearly describes the hows and whys of designing a deep neural network architecture, and is useful for someone interested in exploring further.

Another interesting development is the use of DL to learn a continuous representation of a molecular structure, that can then be modified (usually in a manner to vary some molecular property) and “decoded” to obtain a new chemical structure with the desired molecular property. This falls into the class of inverse QSAR problems and Gomez-Bombarelli et al [2016] present a nice example of this approach, where gradient descent is used to explore chemical space defined by the learned continuous representation. Unfortunately the chemistry represented by the generated structures has several problems as described by Derek Lowe. While this problem has been addressed before (e.g., Wong et al [2009] with SVM, Miyao et al [2016], Skvortsova et al [1993]), these efforts have started with pre-defined feature sets. The current works key contribution is the ability to generate a continuous chemical space and I assume the nonsensical regions of the space could be avoided using appropriate filters.

Winkler & Le [2016] recently reported a comparison of deep and shallow neural networks for QSAR regression. Their results and conclusions are similar to previous work. But more tantalizingly, they make the claim that DNN’s may be better suited to tackle the prediction of activity cliffs. There has been some work on this topic (Guha [2012] and Heikamp et al [2012]) but given that activity cliffs are essentially discontinuities in a SAR surface (either fundamentally or by choice of descriptors), traditional predictive models are unlikely to do well. Winkler & Le point to work that suggests that activity cliffs may “disappear” if an appropriately high dimensionality descriptor space is used, and conclude that learned representations via DL may be useful for this. Though I don’t discount this, I’m not convinced that simply moving to higher dimensional spaces is sufficient (or even necessary) – if it were, SVM‘s should be good at predicting activity cliffs. Rather, it’s the correct set of features, that captures the phenomenon underlying the cliff, that are necessary. Nonetheless, Winkler & Le [2016] raise some interesting questions regarding the smoothness of chemical spaces.

A Report from a Stranger in a Strange Land

I just got back from ACoP7, the yearly meeting of the International Society of Pharmacometrics (ISoP). Now, I don’t do any PK/PD modeling (hence the “strange land”) but was invited to talk about our high throughput screening platform for drug combinations. I also hoped to learn a little more about this field as well as get an idea of the state of quantitative systems pharmacology (QSP). This post is a short summary of some aspects of the meeting and the PK/PD field that caught my eye, especially as an outsider to the field (hence the “stranger”).

The practice of PK/PD is clearly quite a bit downstream in the drug development pipeline from where I work, though it can be beneficial to keep PK/PD aspects in mind even at the lead discovery/optimization stages. However I did come across a number of talks and posters that were attempting to bridge pre-clinical and clinical stages (and in some cases, even making use of in vitro) data. As a result the types of problems being considered were interesting and varied – ranging from models of feeding to predict weight loss/gain in neonates to analyzing drug exposure using mechanistic models.

A lot of PK/PD problems are addressed using model based methods, as opposed to machine learning methods (see Breiman, 2001). I have some familiarity with the types of statistics used, but in practice much of my work is better suited for machine learning approaches. However, I did come across nice examples of some methodologies that may be useful in QSAR type settings – including mixed effect models, IRT models and Bayesian methods. It was also nice to see a lot of people using R (ISoP even runs a Shiny server for members’ applications) and companies providing R solutions (e.g., Metrum, Mango) and came across a nice poster (Justin Penzenstadler, UMBC) comparing various R packages for NLME modeling. I also came across Stan, which seems like a good way to get into Bayesian modeling. Certainly worth exploring nore.

The data used in a lot of PK/PD problems is also qualitatively (and quantitatively) different from my world of HTS and virtual screening. Datasets tend to be smaller and noiser, which are challenging to model (hence less focus on purely data driven, distribution-free M/L methods). A number of presentations showed results with quite wide CI’s and significant variance in the observed properties. At the same time, models tend to be smaller in terms of features, which are usually driven by the disease state or the biology being modeled. This is in contrast to the 1000’s of descriptors we deal with in QSAR. However, even with smaller feature sets I got the impression that feature selection (aka covariate selection) is a challenge.

Finally, I was interested in learning more about QSP. Having followed this topic on and off (my initiation was this white paper), I wasn’t really up to date and was a bit confused between QSP and phsyiologically based PK (PBPK) models, and hoped this meeting would clarify things a bit. Some of the key points I was able to garner

  • QSP models could be used to model PK/PD but don’t have to. This seems to be the key distinction between QSP and PBPK approaches
  • Building a comprehensive model from scratch is daunting, and speaking to a number of presenters, it turns out many tend to reuse published models and tweak them for their specific system. (this also leads one to ask what is “useful”?)
  • Some models can be very complex – 100’s of ODE‘s and there were posters that went with such large models but also some that went with smaller simplified models. It seems that one can ask “How big a model should you go for to get accurate results?” as well as “How small a model can you get away with to get accurate results?“. Model reduction/compression seems to be an actively addressed topic
  • One of the biggest challenges for QSP models is the parametrization – which appears to be a mix of literature hunting, guesswork and some experiment. Examples where the researcher used genomic or proteomics data (e.g. Jaehee Shim, Mount Sinai) were more familiar to me, but nonetheless, daunting to someone who would like to use some of this work, but is not an expert in the field (or a grad student who doesn’t sleep). PK/PD models tend to require fewer parameters, though PBPK models are more closer to QSP approaches in terms of their parameter space.
  • Where does one find models and parameters in reusable (aka machine readable) formats? This is an open problem and approaches such as DDMoRE are addressing this with a repository and annotation specifications.
  • Much of QSP modeling is done in Matlab (and many published models are in the form of Matlab code, rather than a more general/abstract model specification). I didn’t really see alternative approaches (e.g., agent based models) to QSP models beyond the ODE approach.
  • ISoP has a QSP SIG which looks like an interesting place to hang out. They’ve put out some papers that clarify aspects of QSP (e.g., a QSP workflow) and lay out a roadmap for future activities.

So, QSP is very attractive since it has the promise of supporting mechanistic understanding of drug effects but also allowing one to capture emergent effects. However, it appears to be very problem & condition specific and it’s not clear to me how detailed I’d need to get to reach an informative model. It’s certainly not something I can pull off-the-shelf and include in my projects. But definitely worth tracking and exploring more.

Overall, it was a nice experience and quite interesting to see the current state of the art in PK/PD/QSP and learn about the challenges and successes that people are having in this area. (Also, ISoP really should make abstracts publicly linkable).

From Algorithmic Fairness to QSAR Models

The topic of algorithmic fairness has started recieving a lot of attention due to the ability of predictive models to make decisions that might discriminate against certain classes of people. The reasons for this include biased training data, correlated descriptors, black box modeling methods or a combination of all three. Research into algorithmic fairness attempts to identify these causes (whether in the data or the methods used to analyze them) and alleviate the problem. See here, here and here for some interesting discussions.

Thus I recently came across a paper from Adler et al on the topic of algorithmic fairness. Fundamentally the authors were looking at descriptor influence in binary classification models. Importantly, they treat the models as black boxes and quantify the sensitivity of the model to feature subsets without retraining the model. Clearly, this could be useful in analyzing QSAR models, where we are interested in the effect of individual descriptors on the predictive ability of the models. While there has been work on characterizing descriptor importance, all of them involve retraining the model with scrambled or randomized descriptors.

The core of Adler et al is their statement that

the information content of a feature can be estimated by trying to predict it from the remaining features.

Fundamentally, what they appear to be quantifying is the extent of multivariate correlations between subsets of features. They propose a method to “obscure the influence of a feature on an outcome” and using this, measure the difference in model prediction accuracy between the test set using the obscured variable and the original (i.e., unobscured) test set. Doing this for each feature in the dataset lets them rank the features. A key step of the process is to obscure individual features, which they term ε-obscurity. The paper presents the algorithms and also links to an implementation.

The authors test their approach on several datasets, including a QSAR-type dataset from the Dark Reactions Project. It would be interesting to compare this method, on other QSAR datasets, with simpler methods such as descriptor scrambling or resampling (from the same distribution as the descriptor) since these methods could be easily adapted to the black box assumption used by the authors.

Furthermore, given that their motivation appears to be driven by capturing multivariate correlation, one could take a feature \(X_i\) and regress all the other features \(X_j\ (j \neq i)\) on it. Repeating this for all \(X_i\) would then allow us to rank the features in terms of the RMSE of the individual regressions. Features with low RMSE would represent those that are succesfully estimated from the remaining features. This would test for (possibly non-linear) correlations within the dataset itself (which is conceptually similar to previous work from these authors) but not say anything about the model itself having learnt any such correlations. (Obviously, this works for numerical features only – but that is usually the case for QSAR models).

Finally, a question that seemed to be unanswered in the paper was, what does one do when one identifies a feature that is important (or, that can be predicted from the other features)? In the context of algorithmic fairness, such a feature could lead to discriminatory outcomes (e.g., zipcode as a proxy for race). What does one do in such a case?

Database Licensing & Sustainability

Update (07/28/16): DrugBank/OMx have updated the licensing conditions for DrugBank data in response to concerns raised earlier by various people and groups. See here for a detailed response from Craig Knox

A few days back I came across, via my Twitter network, the news that DrugBank had changed their licensing policy to CC BY-SA-NC 4.0. As such this is not a remarkable change (though one could argue about the NC clause, since as John Overington points out the distinction between commercial and non-commercial usage can be murky). However, on top of this license, the EULA listed a number of more restrictive conditions on reuse of the data. See this thread on ThinkLab for a more detailed discussion and breakdown.

This led to discussion amongst a variety of people regarding the sustainability of data resources. In this case while DrugBank was (and is) funded by federal grants, these are not guaranteed in perpetuity. And thus DrugBank, and indeed any resource, needs to have a plan to sustain itself. Charging for commercial access is one such approach. While it can be  problematic for reuse and other Open projects, one cannot fault the developers if they choose a path that enables them to continue to build upon their work.

Interestingly, the Guide to Pharmacology resource posted a response to the DrugBank license change, in which they don’t comment on the DrugBank decision but do point out that

The British Pharmacological Society (BPS) has committed support for GtoPdb until 2020 and the Wellcome Trust support for GtoImmuPdb until 2018. Needless to say the management team (between, IUPHAR, BPS and the University of Edinburgh) are engaged in sustainability planning beyond those dates. We have also just applied for UK ELIXIR Node consideration.

So it’s nice to see that the resource is completely free of any onerous restrictions until 2020. I have no doubt that the management team will be working hard to secure funding beyond that date. But in case they don’t, will their licensing also change to support some form of commercialization? Certainly, other resources are going down that path. John Overington pointed to BioCyc switching to a subscription model

 

So the sustainability of data resources is an ongoing problem, and will become a bigger issue as the links between resources grows over time. Economic considerations would suggest that permanent funding of every database  cannot happen.

So clearly, some resources will win and some will lose, and the winners will not stay winners forever.

Open source software & transferring leadership

However in contrast to databases, many Open Source software projects do continue development over pretty long time periods. Some of these projects receive public funding and also provide dual licensing options, allowing for income from industrial users.

However there are others which are not heavily funded, yet continue to develop. My favorite example is Jmol which has been in existence for more than 15 years and has remained completely Open Source. One of the key features of this project is that the leadership has passed from one individual to another over the years, starting I think with Dan Gezelter, then Bradley Smith, Egon Willighagen, Miguel Rojas and currently Bob Hanson.

Comparing Open software to Open databases is not fully correct. But this notion of leadership transition is something that could play a useful role in sustaining databases. Thus, if group X cannot raise funding for continued development, maybe group Y (that obviously benefits from the database) that has funding, could take over development and maintenance.

There are obvious reasons that this won’t work – maybe the expertise resides only in group X? I doubt this is really an issue, at least for non-niche databases. One could also argue that this approach is a sort of proto-crowdsourcing approach. While crowdsourcing did come up in the Twitter thread, I’m not convinced this is a scalable approach to sustainability. The “diffuse motivation” of a crowd is quite distinct from the “focused motivation” of a dedicated group. And on top of that, many databases are specialized and the relevant crowd is rather small.

One ultimate solution is that governments host databases in perpetuity. This raises a myriad issues. Does it imply storage and no development? Is this for all publicly funded databases? Or a subset? Who are the chosen ones? And of course, how long will the government pay for it? The NIH Commons, while not being designed for database persistence, is one such prototypical infrastructure that could start addressing these questions.

In conclusion, the issue of database sustainability is problematic and unsolved and the problem is only going to get worse. While unfortunate for Open science (and science in general) the commercialization of databases will always be a possibility. One hopes that in such cases, a balance will be struck between income and free (re)usage of these valuable resources.

SLAS 2017: Let There Be Light: Informatics Approaches to Exploring the Dark Genome

I’m organizing a symposium at the 2017 SLAS meeting in Washington D.C in the Data Analysis and Informatics track. The topic of the symposium focuses on informatics approaches that shed light and explore the dark genome. The description is given below, and I hope you’ll consider submitting an abstract.

With efforts such as the NIH-funded Illuminating the Druggable Genome (IDG) program, there is great interest and a pressing need to understand the “dark genome” — the subset of genes that have little to no information about them in the literature or databases. This session will focus on current efforts by members of the IDG program and the community in general on developing informatics resources for data aggregation and integration, target prioritization and platform development. In addition, topics such as characterization of druggability and novel approaches to connecting heterogeneous datasets that allow us to shed light on the dark genome will be considered.

The deadline is Aug 8, 2016 and you can submit an abstract here.