Learning Representations – Digits, Cats and Now Molecules

Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from YouTube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as better theoretical understanding.

It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task for which they are trained. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al. have published an extensive and excellent review of this topic, and Baldi also has a nice review on deep learning.
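
As a loose illustration of that two-stage idea, here is a minimal sketch in plain Python. It is not a deep network: the unsupervised step is a one-component PCA found by power iteration, standing in for the learned representation, and the downstream model is a nearest-centroid classifier. The toy data, function names and parameters are all made up for illustration.

```python
import random

def center(rows):
    """Subtract the per-feature mean; return the centered rows and the means."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows], means

def first_component(rows, iters=200, seed=0):
    """Power iteration on X^T X: the unit direction of greatest variance."""
    rng = random.Random(seed)
    d = len(rows[0])
    w = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iters):
        scores = [sum(r[j] * w[j] for j in range(d)) for r in rows]          # X w
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]  # X^T (X w)
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return w

def encode(row, means, w):
    """Map a raw feature vector to its one-dimensional learned code."""
    return sum((x - m) * wj for x, m, wj in zip(row, means, w))

# Toy data (hypothetical): two groups separated along the (1, 1, 0) direction.
rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = i % 2
    t = (2.0 if label else -2.0) + rng.gauss(0, 0.3)
    X.append([t + rng.gauss(0, 0.1), t + rng.gauss(0, 0.1), rng.gauss(0, 0.1)])
    y.append(label)

# Stage 1: learn the representation from the features alone (labels unused).
centered, means = center(X)
w = first_component(centered)
codes = [encode(row, means, w) for row in X]

# Stage 2: any supervised model can consume the codes; here, nearest centroid.
centroids = {c: sum(z for z, lab in zip(codes, y) if lab == c) / y.count(c)
             for c in (0, 1)}

def predict(row):
    z = encode(row, means, w)
    return min(centroids, key=lambda c: abs(z - centroids[c]))

accuracy = sum(predict(row) == lab for row, lab in zip(X, y)) / len(X)
print("training accuracy:", accuracy)
```

The point of the split is that stage 1 never sees `y`, so the same codes could in principle be handed to a different classifier or a different task.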

In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor their approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al., who develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.
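
To make the DAG idea a little more concrete, here is a rough sketch of one way an undirected molecular graph can be turned into a set of DAGs. It assumes the construction is "pick a root atom and orient every bond toward it along shortest paths, once per atom"; that is an assumption on my part, and this is not the authors' code. The function name and the toy graphs are hypothetical.

```python
from collections import deque

def orient_toward_root(adjacency, root):
    """Return directed edges (u, v), meaning u -> v, pointing toward `root`.

    `adjacency` maps each atom index to its bonded neighbours and is assumed
    to describe a single connected molecule.
    """
    # Breadth-first search gives every atom's bond distance from the root.
    dist = {root: 0}
    queue = deque([root])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    # Keep only bonds that step one unit closer to the root. Bonds between
    # atoms at equal distance (odd-membered rings) are dropped in this sketch.
    return {(u, v) for u in adjacency for v in adjacency[u]
            if dist[u] == dist[v] + 1}

# Heavy atoms of ethanol: C0 - C1 - O2 (toy example).
ethanol = {0: [1], 1: [0, 2], 2: [1]}

# One DAG per choice of root atom.
dags = {root: orient_toward_root(ethanol, root) for root in ethanol}
for root, edges in dags.items():
    print(root, sorted(edges))
```

Because there is one DAG per atom, the number of DAGs grows with the molecule, which is how a variable-size graph can be handled without first forcing it into a fixed-length vector.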

A key motivation for the new graph representation and deep learning approach was the observation that

“one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction”

A related motivation was that they wanted to apply deep learning methods directly to the molecular graph, which, in general, is of variable size, in contrast to fixed-length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:

  • The motivation for the DAG-based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that not all the physical aspects of aqueous solubility are captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time consuming (QM descriptors) for a cheminformatics setting.
  • The results are not impressive compared to the pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). R2 is usually not a good metric for non-linear regression, but even the RMSE is only 0.03 units better than that of the pre-existing model. However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive. This is evident in the case of the intrinsic solubility dataset, where they achieve results similar to the prior model, whereas the prior model relied on a manually selected set of topological descriptors.
  • It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
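
For reference, the two metrics compared in the second point can be computed as below. This is a small sketch with made-up numbers (not values from the paper). Note that "R2" is sometimes reported as the coefficient of determination and sometimes as the squared Pearson correlation; which one a given paper uses is not something this sketch can settle, so both are shown.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the units of the property itself."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def pearson_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    mo = sum(observed) / len(observed)
    mp = sum(predicted) / len(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Hypothetical log-solubility values; every prediction is off by +0.5 units.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [o + 0.5 for o in observed]

print("RMSE:      ", rmse(observed, predicted))        # 0.5
print("R^2:       ", r_squared(observed, predicted))   # about 0.8
print("Pearson r2:", pearson_r2(observed, predicted))  # about 1.0
```

A constant offset leaves the squared correlation at 1 while the RMSE still reports the 0.5 unit error, which is one reason to look at both numbers rather than a single correlation-style score.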

One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.

Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.

3 thoughts on “Learning Representations – Digits, Cats and Now Molecules”

  1. Tobias says:

    Hi Rajarshi,
    thanks for the write up, always interesting to learn something new and get some new perspectives and pointers to data and publications.

    I actually liked that Lusci, Pollastri and Baldi provided their data to reproduce their findings and also their source code of UG-RNN at AquaSol http://cdb.ics.uci.edu/

    Looking at the source code (Python and C++) it becomes quite clear that for me as a practitioner it would be hard or impossible to implement that from scratch, but I can use, change or apply it.

    With the vague description from the Merck-Kaggle team I can’t do anything, well, besides learning everything by myself, which I won’t do (so little time, so much to do).

    Thanks for the link to the Burnham data, which let me stumble upon your paper (10.1016/j.bmc.2011.05.005). I think it’s incredible how much data is available now, compared to a couple of years ago.

    Cheers
    Tobias

  2. I do not know much about “deep learning”. But it sounds to me that it basically addresses the point that the representation of your system determines how much detail your model can learn. That is nothing new.

    So, from your introduction I understand that deep learning learns a new representation? But Lusci et al. did not have the system learn a new representation, they just selected a new representation?

    Now, moving from a graph to a graph representation does not sound like a significant difference, so the lack of improvement in prediction performance does not sound unexpected. For example, a DAG needs to be adapted to support delocalization just as well.

    But what I am wondering about more is my observation that you cannot split out representation from modeling method. That is, the representation determines how easily the modeling method can learn the patterns. Not just the information content, but also how the information is represented.

  3. Egon, indeed, it’s all about representation in the end.

    The thing is that rather than just go with the input representation (say N descriptors), a deep learning system is supposed to learn an abstract form of the representation. In some ways it’s like PCA or MDS.

    Agreed about the new graph descriptor – not sure whether it was necessary for their modeling method; or they just decided to add one more descriptor to the literature :)

    On your last point, my understanding is that deep learning is (in general) independent of the modeling method that uses the learned representation. This doesn’t avoid the fact that some modeling methods might do better than others.
