Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from YouTube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as a better theoretical understanding.
It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task for which they are trained. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al have published an extensive and excellent review of this topic and Baldi also has a nice review on deep learning.
In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al where they develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.
A key motivation for the new graph representation and deep learning approach was the observation
one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction
A related motivation was that they desired to apply deep learning methods directly to the molecular graph, which in general, is of variable size compared to fixed length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:
- The motivation for the DAG based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that not all the physical aspects of aqueous solubility are captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time consuming (QM descriptors) for a cheminformatics setting.
- The results are not impressive compared to the pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). But R2 is usually not a good metric for non-linear regression. Even so, the RMSE is only 0.03 units better than that of the pre-existing model. However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive – this is evident in the case of the intrinsic solubility dataset, where they achieve similar results to the prior model. But the prior model employed a manually selected set of topological descriptors.
- It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.
Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.
While contributing to a book chapter on high content screening I came across the problem of characterizing screen quality. In a traditional assay development scenario the Z factor (or Z’) is used as one of the measures of assay performance (using the positive and negative control samples). The definition of Z’ is based on a 1-D readout, which is the case with most non-high content screens. But what happens when we have to deal with 10 or 20 readouts, which can commonly occur in a high content screen?
Assuming one has identified a small set of biologically relevant phenotypic parameters (from the tens or hundreds spit out by HCA software), it makes sense that one measure the assay performance in terms of the overall biology, rather than one specific aspect of the biology. In other words, a useful performance measure should be able to take into account multiple (preferably orthogonal) readouts. In fact, in many high content screening assays, the use of the traditional Z’ with a single readout leads to very low values suggesting a poor quality assay, when in fact, that is not the case if one were to consider the overall biology.
One approach that has been described in the literature is an extension of the Z’, termed the multivariate Z’. The approach was first described by Kummel et al, who develop an LDA model trained on the positive and negative wells. Each well is described by N phenotypic parameters and the assumption is that one has pre-selected these parameters to be meaningful and relevant. The key to using the model for a Z’ calculation is to replace the N-dimensional values for a given well by the 1-dimensional linear projection of that well:
$$P_i = \sum_{j=1}^{N} w_j x_{ij}$$
where $P_i$ is the 1-D projected value, $w_j$ is the weight for the $j$’th phenotypic parameter and $x_{ij}$ is the value of the $j$’th parameter for the $i$’th well.
The projected value is then used in the Z’ calculation as usual. Kummel et al showed that this approach leads to better (i.e., higher) Z’ values compared to the use of univariate Z’. Subsequently, Kozak & Csucs extended this approach and used a kernel method to project the N-dimensional well values in a non-linear manner. Unsurprisingly, they show a better Z’ than what would be obtained via a linear projection.
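To make the projection concrete, here is a minimal Python sketch of the univariate Z’ and an LDA-style multivariate Z’. It assumes the usual definition Z’ = 1 − 3(σ_pos + σ_neg)/|μ_pos − μ_neg| and uses a plain Fisher discriminant direction for the weights. The function names and the simulated control wells are illustrative assumptions and are not taken from Kummel et al.

```python
import numpy as np

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg| for a 1-D readout."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    spread = 3.0 * (pos.std(ddof=1) + neg.std(ddof=1))
    return 1.0 - spread / abs(pos.mean() - neg.mean())

def lda_weights(pos, neg):
    """Fisher discriminant direction, proportional to S_w^-1 (mean_pos - mean_neg)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    # Pooled within-class scatter matrix (N x N); atleast_2d covers the N = 1 case
    s_w = np.atleast_2d(np.cov(pos, rowvar=False) * (len(pos) - 1)
                        + np.cov(neg, rowvar=False) * (len(neg) - 1))
    # pinv guards against a singular scatter matrix
    return np.linalg.pinv(s_w) @ (pos.mean(axis=0) - neg.mean(axis=0))

def multivariate_z_prime(pos, neg):
    """Project each well onto the discriminant direction, then compute Z' as usual."""
    w = lda_weights(pos, neg)
    return z_prime(np.asarray(pos, dtype=float) @ w, np.asarray(neg, dtype=float) @ w)

# Simulated control wells (rows) with 5 phenotypic parameters (columns)
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, scale=1.0, size=(32, 5))
neg = rng.normal(loc=0.0, scale=1.0, size=(32, 5))
per_parameter = [z_prime(pos[:, j], neg[:, j]) for j in range(pos.shape[1])]
print("best univariate Z':", max(per_parameter))
print("multivariate Z':", multivariate_z_prime(pos, neg))
```

Since Z’ is unchanged when the projection is rescaled, only the direction of the weight vector matters here.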
And this is where I have my beef with these methods. In fact, a number of beefs:
- These methods are model based and so can suffer from over-fitting. No checks were made, and if over-fitting were to occur one would obtain a falsely optimistic Z’.
- These methods assert success when they perform better than a univariate Z’ or when a non-linear projection does better than a linear projection. But neither comparison is a true indication that they have captured the assay performance in an absolute sense. In other words, what is the “ground truth” that one should be aiming for, when developing multivariate Z’ methods? Given that the upper bound of Z’ is 1.0, one can imagine developing methods that give you increasing Z’ values – but does a method that gives Z’ close to 1 really mean a better assay? It seems that published efforts are measured relative to other implementations and not necessarily to an actual assay quality (however that is characterized).
- While the fundamental idea of separation of positive and negative control responses as a measure of assay performance is good, methods that are based on learning this separation are at risk of generating overly optimistic assessments of performance.
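The over-fitting concern in the first point can be probed empirically. The sketch below is a hypothetical check, not something done in the papers discussed: it fits the discriminant weights on half of the control wells and computes Z’ on the other half. The helper names, the 50/50 split and the simulated wells (drawn with no real difference between the two controls) are all assumptions made for illustration.

```python
import numpy as np

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg| for a 1-D readout."""
    spread = 3.0 * (np.std(pos, ddof=1) + np.std(neg, ddof=1))
    return 1.0 - spread / abs(np.mean(pos) - np.mean(neg))

def fisher_weights(pos, neg):
    """Fisher discriminant direction from the pooled within-class scatter."""
    s_w = np.atleast_2d(np.cov(pos, rowvar=False) * (len(pos) - 1)
                        + np.cov(neg, rowvar=False) * (len(neg) - 1))
    return np.linalg.pinv(s_w) @ (pos.mean(axis=0) - neg.mean(axis=0))

def fitted_and_held_out_z_prime(pos, neg, seed=0):
    """Fit weights on half of the wells; report Z' on that half and on the rest."""
    rng = np.random.default_rng(seed)
    pos = pos[rng.permutation(len(pos))]
    neg = neg[rng.permutation(len(neg))]
    half_pos, half_neg = len(pos) // 2, len(neg) // 2
    w = fisher_weights(pos[:half_pos], neg[:half_neg])
    fitted = z_prime(pos[:half_pos] @ w, neg[:half_neg] @ w)
    held_out = z_prime(pos[half_pos:] @ w, neg[half_neg:] @ w)
    return fitted, held_out

# Both "controls" come from the same distribution, so any separation is spurious
rng = np.random.default_rng(1)
pos = rng.normal(size=(32, 15))
neg = rng.normal(size=(32, 15))
fitted, held_out = fitted_and_held_out_z_prime(pos, neg)
print("Z' on wells used for fitting:", fitted)
print("Z' on held-out wells:", held_out)
```

A large gap between the two values would suggest that the reported multivariate Z’ reflects the fitting procedure more than the assay itself.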
As an example, I looked at a recent high content siRNA screen we ran that had 104 parameters associated with it. The first figure shows the Z’ calculated using each layer individually (excluding layers with abnormally low Z’)
As you can see, the highest Z’ is about 0.2. After removing those with no variation and members of correlated pairs I ended up with a set of 15 phenotypic parameters. If we compare the per-parameter distributions of the positive and negative control responses, we see very poor separation in all layers but one, as shown in the density plots below (the scales are all independent)
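The parameter filtering step mentioned above (dropping parameters with no variation and one member of each correlated pair) might be sketched as follows. The 0.9 correlation cutoff and the function name are assumptions for illustration; the actual threshold used for this screen is not stated.

```python
import numpy as np

def filter_parameters(x, corr_cutoff=0.9):
    """Return indices of columns kept after dropping constant and highly correlated ones.

    x is a wells-by-parameters matrix; corr_cutoff is a hypothetical threshold.
    """
    x = np.asarray(x, dtype=float)
    # Drop parameters that do not vary across wells
    varying = [j for j in range(x.shape[1]) if x[:, j].std() > 0]
    kept = []
    for j in varying:
        # Keep column j only if it is not strongly correlated with a column already kept
        if all(abs(np.corrcoef(x[:, j], x[:, k])[0, 1]) < corr_cutoff for k in kept):
            kept.append(j)
    return kept

# Toy matrix: column 1 is a linear function of column 0, column 2 is constant
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x = np.column_stack([a, 2 * a + 1, np.full(5, 7.0), [1.0, -1.0, 1.0, -1.0, 1.0]])
print(filter_parameters(x))
```

This greedy version keeps the first member of each correlated pair it encounters, so the result depends on column order.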
I then used these 15 parameters to build an LDA model and obtain a multivariate Z’ as described by Kummel et al. Now, the multivariate Z’ turns out to be 0.68, suggesting a well performing assay. I also performed MDS on the 15 parameter set to get lower dimensional (3D, 4D, 5D, 6D etc) datasets and performed the same calculation, leading to similar Z’ values (0.41 – 0.58)
But in fact, from the biological point of view, the assay performance was quite poor due to poor performance of the positive control (we haven’t found a good one yet). In practice then, the model based multivariate Z’ (at least as described by Kummel et al) can be misleading. One could argue that I had not chosen an appropriate set of phenotypic parameters – but I checked out a variety of other subsets (though not exhaustively) and I got similar Z’ values.
Of course, it’s easy to complain and while I haven’t worked out a rigorous alternative, the idea of describing the distance between multivariate distributions as a measure of assay performance (as opposed to learning the separation) allows us to attack the problem in a variety of ways. There is a nice discussion on StackExchange regarding this exact question. Some possibilities include
- Bhattacharyya distance
- Mahalanobis distance
- Mantel test (though this is really more a measure of correlation than a measure of effect size)
- The cross match test by Paul Rosenbaum (with a handy R package) – though this is more a measure of whether two distributions are different or not, rather than a distance between distributions
- An approach described by Loudin & Miettinen based on kernel density estimates and a 1-D Kolmogorov-Smirnov test
It might be useful to perform a more comprehensive investigation of these methods as a way to measure assay performance.
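As a rough illustration of the first two options, the sketch below computes the Mahalanobis distance between the control means (using a pooled covariance) and the Bhattacharyya distance under the assumption that each control population is approximately multivariate normal. The function names and the toy wells are hypothetical, and the Gaussian assumption may well not hold for real high content data.

```python
import numpy as np

def mahalanobis_between_groups(pos, neg):
    """Mahalanobis distance between group means, using the pooled covariance."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    pooled = np.atleast_2d(((len(pos) - 1) * np.cov(pos, rowvar=False)
                            + (len(neg) - 1) * np.cov(neg, rowvar=False))
                           / (len(pos) + len(neg) - 2))
    delta = pos.mean(axis=0) - neg.mean(axis=0)
    return float(np.sqrt(delta @ np.linalg.solve(pooled, delta)))

def bhattacharyya_gaussian(pos, neg):
    """Bhattacharyya distance between two Gaussians fitted to the samples."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    c_pos = np.atleast_2d(np.cov(pos, rowvar=False))
    c_neg = np.atleast_2d(np.cov(neg, rowvar=False))
    c_avg = 0.5 * (c_pos + c_neg)
    delta = pos.mean(axis=0) - neg.mean(axis=0)
    # Term driven by the difference in means
    mean_term = 0.125 * delta @ np.linalg.solve(c_avg, delta)
    # Term driven by the difference in covariances (log-determinants for stability)
    logdet_avg = np.linalg.slogdet(c_avg)[1]
    logdet_pos = np.linalg.slogdet(c_pos)[1]
    logdet_neg = np.linalg.slogdet(c_neg)[1]
    cov_term = 0.5 * (logdet_avg - 0.5 * (logdet_pos + logdet_neg))
    return float(mean_term + cov_term)

# Toy wells: the negative controls are the positive controls shifted by (3, 4)
pos = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
neg = pos + np.array([3.0, 4.0])
print("Mahalanobis:", mahalanobis_between_groups(pos, neg))
print("Bhattacharyya:", bhattacharyya_gaussian(pos, neg))
```

Unlike the model based Z’, neither quantity involves fitting a separating direction, though both require enough wells to estimate a covariance matrix that can be inverted.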
Another ACS National meeting is over, this time in San Diego. It was good to catch up with old friends and meet many new, interesting people. As I was there for a relatively short period, I bounced around most sessions.
MEDI and COMP had a joint session on desktop modeling and its utility in medicinal chemistry. Anthony Nicholls gave an excellent talk, where he differentiated between “strong signals” and “weak signals”, the former being extremely obvious trends, features or facts that do not require a high degree of specialized expertise to detect and the latter being those that do require significantly more expertise to identify. An example of a strong signal would be an empty region of a binding pocket that is not occupied by a ligand feature – it’s pretty easy to spot this and when highlighted the possible actions are also obvious. A weak signal could be a pi-stacking interaction which could be difficult to identify in a crowded 3D diagram. He then highlighted how simple modifications to traditional 2D depictions can be used to make the obvious more obvious and make features that might be subtle, say in 3D, more obvious in a 2D depiction. Overall, an elegant talk that focused on how simple visual cues in 2D & pseudo-3D depictions can key the mind to focus on important elements.
There were two other symposia that were of particular interest. On Sunday Shuxing Zhang and Sean Ekins organized a symposium on polypharmacology with an excellent line up of speakers including Chris Lipinski. Curt Breneman gave a nice talk that highlighted best practices in QSAR modeling and Marti Head gave a great talk on the role and value of docking in computational modeling projects.
On Tuesday, Jan Kuras and Tudor Oprea organized a session on System Chemical Biology. Though the session appeared to be more along the lines of drug repurposing, there were several interesting talks. Ebelebola May from Sandia Labs gave a very interesting talk on a system level model of small molecule inhibition of M. tuberculosis and F. tularensis, combining metabolic pathway models and cheminformatics.
John Overington gave a very interesting talk on identifying drug combinations to improve safety. Contrary to much of my reading in this area, he pointed out the value of “me-too” drugs and of taking combinations of such drugs. Given that such drugs hit the same target, he pointed out that off-targets will see reduced concentrations of the individual drugs (hopefully reducing side effects) while the on-target will see the pooled concentration (thus maintaining efficacy (?)). It’s definitely a contrasting view to the one where we identify combinations of drugs hitting different targets (which I’d guess is a tougher proposition, since identifying a truly synergistic combination requires a detailed knowledge of the underlying pathways and interactions). He also pointed out that his analyses indicated that combination dosing is not actually reduced, in contrast to the current dogma.
As before we had a CINFlash session which I think went quite well – 8 diverse speakers with a pretty good audience. The slides of the talks have been made available and we plan to have another session in Philadelphia this Fall, so consider submitting something. We also had a great Scholarships for Scientific Excellence poster session – 15 posters covering topics ranging from reaction prediction to an analysis of retractions. Excellent work, and very encouraging to see newcomers to CINF interested in getting more involved.
The only downsides to the meeting were the chilly and unsunny weather and the fact that people still think that displaying tables of numbers in a slide actually transmits any information!
The time has come to move again – though, in this case, it’s just a geographic move. From August I’ll be living in Manchester, CT (great cheeseburgers and lovely cycle routes) and will continue to work remotely for NCGC. I’ll be travelling to DC every month or so. The rest of the time I’ll be working from Connecticut.
Being new to the area, it’d be great to meet up over a beer, with people in the surrounding areas (NY/CT/RI) doing cheminformatics, predictive modeling and other life science related topics (any R user groups in the area?). If anybody’s interested, drop me a line (comment, mail or @rguha).