Support for feature,count fingerprints in fingerprint 3.5.0

I’ve just updated the fingerprint package to v3.5.0 (it should show up on CRAN shortly, or else you can get it directly from my Github repository). The main update in this version is better support for feature,count type fingerprints; examples would be ECFP or signature fingerprints. In these types of fingerprints, the output is usually a set of (integer or long) hash values, or else structural fragments, along with their counts of occurrence.

The updated package now provides an S4 class to represent features and their counts. An example of this class is

f1 <- new("feature",
          feature="[C]([N]([C]([N]([C][C,1](=[O]))=[O])[C](=[C]([C,1][N]([C,0]))[N](=[C,0]))))",
          count=as.integer(2))

The package provides getters and setters for these objects, allowing you to get or set the feature and the count.

> feature(f1)
[1] "[C]([N]([C]([N]([C][C,1](=[O]))=[O])[C](=[C]([C,1][N]([C,0]))[N](=[C,0]))))"
> count(f1)
[1] 2
> feature(f1) <- 'ABCD'
> count(f1) <- 12
> f1
ABCD:12

Using this class, feature,count fingerprints are now represented as objects of class featvec. For these fingerprints, instead of bits, one obtains a list of feature objects. For fingerprints read from files that provide the hashed version of the underlying structure (or neighborhood etc), the numeric hashes are read in as features, with a default count of 1. The distance method has also been updated to evaluate similarities for feature,count fingerprints, though currently it does not use the count in the similarity calculation.

As an example, consider a set of ECFPs available from here

> fps <- fp.read('http://pastebin.com/raw.php?i=gHjTQNKP', lf=ecfp.lf, binary=FALSE)
> fps[[1]]
Feature fingerprint
 name =  mol01
 source =  ecfp.lf
 features =  17:1 0:1 16:1 3:1 1:1 1747237384:1 1499521844:1 -1539132615:1 1294255210:1 332760439:1 -1549163031:1 1035613116:1 1618154665:1 590925877:1 1872154524:1 -1143715940:1 203677720:1 -1272768868:1 136120670:1 136597326:1 -1460348762:1 -1262922302:1 -1201618245:1 -402549409:1 -1270820019:1 929601590:1 -1597477966:1 -1274743746:1 -1155471474:1 1258428229:1 -1838187238:1 -798628285:1 -1773728142:1 -773983804:1 -453677277:1 1674451008:1 65948508:1 991735244:1 -1412946825:1 846704869:1 -2103621484:1 -886204842:1 1725648567:1 -353343892:1 -585443181:1 -533273616:1 2031084733:1 -801248129:1 1752802620:1 -976015189:1 -992213424:1 2109043264:1 -790336137:1 630139722:1 -505031736:1 -1427697183:1 -2090462286:1 -1724769936:1
> distance(fps[[1]], fps[[1]])
[1] 1
> distance(fps[[1]], fps[[2]])
[1] 0.1566265
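
Given such a list, it’s straightforward to build a full pairwise similarity matrix by applying distance() to each pair of fingerprints. A minimal sketch is shown below (the package also provides fp.sim.matrix(), which may be more convenient for larger collections):

# Pairwise similarity matrix for the feature fingerprints read in above,
# using the default similarity method of distance()
n <- length(fps)
sim <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in i:n) {
    sim[i, j] <- sim[j, i] <- distance(fps[[i]], fps[[j]])
  }
}
round(sim[1:3, 1:3], 3)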

Life and death in a screening campaign

So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize time to events, where an event could be the death of a patient or the failure of a part and so on. In these types of models the dependent variable is the number of time units that pass till the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.

It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity to “survive” till the in vivo assay? How do molecular weight, lipophilicity etc. affect a compound’s “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays are more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.

How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear, however, that descriptors derived from the assays (say ADMET-related descriptors) will not be truly independent if those assays are part of the sequence.

Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve describes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: Type I, where individuals have a higher risk of death at later times; Type II, where the risk of death is constant over time; and Type III, where individuals have a higher risk of death at earlier times.

For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details, but in general they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, but they either require that you know the error distribution or else that you are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that a Cox model does not allow us to make predictions beyond the last time point whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last time point (i.e., the last assay) but are more interested in knowing what factors might affect the survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.
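
To make this concrete, here is a minimal sketch of what such a model might look like with the survival package. The data frame, column names and descriptor values are entirely hypothetical – the “survival time” is simply the number of assays a compound was tested in, and the event indicator marks whether the compound was eventually dropped:

library(survival)

# Hypothetical compound-level data: nassays is the number of assays a compound
# was tested in; dropped is 1 if the compound was discarded ("died") and 0 if
# it was still in play at the last assay (censored)
compounds <- data.frame(
  nassays = c(1, 1, 2, 3, 1, 4, 2, 5),
  dropped = c(1, 1, 1, 1, 1, 0, 1, 0),
  mw      = c(320, 450, 280, 390, 510, 350, 410, 300),  # molecular weight
  clogp   = c(2.1, 4.5, 1.8, 3.2, 5.0, 2.7, 3.9, 1.5)   # lipophilicity
)

# Cox proportional hazards model: how do the descriptors affect the hazard of
# a compound being dropped from the screening sequence?
fit <- coxph(Surv(nassays, dropped) ~ mw + clogp, data = compounds)
summary(fit)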

OK – it sounds cute, but it has some obvious limitations:

  1. The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
  2. A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules is selected there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others, or some notion of “interesting” is combined with a hunch that it will work out. Essentially, chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that the model may capture the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
  3. Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the drop-off is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2,000 will be cherry-picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with (a rough sketch of this curve follows this list). At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.
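
To get a feel for how steep that attrition is, here is a rough sketch of the survivorship curve implied by the approximate numbers in the last point:

# Approximate compound counts at each stage of the hypothetical campaign above
stages <- c("primary", "confirmation", "secondary", "ADMET/in vivo")
n      <- c(400000, 2000, 50, 10)

# Proportion "surviving" at each time point - a very steep Type III curve
surv <- n / n[1]
plot(seq_along(stages), surv, type = "b", log = "y", xaxt = "n",
     xlab = "Assay stage", ylab = "Proportion surviving (log scale)")
axis(1, at = seq_along(stages), labels = stages)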

The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.

Learning Representations – Digits, Cats and Now Molecules

Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from Youtube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as a better theoretical understanding.

It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task in which they are used. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al have published an extensive and excellent review of this topic and Baldi also has a nice review on deep learning.

In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al where they develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.

A key motivation for the new graph representation and deep learning approach was the observation

one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction

A related motivation was that they desired to apply deep learning methods directly to the molecular graph, which, in general, is of variable size compared to fixed-length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:

  • The motivation for the DAG-based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that all the physical aspects of aqueous solubility may not be captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time-consuming (e.g., QM descriptors) for a cheminformatics setting.
  • The results are not impressive compared to the pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). But R2 is usually not a good metric for non-linear regression, and even the RMSE is only 0.03 units better than the pre-existing model (see the short metric computation after this list). However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive – this is evident in the case of the intrinsic solubility dataset, where they achieve results similar to the prior model. But the prior model employed a manually selected set of topological descriptors.
  • It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
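
For reference, the two metrics compared in the second point can be computed directly from observed and predicted values; a minimal sketch with made-up numbers:

# Toy observed and predicted log solubility values (made up for illustration)
obs  <- c(-3.2, -1.5, -4.8, -2.0, -0.7)
pred <- c(-3.0, -1.9, -4.5, -2.4, -0.9)

# RMSE: average magnitude of the prediction error, in units of the response
rmse <- sqrt(mean((obs - pred)^2))

# Squared Pearson correlation, one common way R2 is reported for predictions
r2 <- cor(obs, pred)^2

c(RMSE = rmse, R2 = r2)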

One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.

Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.

Predictive models – Implementation vs Specification

Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need of this for many years. A few years back I described a system to deploy R models, and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a. descriptors). Depending on the licenses associated with descriptor software, such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform, which looks quite nice but only supports R models (not necessarily a bad thing!). A very recent candidate for deployment of generic predictive models (amongst other things) is via plugins for the BARD platform.

But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could be implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specification and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
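
As a small illustration of the decoupling idea, here is a sketch of exporting an R model to PMML using the CRAN pmml and XML packages. The toy model and file name are made up, and the exact API differs between package versions (newer pmml releases return xml2 documents and ship their own save helper):

library(pmml)  # converts fitted R models to PMML
library(XML)   # for writing out the XML document (older pmml versions)

# A toy linear model standing in for a QSAR-style predictor
d   <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
fit <- lm(y ~ x1 + x2, data = d)

# The PMML document is the model *specification*; any PMML-aware engine
# (Java, Python, a database, ...) can evaluate it - no R required
spec <- pmml(fit)
saveXML(spec, file = "model.pmml")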

Notes & thoughts from the IU semantics workshop

Over the last two days I attended a workshop titled Exploiting Big Data Semantics for Translational Medicine, held at Indiana University, organized by David Wild, Ying Ding, Katy Borner and Eric Gifford. The stated goals were to explore advances in translational medicine via data and semantic technologies, with a view towards possible fundable ideas and funding opportunities. It was a nicely arranged workshop and pretty intense – minimal breaks, constant thinking – which is a good use of two days. As you can see from the workshop website, the attendees brought a variety of skills and outlooks to the meeting. For me this was one of the most attractive features of the workshop.

This post is a rough dump of some observations & thoughts during the workshop – I’m sure I’ve left out important comments, I provide minimal attribution, and I assume there will be a more thorough report coming out from the organizers. I also point out that I am an interested bystander to this field and somewhat of a semantic web/technology (SW/T) skeptic – so some views may be naive or just wrong. I like the ideas and concepts, and I can see their value, but I have not been convinced to invest significant time and effort into “semantifying” my day-to-day work. A major motivation for attending this workshop was to learn what the experts are doing and see how I could incorporate some of these ideas into my own work.

The Meeting

The first day started with 5-minute introductions, which were quite useful, and great overview talks by three of the attendees. Following this information dump, a major focus of the day was a discussion of opportunities and challenges. This was a very useful session, with attendees listing specific instances of challenges, opportunities, bottlenecks and so on. I was able to take some notes on the challenges, including:

  • Funding – lack of it and difficulty in obtaining it (i.e., persuading funders)
  • Cultural and social issues around semantic approaches (e.g., why change what’s already working? etc)
  • Data problems such as errors being propagated through ontologies and semantic conversion processes etc (I wonder to what extent this is a result of automated conversion processes such as D2R, versus manual errors introduced during curation. I suspect a mix of both)
  • “Hilbert Problems” – a very nice term coined by Katy to represent grand challenges or open problems that could serve as seeds around which the community could nucleate. (This aspect was of particular interest as I have found it difficult to identify compelling life science use cases that justify a retooling (even partial) of current workflows.)

The second day focused on breakout sessions, based on the opportunities and challenges listed the day before. Some notes on some of these sessions:

Bridging molecular data and clinical data – this session focused on challenges and opportunities in using molecular data together with clinical data to inform clinical decision making. Three broad opportunities came out of this, viz., advancing understanding of disease conditions, optimizing data types/measurements for clinical decision-making outcomes, and drug repurposing. Certainly very broad goals, and not particularly focused on SW/T. My impression is that SW/T can play an important role in the standardization and optimization of coding standards to more easily and robustly connect molecular and clinical data sources. But one certainly needn’t invoke SW/T to address these opportunities.

Knowledge discovery – the considerations addressed by this group included the fact that semantified data (vocabularies, ontologies etc.) is increasing in volume and availability, that tools are available to go from raw data to semantified forms, and so on. An important point was made that quality is a key consideration at multiple levels – the raw data, the semantic representation and the links between semantic entities. A challenge identified by this group was to identify use cases that SW/T can resolve and traditional technologies cannot.

RDBMS vs semantic databases – this was an interesting session that tried to address the question of when one type of database is better than the other. It seems that the consensus was that certain problems are better suited to one type over the other and that a hybrid solution is usually a sensible approach – but that goes without saying. A comment was made that certain classes of problems that involve identifying paths between terms (nodes) are better suited for semantic (graph) databases – this makes intuitive sense, but there was also a consensus that there weren’t any realistic applications that one could point to. I like the idea – keep attributes in an RDBMS but links in a graph database, and use graph queries to identify relations and entities that are then mapped back to the RDBMS. My concern here is not path traversal itself, which is easy (Neo4J does it quite efficiently) – the problem is the explosion of possible paths between nodes and the fact that the majority of them are trivial at best or nonsensical at worst. This suggests that relevance/ranking is a concern in semantic/graph databases.
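
As a toy sketch of that hybrid pattern – igraph standing in for the graph store, SQLite for the attribute tables, and an entirely made-up schema and set of entities:

library(igraph)
library(DBI)
library(RSQLite)

# Attribute side: a small relational table of entities (made-up schema)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "entity",
             data.frame(id    = c("geneA", "geneB", "cmpd1", "disease1"),
                        type  = c("gene", "gene", "compound", "disease"),
                        label = c("Gene A", "Gene B", "Compound 1", "Disease 1")))

# Link side: relationships kept as a graph
edges <- data.frame(from = c("cmpd1", "geneA", "geneB"),
                    to   = c("geneA", "geneB", "disease1"))
g <- graph_from_data_frame(edges, directed = FALSE)

# Graph query: find the entities on a path between a compound and a disease ...
path_ids <- names(shortest_paths(g, from = "cmpd1", to = "disease1")$vpath[[1]])

# ... then map those entities back to the RDBMS to pull their attributes
dbGetQuery(con, sprintf("SELECT * FROM entity WHERE id IN (%s)",
                        paste(sprintf("'%s'", path_ids), collapse = ",")))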

The session of most interest to me was that of grand challenges. I think we got to 5 or 6 major challenges:

  • How to represent knowledge (methods for, evaluation of)
  • How do changes in ontologies affect scientific research over time
  • How to construct an ontology from a set of ontologies (i.e. preexisting knowledge) that is better than the individual ones (and so links to how to evaluate an ontology in terms of “goodness”)
  • Error propagation from measurements to representation to analysis
  • Visualization of multi-dimensional / high-dimensional data – while a general challenge, I think it’s true that visual representations of semantified data (and their supporting infrastructure such as ontologies) could make the methods and tools much more accessible. It would’ve been nice if we had more discussion on this aspect.

We finally ended with a discussion of concrete projects that attendees would be interested in collaborating on, and this was quite fruitful.

My Opinion

It turns out that a good chunk of the discussion focused on translational medicine (clinical informatics, drug repurposing etc.) and the use of different data types to enable life science research, but largely independent of SW/T. Indeed, the role of SW/T seemed rather fuzzy at times – to some extent, a useful tool, but not indispensable. My impression was that much of the SW/T that was discussed really focused on labeling of knowledge via ontologies and making links between datasets and the challenges faced during these operations (which is fine and important – but does it justify funding?).

I certainly got some conflicting views of the state of the art. Comments from Amit Sheth made it appear that SW/T is well established and the main problems are solved, based on deployed applications in the “enterprise”. But comments from many of the attendees working in the life sciences suggested many problems in working with semantic data. Sure, Google has its Knowledge Graph and other search engines are employing SW/T under the hood. But if it’s so well established, where are the products, tools and workflows that an informatics-savvy non-expert in SW/T can employ? Does this mean research funding is not really required and it’s more of a productization/monetization issue? Or is this a domain-specific issue – what works for general search doesn’t necessarily work in the life sciences?

My fundamental issue is the absence of a “killer application” – an application or use case that gives a non-trivial result that could not be achieved via traditional means. (I qualify this by asking for such use cases in the life sciences. Maybe bankers have already found their killer applications.) Depending on the semantic technology one considers there are partial answers: ontologies are an example of such a use case, when used to enable linkages between datasets and sources across domains. To me this makes perfect sense (and is of particular interest and use in current projects such as BARD). But surely there must be more than designing ontologies and annotating data with ontological terms? One of the things that surprised me was that some of the future problems considered for possible collaborations were not really dependent on SW/T – in other words, they could largely be addressed via pre-existing methodologies.

My (admittedly cursory) reading of the SW/T literature suggests that a major promise of this field is “reasoning” over my data. And I’m waiting for non-trivial assertions made on the basis of linked data, ontologies and so on – assertions that really highlight where my SQL tables will fail. It’s not sufficient (to me) to say that what took me 50 lines of Python code takes you 2 lines of SPARQL – I have an investment in my RDBMS, APIs and codebase, and yes, it takes a bit more fiddling – but I can get my answer in 5 minutes because it’s already been set up.

Some points were made regarding challenges faced by SW/T, including the complexity of OWL, the difficulty of learning SPARQL, and poor query performance. Personally, I don’t see these as the real challenges, and I certainly do not make the claim that tricky SPARQL queries are preventing me from jumping into SW/T. I’m perfectly willing to wait 5 min for a SPARQL query to run if the outcome is of sufficient value. The bigger issue for me is the value of the outcomes – maybe it’s just too early for truly novel, transformative results to be produced. Or maybe it’s simply one tool amongst others that can be used to tackle a certain class of problems.

Overall, it was a worthwhile two days interacting with a group of interesting people. But definitely some fuzziness in terms of what role SW/T can, should or will play in translational life science research.