# So much to do, so little time

Trying to squeeze sense out of chemical data

## Database Licensing & Sustainability

Update (07/28/16): DrugBank/OMx have updated the licensing conditions for DrugBank data in response to concerns raised earlier by various people and groups. See here for a detailed response from Craig Knox.

A few days back I came across, via my Twitter network, the news that DrugBank had changed their licensing policy to CC BY-SA-NC 4.0. As such this is not a remarkable change (though one could argue about the NC clause, since as John Overington points out the distinction between commercial and non-commercial usage can be murky). However, on top of this license, the EULA listed a number of more restrictive conditions on reuse of the data. See this thread on ThinkLab for a more detailed discussion and breakdown.

This led to discussion amongst a variety of people regarding the sustainability of data resources. In this case, while DrugBank was (and is) funded by federal grants, these are not guaranteed in perpetuity. Thus DrugBank, and indeed any resource, needs to have a plan to sustain itself. Charging for commercial access is one such approach. While it can be problematic for reuse and other Open projects, one cannot fault the developers if they choose a path that enables them to continue to build upon their work.

Interestingly, the Guide to Pharmacology resource posted a response to the DrugBank license change, in which they don’t comment on the DrugBank decision but do point out that:

The British Pharmacological Society (BPS) has committed support for GtoPdb until 2020 and the Wellcome Trust support for GtoImmuPdb until 2018. Needless to say the management team (between, IUPHAR, BPS and the University of Edinburgh) are engaged in sustainability planning beyond those dates. We have also just applied for UK ELIXIR Node consideration.

So it’s nice to see that the resource is completely free of any onerous restrictions until 2020. I have no doubt that the management team will be working hard to secure funding beyond that date. But in case they don’t, will their licensing also change to support some form of commercialization? Certainly, other resources are going down that path. John Overington pointed to BioCyc switching to a subscription model.

So the sustainability of data resources is an ongoing problem, and will become a bigger issue as the links between resources grow over time. Economic considerations would suggest that permanent funding of every database cannot happen.

So clearly, some resources will win and some will lose, and the winners will not stay winners forever.

### Open source software & transferring leadership

However in contrast to databases, many Open Source software projects do continue development over pretty long time periods. Some of these projects receive public funding and also provide dual licensing options, allowing for income from industrial users.

However there are others which are not heavily funded, yet continue to develop. My favorite example is Jmol which has been in existence for more than 15 years and has remained completely Open Source. One of the key features of this project is that the leadership has passed from one individual to another over the years, starting I think with Dan Gezelter, then Bradley Smith, Egon Willighagen, Miguel Rojas and currently Bob Hanson.

Comparing Open software to Open databases is not fully correct. But this notion of leadership transition is something that could play a useful role in sustaining databases. Thus, if group X cannot raise funding for continued development, maybe group Y (that obviously benefits from the database) that has funding, could take over development and maintenance.

There are obvious reasons that this won’t work – maybe the expertise resides only in group X? I doubt this is really an issue, at least for non-niche databases. One could also argue that this approach is a sort of proto-crowdsourcing approach. While crowdsourcing did come up in the Twitter thread, I’m not convinced this is a scalable approach to sustainability. The “diffuse motivation” of a crowd is quite distinct from the “focused motivation” of a dedicated group. And on top of that, many databases are specialized and the relevant crowd is rather small.

One ultimate solution is that governments host databases in perpetuity. This raises myriad issues. Does it imply storage and no development? Is this for all publicly funded databases? Or a subset? Who are the chosen ones? And of course, how long will the government pay for it? The NIH Commons, while not being designed for database persistence, is one such prototypical infrastructure that could start addressing these questions.

In conclusion, the issue of database sustainability is problematic and unsolved and the problem is only going to get worse. While unfortunate for Open science (and science in general) the commercialization of databases will always be a possibility. One hopes that in such cases, a balance will be struck between income and free (re)usage of these valuable resources.

Written by Rajarshi Guha

May 14th, 2016 at 7:26 pm

## SLAS 2017: Let There Be Light: Informatics Approaches to Exploring the Dark Genome

I’m organizing a symposium at the 2017 SLAS meeting in Washington, D.C., in the Data Analysis and Informatics track. The symposium focuses on informatics approaches that shed light on and explore the dark genome. The description is given below, and I hope you’ll consider submitting an abstract.

With efforts such as the NIH-funded Illuminating the Druggable Genome (IDG) program, there is great interest and a pressing need to understand the “dark genome” — the subset of genes that have little to no information about them in the literature or databases. This session will focus on current efforts by members of the IDG program and the community in general on developing informatics resources for data aggregation and integration, target prioritization and platform development. In addition, topics such as characterization of druggability and novel approaches to connecting heterogeneous datasets that allow us to shed light on the dark genome will be considered.

The deadline is Aug 8, 2016 and you can submit an abstract here.

Written by Rajarshi Guha

May 11th, 2016 at 3:46 pm

Posted in Uncategorized

## Analysing Differential Activity in Dose Response Screens


My colleagues and I recently published a paper where we explored a few methods to identify differential behavior in dose response screens. While there is an extensive literature about analyzing differential effects in genomic data (e.g., microarrays, RNAseq), these methods are based on distributional assumptions that hold for genomic data. This is not necessarily the case for small molecule, dose response data. A separate post will explore this aspect.

So we couldn’t directly apply the methods devised for genomic data. Another issue that we wanted to address was the lack of replicates. As a result certain methods are excluded from consideration (e.g., t-test based methods). The simplest case (or what we refer to as obviously differential) is when a compound is active in one treatment and completely inactive in the other. This is trivial to characterize. The next method we considered was to look at fold changes for individual curve fit parameters and then choose an arbitrary threshold. This is not a particularly robust approach, and has no real statistical basis. However, such thresholding is still used in a number of scenarios (e.g., cherry picking in single point screens). In addition, in this approach you have to choose one of many parameters. So finally, we considered a data fusion approach, that ranked compounds using the rank product method. This method employed potency, response at the highest concentration and the AUC. The nice thing about this method is that it doesn’t require choosing a threshold, provides an empirical p-value and is flexible enough to include other relevant parameters (say, physicochemical properties).
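The rank product idea can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the code used in the paper: the function names are hypothetical, each column of `scores` stands for one parameter (say potency, response at the highest concentration and AUC) already oriented so that larger values mean "more differential", ties are broken by input order, and the empirical p-value comes from independently permuting each column.

```python
import numpy as np

def rank_product(scores):
    """Geometric mean of per-parameter ranks (rank 1 = largest value in a column)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    order = np.argsort(-scores, axis=0, kind="stable")
    ranks = np.empty((n, k))
    for j in range(k):
        # the compound at position order[i, j] receives rank i + 1 in column j
        ranks[order[:, j], j] = np.arange(1, n + 1)
    return np.exp(np.log(ranks).mean(axis=1))

def rank_product_pvalues(scores, n_perm=1000, seed=0):
    """Observed rank products and permutation-based empirical p-values."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    observed = rank_product(scores)
    null = np.empty((n_perm, n))
    for b in range(n_perm):
        # shuffling each column independently breaks any agreement between parameters
        permuted = np.column_stack([rng.permutation(scores[:, j]) for j in range(k)])
        null[b] = rank_product(permuted)
    null_sorted = np.sort(null.ravel())
    # count null rank products at least as small as each observed value
    counts = np.searchsorted(null_sorted, observed, side="right")
    return observed, (counts + 1) / (null_sorted.size + 1)
```

A small rank product means a compound sits near the top of every parameter's ranking, so no per-parameter threshold is needed. Adding another parameter (for example a physicochemical property) amounts to adding a column.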

Finally, we examined how single point data (modeled using the response at the highest concentration) compared to dose response data at identifying differential actives. As one might expect, the obviously differential compounds were easily identified. However, for compounds active in both treatments, the single point approach led to more false positives. Thus, even though dose response is more resource-intensive, the improved accuracy makes it worth it.

In the next post I’ll look at some of the issues that didn’t make it into this paper, in particular hypothesis based tests that focus on testing differences between model fits. One key observation (also suggested by Gelman) is that strict p-value cutoffs lead one to focus on obvious or well-known effects. For small-scale exploratory analyses such as described in this paper, a more relaxed threshold of 0.1 might be more suitable, allowing marginal effects that may nonetheless be biologically interesting to be considered.

Written by Rajarshi Guha

May 2nd, 2016 at 2:10 am

## Call For Papers: Shedding Light on the Dark Genome – Methods, Tools & Case Studies

252nd ACS National Meeting
CINF Division

Dear Colleagues, we are organizing a symposium at the Fall ACS meeting in Philadelphia focusing on computational, experimental and hybrid approaches to characterizing the unstudied and understudied druggable genome. In 2014 the NIH initiated a program titled “Illuminating the Druggable Genome” (IDG) with the goal of improving our understanding of the properties and functions of proteins that are currently unannotated within the four most commonly drug-targeted protein families: GPCRs, ion channels, nuclear receptors and kinases. As part of this program a Knowledge Management Center (KMC) was formed as a collaboration between six academic centers, whose goal was to develop an integrative informatics platform to collect data, develop data driven prioritization schemes and analytical methods, and disseminate standardized/annotated information related to the unannotated proteins in the four gene families of interest.

In this symposium, members of the various components of the IDG program will present the results of ongoing work related to experimental methods, target prioritization, data aggregation and platform development. In addition, we welcome contributions related to the identification of druggable targets, approaches to quantify druggability and novel approaches to integrating disparate data sources with the goal of shedding light on the “dark genome”.

The deadline for abstract submissions is March 29, 2016. All abstracts should be submitted via MAPS at http://bit.ly/1mMqLHj. If you have any questions feel free to contact Tudor or me.

Rajarshi Guha
NCATS, NIH
guhar@mail.nih.gov

Tudor Oprea
University of New Mexico
TOprea@salud.unm.edu

Written by Rajarshi Guha

February 22nd, 2016 at 4:00 pm

Posted in Uncategorized


## vSDC, Rank Products and DUD-E

This post is a follow-up to my previous discussion on a paper by Chaput et al. The gist of that paper was that in a virtual screening scenario where a small number of hits are to be selected for followup, one could use an ensemble of docking methods, identify compounds whose scores were beyond 2SD of the mean for each method and take the intersection. My post suggested that a non-parametric approach (rank products, RP) performed similarly to the parametric approach of Chaput et al on the two targets they screened.
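As I read it, the 2SD-and-intersect selection can be sketched as follows. This is not the authors' implementation: the function name is hypothetical, and I assume each method's scores have been oriented so that larger means better (many docking programs report lower-is-better scores, which would need to be negated first).

```python
import numpy as np

def two_sd_consensus(scores_by_method, n_sd=2.0):
    """Indices of compounds scoring above mean + n_sd * SD in every method."""
    selected = None
    for scores in scores_by_method.values():
        s = np.asarray(scores, dtype=float)
        # compounds beyond the per-method threshold
        hits = set(np.flatnonzero(s > s.mean() + n_sd * s.std()).tolist())
        # keep only compounds flagged by all methods seen so far
        selected = hits if selected is None else selected & hits
    return sorted(selected or [])
```

The threshold is computed separately for each method, so it depends on the shape of that method's score distribution. The rank-based alternative avoids this dependence.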

The authors also performed a benchmark comparison of their consensus method (vSDC) versus the individual docking methods for 102 DUD-E targets. I was able to obtain the individual docking scores (Glide, Surflex, FlexX and GOLD) for each of the targets, with the aim of applying the rank product method described previously.

In short, I reproduced Figure 6A (excluding the curve for vSDC). In this figure, $$n_{test}$$ is the number of compounds selected (from the ranked list, either by individual docking scores or by the rank product) and $$T_{h>0}$$ is the percentage of targets for which the $$n_{test}$$ selected compounds included one or more actives. Code is available here, but you’ll need to get in touch with the authors for the DUD-E docking scores.
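To make the definition of $$T_{h>0}$$ concrete, here is a minimal sketch (not the linked code). The function name and the input layout are assumptions: a dictionary mapping each target to its list of active/decoy labels, already sorted best-first by whichever score is being evaluated.

```python
def pct_targets_with_hit(ranked_labels_by_target, n_test):
    """Percentage of targets whose top n_test ranked compounds contain an active."""
    targets = list(ranked_labels_by_target.values())
    # a target counts once, no matter how many actives appear in its top n_test
    with_hit = sum(any(labels[:n_test]) for labels in targets)
    return 100.0 * with_hit / len(targets)
```

Evaluating this over a range of `n_test` values for each ranking scheme gives one curve per scheme.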

As shown alongside, the RP method (as expected) outperforms the individual docking methods. And visual comparison with the original figure suggests that it also outperforms vSDC, especially at lower values of $$n_{test}$$. While I wouldn’t regard the better performance of RP compared to vSDC as a huge jump, the absence of a threshold certainly works in its favor.

One could certainly explore ranking approaches in more depth. As suggested by Abhik Seal, Borda or Condorcet methods could be examined (though the small number of docking methods, i.e., voters, could be problematic).
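For reference, a Borda count over docking rankings could look like the sketch below. This is purely illustrative and was not part of the analysis above. The function name is hypothetical, and it assumes every method ranks the same set of compounds, best first.

```python
def borda_scores(rankings):
    """Sum Borda points over methods: with n compounds, position p (0-based) earns n - 1 - p."""
    n = len(rankings[0])
    totals = {compound: 0 for compound in rankings[0]}
    for ranking in rankings:
        for position, compound in enumerate(ranking):
            totals[compound] += n - 1 - position
    # highest total first; ties broken by compound id for a deterministic order
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

With only four voters, ties and near-ties in the totals are likely.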

UPDATE: After a clarification from Liliane Mouawad, it turns out there was a mistake in my ranking of the Surflex docking scores. Correcting that bug fixes my reproduction of Figure 6A so that the curves for individual docking methods match the original. But more interestingly, the performance of RP is now clearly better than every individual method and the vSDC method as well, at all values of $$n_{test}$$.

Written by Rajarshi Guha

February 13th, 2016 at 7:25 pm