So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘R’ tag

Predictive models – Implementation vs Specification

with one comment

Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has needed it for many years. A few years back I described a system to deploy R models, and more recently the OCHEM platform has attempted to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a. descriptors). Depending on the licenses associated with the descriptor software, such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform, which looks quite nice but only supports R models (not necessarily a bad thing!). A very recent candidate for deploying generic predictive models (amongst other things) is plugins for the BARD platform.

But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could be implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specifications and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
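
To make the decoupling idea concrete, here is a minimal sketch in R. It does not use PMML or QSAR-ML: the spec list, its field names, and the evaluate.spec function are all hypothetical stand-ins for a real specification format. It also assumes the descriptor values have already been computed, which is exactly the part that is hard to bundle in practice. The point is only that the model itself is plain data, so an evaluator could be written in any language that can read it.

```r
# Hypothetical model specification: plain data, no R-specific objects.
# The field names (type, intercept, coefficients) are made up for illustration
spec <- list(
  type = "linear",
  intercept = 0.5,
  coefficients = c(logP = 1.2, TPSA = -0.01)
)

# One possible implementation of the spec; an evaluator written in
# another language would only need to read the same fields
evaluate.spec <- function(spec, features) {
  stopifnot(spec$type == "linear")
  needed <- names(spec$coefficients)
  absent <- setdiff(needed, colnames(features))
  if (length(absent) > 0)
    stop("missing descriptors: ", paste(absent, collapse = ", "))
  # Select descriptor columns in the order the spec lists them
  x <- as.matrix(features[, needed, drop = FALSE])
  as.numeric(spec$intercept + x %*% spec$coefficients)
}

# Precomputed descriptor values for two imaginary compounds
features <- data.frame(logP = c(2, 3), TPSA = c(50, 100))
pred <- evaluate.spec(spec, features)
print(pred)
```

A repository of such specs would then only need to store the data part, leaving each platform free to supply its own evaluator.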

Written by Rajarshi Guha

May 1st, 2013 at 12:29 am

Visual pairwise comparison of distributions

without comments

While analysing some data from a dose response screen, run across multiple cell lines, I needed to visualize and summarize curve data in a pairwise fashion. Specifically, I wanted to compare area under the curve (AUC) values for the curve fits for the same compound between every pair of cell lines. Given that an AUC needs a proper curve fit, the number of non-NA AUCs is different for each cell line. As a result, making a scatter plot matrix (via plotmatrix) won’t do.

A more useful approach is to generate a matrix of density plots, such that each plot contains the distributions of AUCs from a pair of cell lines overlaid on each other. It turns out that some data.frame wrangling and facet_grid make this extremely easy.

Let’s start with some random data for 5 imaginary cell lines:

library(ggplot2)
library(reshape)

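# Simulate AUCs for 5 cell lines (columns X1..X5): 100 values each,
# with 20 set to NA at random to mimic failed curve fits
tmp1 <- data.frame(do.call(cbind, lapply(1:5, function(x) {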
  r <- rnorm(100, mean=sample(1:4, 1))
  r[sample(1:100, 20)] <- NA
  return(r)
})))

Next, we need to expand this into a form that lets us facet by pairs of variables

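# For every ordered pair (i, j) of cell lines, stack the two columns in long
# form: 'var' marks which member of the pair a value came from, while 'xx'
# and 'yy' name the facet row and column
tmp2 <- do.call(rbind, lapply(1:5, function(i) {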
  do.call(rbind, lapply(1:5, function(j) {
    r <- rbind(data.frame(var='D1', val=tmp1[,i]),
               data.frame(var='D2', val=tmp1[,j]))
    r <- data.frame(xx=names(tmp1)[i], yy=names(tmp1)[j], r)
    return(r)
  }))
}))

Finally, we can make the plot

ggplot(tmp2, aes(x=val, fill=var))+
  geom_density(alpha=0.2, position="identity")+
  theme(legend.position = "none")+
  facet_grid(xx ~ yy, scales='fixed')

This gives a 5 × 5 grid of panels, each overlaying the AUC densities for one pair of cell lines.

I had initially asked about this on StackOverflow, where Arun provided a more elegant approach to composing the data.frame.
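
That answer is not reproduced here. As an illustration of how the nested lapply calls could be flattened, here is one possible construction using expand.grid and mapply; it is a sketch and not necessarily the approach Arun proposed. The name combos is introduced only for this example, and the toy data is regenerated so the snippet runs on its own.

```r
set.seed(1)
# Same kind of toy data as before: 5 columns (X1..X5), 20 NAs in each
tmp1 <- data.frame(do.call(cbind, lapply(1:5, function(x) {
  r <- rnorm(100, mean = sample(1:4, 1))
  r[sample(1:100, 20)] <- NA
  r
})))

# All ordered pairs of column names (25 rows for 5 columns)
combos <- expand.grid(xx = names(tmp1), yy = names(tmp1),
                      stringsAsFactors = FALSE)

# Build one long block per pair, then stack the blocks
blocks <- mapply(function(xx, yy) {
  data.frame(xx = xx, yy = yy,
             var = rep(c("D1", "D2"), each = nrow(tmp1)),
             val = c(tmp1[[xx]], tmp1[[yy]]))
}, combos$xx, combos$yy, SIMPLIFY = FALSE, USE.NAMES = FALSE)
tmp2 <- do.call(rbind, blocks)
rownames(tmp2) <- NULL
```

The resulting tmp2 has the same columns (xx, yy, var, val) as the version built above, so the same ggplot call should work on it, although the rows come out in a different order.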

Written by Rajarshi Guha

February 10th, 2013 at 3:03 pm

Chunking lists in R

without comments

A common task for me is to run database queries on gene symbols or compound identifiers. This involves constructing an SQL query as a string and sending that off to the database. In the case of the ROracle package, the query strings are limited to 1000 (?) or so characters. This means that directly querying for a thousand identifiers won’t work, and going through the list of identifiers one at a time is inefficient. What we need in this situation is to “chunk” the list (or vector) of identifiers and work on individual chunks. With the help of the itertools package, this is very easy:

library(itertools)
n <- 1:11
chunk.size <- 3
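# ichunk yields the vector in pieces of at most chunk.size elements;
# ihasNext wraps the iterator so that hasNext() can be used in the loop
it <- ihasNext(ichunk(n, chunk.size))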
while (itertools::hasNext(it)) {
  achunk <- unlist(nextElem(it))
  print(achunk)
}
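
If itertools is not available, a similar effect can be had in base R with split. The sketch below also shows how each chunk might be turned into a query string; the identifiers and the table and column names (genes, symbol) are made up for illustration, and real code would want to escape or bind the values rather than paste them in directly.

```r
ids <- paste0("GENE", 1:11)   # made-up identifiers
chunk.size <- 3

# Give each element a chunk number (1,1,1,2,2,2,...) and split on it
chunks <- split(ids, ceiling(seq_along(ids) / chunk.size))

# Build one query string per chunk; table and column names are hypothetical
queries <- sapply(chunks, function(chunk) {
  sprintf("SELECT * FROM genes WHERE symbol IN (%s)",
          paste0("'", chunk, "'", collapse = ", "))
})
print(queries)
```

Unlike the iterator version, this builds all the chunks up front, which is fine for a few thousand identifiers but less attractive for very long lists.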

Written by Rajarshi Guha

July 5th, 2012 at 2:22 pm

Posted in software


New Versions of rcdk & rcdklibs

without comments

With the recent stable release of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the rcdk (3.1.1) and rcdklibs (1.3.11) packages that support cheminformatics in R. They’ve been pushed to CRAN and should be visible in a day or two. The new features in the latest version of rcdk include

  • Directly evaluate molecular volume (based on group contributions) using get.volume
  • Generate fingerprints using the hybridization state
  • get.total.charge and get.total.formal.charge work sensibly
  • Added a function (copy.image.to.clipboard) that copies the 2D depiction of a molecule to the system clipboard in PNG format
  • Now, OS X users can view and copy molecule depictions. This is slower compared to the same operation on Windows or Linux since it involves shell’ing out via system. But it is better than not being able to view anything.

Written by Rajarshi Guha

June 18th, 2011 at 7:41 pm

Posted in cheminformatics,software


New Version of fingerprint

with 3 comments

I’ve submitted version 3.4.3 of the fingerprint package to CRAN, so it should be available in a day or two. It’s an R package that lets you read in (chemical structure) fingerprint data from a variety of sources (CDK, MOE, BCI, etc.) and perform a variety of operations (bitwise, similarity, etc.) and visualizations on them.

The two main additions to this version are

  • Read support for the new FPS fingerprint format described by Andrew Dalke at the chemfp project. Note that it currently discards some of the header information.
  • The fingerprint class now has a field, misc, (a list) that allows one to read in extra, arbitrary data that might be provided along with a fingerprint. Exactly what gets stored in this field depends on the line function used to read in the fingerprint data. Currently only the FPS parser returns extra data (when available) in this field.

As always, you can get the package source directly from the Github repository.

Written by Rajarshi Guha

June 3rd, 2011 at 12:13 am

Posted in cheminformatics
