So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘R’ tag

Chunking lists in R

without comments

A common task for is to run database queries on gene symbols or compound identifiers. This involves constructing an SQL query as a string and sending that off to the database. In the case of the ROracle package, the query strings are limited to a 1000 (?) or so characters. This means that directly querying for a thousand identifiers won’t work. And going through the list of identifiers one at a time is inefficient. What we need in this situation is a to “chunk” the list (or vector) of identifiers and work on individual chunks. With the help of the itertools package, this is very easy:

1
2
3
4
5
6
7
8
library(itertools)
n <- 1:11
chunk.size <- 3
it <- ihasNext(ichunk(n, chunk.size))
while (itertools::hasNext(it)) {
  achunk <- unlist(nextElem(it))
  print(achunk)
}

Written by Rajarshi Guha

July 5th, 2012 at 2:22 pm

Posted in software

Tagged with , ,

New Versions of rcdk & rcdklibs

without comments

With the recent stable release of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the rcdk (3.1.1) and rcdklibs (1.3.11) packages that support cheminformatics in R. They’ve been pushed to CRAN and should be visible in a day or two. The new features in the latest version of rcdk include

  • Directly evaluate molecular volume (based on group contributions) using get.volume
  • Generate fingerprints using the hybridization state
  • get.total.charge and get.total.formal.charge work sensibly
  • Added a function (copy.image.to.clipboard) that copies the 2D depiction of a molecule to the system clipboard in PNG format
  • Now, OS X users can view and copy molecule depictions. This is slower compared to the same operation on Windows or Linux since it involves shell’ing out via system. But it is better than not being able to view anything.

Written by Rajarshi Guha

June 18th, 2011 at 7:41 pm

Posted in cheminformatics,software

Tagged with ,

New Version of fingerprint

with 3 comments

I’ve submitted version 3.4.3 of the fingerprint package to CRAN, so it should be available in a day or two. It’s an R package that lets you read in (chemical structure) fingerprint data from a variety of sources (CDK, MOE, BCI etc) and perform a variety of operations (bitwise, similarity, etc.) and visualizations on them.

The two main additions to this version are

  • Read support for the new FPS fingerprint format described by Andrew Dalke at the chemfp project. Note, it currently discards some of header information
  • The fingerprint class now has a field, misc, (a list) that allows one to read in extra, arbitrary data that might be provided along with a fingerprint. Exactly what gets stored in this field depends on the line function used to read in the fingerprint data. Currently only the FPS parser returns extra data (when available) in this field.

As always, you can get the package source directly from the Github repository.

Written by Rajarshi Guha

June 3rd, 2011 at 12:13 am

Posted in cheminformatics

Tagged with , ,

Accessing High Content Data from R

without comments

Over the last few months I’ve been getting involved in the informatics & data mining aspects of high content screening. While I haven’t gotten into image analysis itself (there’s a ton of good code and tools already out there), I’ve been focusing on managing image data and meta-data and asking interesting questions of the voluminuous, high-dimensional data that is generated by these techniques.

One of our platforms is ImageXpress from Molecular Devices, which stores images in a file-based image store and meta data and numerical image features in an Oracle database. While they do provide an API to interact with the database it’s a Windows only DLL. But since much of modeling requires I access the data from R, I needed a more flexible solution.

So, I’ve put together an R package that allows one to access numeric image data (i.e., descriptors) and images themselves. It depends on the ROracle package (which in turns requires an Oracle client installation).

Currently the functionality is relatively limited, focusing on my common tasks. Thus for example, given assay plate barcodes, we can retrieve the assay ids that the plate is associated with and then for a given assay, obtain the cell-level image parameter data (or optionally, aggregate it to well-level data). This task is easily parallelizable – in fact when processing a high content RNAi screen, I make use of snow to speed up the data access and processing of 50 plates.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(ncgchcs)
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)

## multiple analyses could be run on the same plate - we need
## to get the correct one (MX uses 'assay' to refer to an analysis run)
## so we first get details of analyses without retrieving the actual data
details <- get.assay.by.barcode(con, barcode=plate.barcode, dry=TRUE)
details <- subset(ret, PLATE_ID == plate.id & SETTINGS_NAME == assay.name)
assay.id <- details$ASSAY_ID

## finally, get the analysis data, using median to aggregate cell-level data
hcs.data <-  get.assay(con, assay.id, aggregate.func=median, verbose=FALSE, na.rm=TRUE)

Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use EBImage to perform image processing entirely in R.

1
2
3
4
5
6
library(ncgchcs)
## will want to set IMG.STORE.LOC to point to your image store
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)
get.image.path(con, plate.id, 4, 4) ## get images for all sites & wavelengths

Currently, you cannot get the internal plate id based on the user assigned plate name (which is usually different from the barcode). Also the documentation is non-existant, so you need to explore the package to learn the functions. If there’s interest I’ll put in Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface to make our imaging data accessible via the web.

Of course, this is all specific to the ImageXpress platform – we have others such as InCell and Acumen. To have a comprehensive solution for all our imaging, I’m looking at the OME infrastructure as a means of, at the very least, have a unified interface to the images and their meta data.

Written by Rajarshi Guha

May 27th, 2011 at 5:01 am

Posted in software,Uncategorized

Tagged with , , ,

Similarity Matrices in Parallel

without comments

Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections aren’t always feasible – in many cases, it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule with all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints. So again, for very large collections, this is not always practical. However, for small collections parallel evaluation can lead to speed ups.

The fingerprint package provides a method to directly get the similarity matrix for a set of fingerprints, but this is implemented in interpreted R so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply’s:

1
2
3
4
library(fingerprint)
sims <- lapply(fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

For 1012 fingerprints, this takes 286s on my Macbook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

1
2
3
4
5
6
7
8
library(fingerprint)
library(snow)
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(fingerprint))
clusterExport(cl, "fps")
sim <- parLapply(cl, fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

Written by Rajarshi Guha

December 2nd, 2010 at 1:03 am