So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘software’ Category

The CDK Volume Descriptor

with one comment

Some time back Egon implemented a simple group contribution based volume calculator, and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI, which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for some time (the NumericalSurface class did return a volume, but it’s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.
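To make the group contribution idea concrete, here is a minimal Java sketch: each atom contributes a tabulated volume and a fixed correction is subtracted per bond. The class name, the method names and every numeric value used with it are placeholders for illustration. They are not the parameters or the API of the CDK's VABCVolume class.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a lookup-based volume estimate, not the CDK's VABCVolume.
final class GroupContributionVolume {
    private final Map<String, Double> atomContribution = new HashMap<>();
    private final double bondCorrection;

    GroupContributionVolume(double bondCorrection) {
        this.bondCorrection = bondCorrection;
    }

    // Register the (placeholder) contribution for one element symbol.
    GroupContributionVolume with(String element, double volume) {
        atomContribution.put(element, volume);
        return this;
    }

    // Sum the per-atom contributions and subtract a correction per bond.
    double estimate(Map<String, Integer> atomCounts, int bondCount) {
        double total = 0.0;
        for (Map.Entry<String, Integer> e : atomCounts.entrySet()) {
            Double v = atomContribution.get(e.getKey());
            if (v == null) {
                throw new IllegalArgumentException("No contribution for " + e.getKey());
            }
            total += v * e.getValue();
        }
        return total - bondCorrection * bondCount;
    }
}
```

Since a calculation of this kind is just a table lookup and a sum, it needs no 3D coordinates, which fits with both the speed and the closer agreement with the 2D descriptor.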

Written by Rajarshi Guha

June 17th, 2011 at 11:42 pm

Posted in cheminformatics,software


Accessing High Content Data from R

without comments

Over the last few months I’ve been getting involved in the informatics & data mining aspects of high content screening. While I haven’t gotten into image analysis itself (there’s a ton of good code and tools already out there), I’ve been focusing on managing image data and meta-data and asking interesting questions of the voluminous, high-dimensional data that is generated by these techniques.

One of our platforms is ImageXpress from Molecular Devices, which stores images in a file-based image store and meta-data and numerical image features in an Oracle database. While they do provide an API to interact with the database, it’s a Windows-only DLL. But since much of my modeling requires that I access the data from R, I needed a more flexible solution.

So, I’ve put together an R package that allows one to access numeric image data (i.e., descriptors) and the images themselves. It depends on the ROracle package (which in turn requires an Oracle client installation).

Currently the functionality is relatively limited, focusing on my common tasks. Thus for example, given assay plate barcodes, we can retrieve the assay ids that the plate is associated with and then for a given assay, obtain the cell-level image parameter data (or optionally, aggregate it to well-level data). This task is easily parallelizable – in fact when processing a high content RNAi screen, I make use of snow to speed up the data access and processing of 50 plates.

con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
<- get.plates(con, plate.barcode)

## multiple analyses could be run on the same plate - we need
## to get the correct one (MX uses 'assay' to refer to an analysis run)
## so we first get details of analyses without retrieving the actual data
details <-, barcode=plate.barcode, dry=TRUE)
details <- subset(ret, PLATE_ID == & SETTINGS_NAME ==
<- details$ASSAY_ID

## finally, get the analysis data, using median to aggregate cell-level data
<- get.assay(con,, aggregate.func=median, verbose=FALSE, na.rm=TRUE)
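
The aggregation step can be illustrated outside R as well. The following Java sketch collapses cell-level values to one median per well. The class, its methods and the well labels are hypothetical and are not part of the R package or of any MetaXpress API; missing values (what na.rm=TRUE handles in R) are assumed to have been dropped beforehand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: collapse cell-level measurements to one median per well.
final class WellAggregator {
    // Median of a non-empty list; the input list is left unmodified.
    static double median(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values to aggregate");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    // Map each well label to the median of its cell-level values.
    static Map<String, Double> toWellLevel(Map<String, List<Double>> cellsByWell) {
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : cellsByWell.entrySet()) {
            out.put(e.getKey(), median(e.getValue()));
        }
        return out;
    }
}
```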

Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use EBImage to perform image processing entirely in R.

## will want to set IMG.STORE.LOC to point to your image store
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
<- get.plates(con, plate.barcode)
get.image.path(con,, 4, 4) ## get images for all sites & wavelengths

Currently, you cannot get the internal plate id based on the user-assigned plate name (which is usually different from the barcode). Also the documentation is non-existent, so you need to explore the package to learn the functions. If there’s interest I’ll put in Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface to make our imaging data accessible via the web.

Of course, this is all specific to the ImageXpress platform – we have others such as InCell and Acumen. To have a comprehensive solution for all our imaging, I’m looking at the OME infrastructure as a means of, at the very least, having a unified interface to the images and their meta-data.

Written by Rajarshi Guha

May 27th, 2011 at 5:01 am

Posted in software,Uncategorized


Call for Papers: High Content Screening: Exploring Relationships Between Small Molecules and Phenotypic Results

without comments

242nd ACS National Meeting
Denver, Aug 28 – Sept 1, 2011
CINF Division

Dear Colleagues, we are organizing an ACS symposium, focusing on the use of High Content Screening (HCS) for small molecule applications. High content screens, while resource intensive, are capable of providing a detailed view of the phenotypic effects of small molecules. Traditional reporter based screens are characterized by a one-dimensional signal. In contrast, high content screens generate rich, multi-dimensional datasets that allow for wide-ranging and in-depth analysis of various aspects of chemical biology including mechanisms of action, target identification and so on. Recent developments in high-throughput HCS pose significant challenges throughout the screening pipeline ranging from assay design and miniaturization to data management and analysis. Underlying all of this is the desire to connect chemical structure to phenotypic effects.

We invite you to submit contributions highlighting novel work and new developments in High Content Screening (HCS), High Content Analysis (HCA), and data exploration as it relates to the field of small molecules. Topics of interest include but are not limited to

  • Compound & in silico screening for drug discovery
  • Compound profiling by high content analysis
  • Chemistry & probes in imaging
  • Lead discovery strategies – one size fits all or horses for courses?
  • Application of HCA in discovering toxicology screening strategies
  • Novel data mining approaches for HCS data that link phenotypes to chemical structures
  • Software & informatics for HCS data management and integration

In addition to these topics, special consideration will be given to contributions that present in silico exploration based on HCS data. We would also like to point out that sponsorship opportunities are available. The deadline for abstract submissions is April 1, 2011. All abstracts should be submitted via PACS. If you have any questions, feel free to contact Tim or myself.

Tim Moran
+1 858 799 5609

Rajarshi Guha
NIH Chemical Genomics Center
+1 814 404 5449

Written by Rajarshi Guha

March 24th, 2011 at 12:26 pm

Posted in research,software


CDK & logP Values

with 2 comments

Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting however is the performance of different logP methods on experimental data. (Note that Mannhold et al performed a very comprehensive comparison of logP predictors. This post just focuses on the CDK).

To that end I evaluated logP values for ~10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). As can be seen from the plots, ACD performs best and the XLogP method fares quite poorly. In all cases, default settings were used. In addition the CDK has an implementation of ALogP, but it performed so poorly that I don’t list it here.

Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.

I split the dataset into a training and prediction set (60/40 split) and then threw them into a random forest model, which performs implicit feature selection and is fairly robust to overfitting. As the plot shows, the performance is significantly better than XLogP (training set R² = 0.87 and prediction set R² = 0.86). Multiple training/prediction set splits gave similar results.
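
For reference, R² values like those quoted above can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The Java sketch below shows that calculation together with a seeded shuffle split. The class and method names are illustrative, and it is an assumption that the analysis used this definition of R² rather than the squared correlation coefficient.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative helpers for evaluating a regression model.
final class ModelEvaluation {
    // R^2 = 1 - SS_res / SS_tot for observed vs. predicted values.
    static double rSquared(double[] observed, double[] predicted) {
        if (observed.length == 0 || observed.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and equal in length");
        }
        double mean = 0.0;
        for (double y : observed) {
            mean += y;
        }
        mean /= observed.length;
        double ssRes = 0.0;
        double ssTot = 0.0;
        for (int i = 0; i < observed.length; i++) {
            ssRes += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
            ssTot += (observed[i] - mean) * (observed[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    // Shuffle the indices 0..n-1 with a fixed seed; element 0 of the result
    // holds the training indices, element 1 the prediction indices.
    static List<List<Integer>> split(int n, double trainFraction, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            idx.add(i);
        }
        Collections.shuffle(idx, new Random(seed));
        int cut = (int) Math.round(n * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(idx.subList(0, cut)));
        parts.add(new ArrayList<>(idx.subList(cut, n)));
        return parts;
    }
}
```

Calling split(n, 0.6, seed) with different seeds is one way to repeat the 60/40 evaluation over multiple training/prediction splits.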

While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.

Written by Rajarshi Guha

February 3rd, 2011 at 2:53 am

Caching SMARTS Queries

with 3 comments

Andrew Dalke recently published a detailed write-up on his implementation of the PubChem fingerprints and provided a pretty thorough comparison with the CDK implementation. He pointed out a number of bugs in the CDK version, but he also noted that performance could be improved by caching parsed SMARTS queries, which are used extensively in this fingerprinter. So I wanted to see whether caching really helps.

The CDK SMARTS parser is a JJTree-based implementation and my anecdotal evidence suggested that SMARTS parsing was not a real bottleneck. But of course, nothing beats measurements. So I modified the SMARTSQueryTool class to cache parsed SMARTS queries using a simple LRU cache (patch):

final int MAX_ENTRIES = 20;
Map<String, QueryAtomContainer> cache = new LinkedHashMap<String, QueryAtomContainer>(MAX_ENTRIES + 1, .75F, true) {
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, QueryAtomContainer> eldest) {
        // evict the least recently used entry once the cache is full
        return size() > MAX_ENTRIES;
    }
};

The map is keyed on the SMARTS string. Then, when we attempt to parse a new SMARTS query, we check if it’s in the cache and if not, parse it and place it in the cache.
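
The check-then-parse step might look like the following self-contained Java sketch. It uses a generic cache with a caller-supplied parse function standing in for the CDK SMARTS parser. The class name `ParsedQueryCache` and its methods are hypothetical and do not reflect the actual patch to SMARTSQueryTool.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache for parsed queries, keyed on the query string.
final class ParsedQueryCache<Q> {
    private final Map<String, Q> cache;
    private final Function<String, Q> parser;
    private int parseCount = 0;

    ParsedQueryCache(final int maxEntries, Function<String, Q> parser) {
        this.parser = parser;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Q>(maxEntries + 1, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Q> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the cached query if present; otherwise parse it and cache the result.
    Q get(String smarts) {
        Q query = cache.get(smarts);
        if (query == null) {
            query = parser.apply(smarts);
            parseCount++;
            cache.put(smarts, query);
        }
        return query;
    }

    // Number of times the parser was actually invoked.
    int parseCount() {
        return parseCount;
    }
}
```

The parse counter is only there to make the effect of the cache observable; a real implementation would not need it.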

Now, the thing about this caching mechanism is that after 20 SMARTS queries have been cached, the least recently used one is replaced with a new one. As a result, if we perform matching with 40 unique SMARTS (in sequence) only the last 20 get cached, for a given molecule. But when we go to do it on a new molecule, the first 20 are not in the cache and hence we shouldn’t really benefit from the caching scheme. In general, if the fingerprinter (or any caller of the parser) will perform N unique SMARTS queries for a single molecule, the cache size must be at least N, for the cache to be used for subsequent molecules.
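
This eviction behaviour is easy to confirm with a small simulation. The Java sketch below cycles through a fixed number of unique keys several times against an LRU cache and counts the hits. The class and method names are made up for illustration, and plain strings stand in for SMARTS patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative simulation of an LRU cache under repeated sequential access.
final class LruThrashDemo {
    // Count cache hits when `unique` keys are requested in the same order
    // for `rounds` passes (one pass corresponds to one molecule).
    static int countHits(final int cacheSize, int unique, int rounds) {
        Map<String, Boolean> cache =
                new LinkedHashMap<String, Boolean>(cacheSize + 1, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > cacheSize;
                    }
                };
        int hits = 0;
        for (int r = 0; r < rounds; r++) {
            for (int k = 0; k < unique; k++) {
                String key = "pattern-" + k;
                if (cache.get(key) != null) {
                    hits++;
                } else {
                    cache.put(key, Boolean.TRUE);
                }
            }
        }
        return hits;
    }
}
```

With 40 unique keys and two passes, countHits(20, 40, 2) and even countHits(39, 40, 2) return 0, while countHits(40, 40, 2) returns 40: each miss evicts exactly the entry that would have been needed next, so an undersized LRU cache gives no benefit at all under this access pattern.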

I implemented a quick test harness, reading in 100 SMILES and then generating PubChem fingerprints for each molecule. The fingerprint generation was repeated 5 times and the time reported for each round. The box plot shows the distribution of the timings. Now, the CDK implementation has 621 SMARTS patterns – as you can see, we only get a benefit from the caching when the cache size is 700. In fact, cache sizes below that lead to a performance hit, which I assume is due to the time required to query the map.
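
A harness of this kind can be very small. The following Java sketch times a task over a number of rounds. The `TimingHarness` class is hypothetical, and the task passed in is a stand-in for the fingerprint generation over all molecules, not the code that was actually used.

```java
// Hypothetical timing helper: runs a task repeatedly and records each round.
final class TimingHarness {
    // Run the task `rounds` times and return the elapsed wall-clock time
    // of each round in milliseconds.
    static double[] timeRounds(Runnable task, int rounds) {
        double[] millis = new double[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        return millis;
    }
}
```

Timings taken this way on the JVM are noisy (JIT warm-up, garbage collection), which is one reason to look at the distribution over several rounds rather than a single number.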

While the performance improvement is not dramatic, it is close to 10% compared to no caching at all. In actuality, the major bottleneck in SMARTS matching is the graph isomorphism step (which we hope to drastically improve by using the SMSD code). Maybe then, SMARTS parsing will take a bigger fraction of the time. Also keep in mind that due to the heavyweight nature of CDK molecule objects, very large caches could be a strain on memory. But to check that out, I should use a profiler.

Written by Rajarshi Guha

January 23rd, 2011 at 12:56 am