So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘software’ Category

“Type-ahead” substructure searches

with one comment

The other day I was exchanging emails with John Van Drie regarding open challenges in cheminformatics (which I’ll say more about later). One of his comments concerned the slow speed of chemical searches

Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in.  Why are all chemical searches so sloooow? … Ideally, as you sketch your mol in, the searches should be happening at the same pace, like the typeahead feature.

Now, he doesn’t specifically mention what type of chemical search – it could be exact matches, similarity searches, substructure or pharmacophore searches. The first two can be done very quickly and lend themselves easily to type ahead type search interfaces. In light of the work my colleague has been doing, the substructure searches are now also amenable to a type ahead interface.

So I quickly put together a simple web page that lets you type in a SMILES (or SMARTS) and as you type it retrieves the results of a substructure search via the NCTT Search Server REST API. (In some cases the depiction is broken – that’s a bug on my side). Of course, typing in SMILES is not the most intuitive of interfaces. Since Trung employs the ChemDoodle sketcher, an ideal interface would respond to drawing events (say drawing a bond or adding atoms etc) and pull up matches on the fly. Another obvious extension is to rank (or filter) the results – all the while, maintaining the near real time speed of the application.

As I said before, seriously fast substructure searches. It also helps that I can build these examples via a public REST API. I’m sure there are reasons for SOAP, XML and so on. But it’s 2011. So lets help make extensions and mashups easier.

UPDATE: Yes, it’s easy to create patterns (especially with SMARTS) that DoS the server. We have some filters for excessively generic patterns; so some queries may not behave in the expected manner

Written by Rajarshi Guha

November 28th, 2011 at 4:07 pm

Substructure Searches – High Speed, Large Scale

with 5 comments

My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).

It is blazingly fast.

I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.

The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.

The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops of with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.

Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there were just a single query structure with that number of atoms.

Overall, very impressive. And it’s something you can play with yourself.

Written by Rajarshi Guha

November 23rd, 2011 at 1:09 am

New Versions of rcdk & rcdklibs

without comments

With the recent stable release of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the rcdk (3.1.1) and rcdklibs (1.3.11) packages that support cheminformatics in R. They’ve been pushed to CRAN and should be visible in a day or two. The new features in the latest version of rcdk include

  • Directly evaluate molecular volume (based on group contributions) using get.volume
  • Generate fingerprints using the hybridization state
  • get.total.charge and get.total.formal.charge work sensibly
  • Added a function (copy.image.to.clipboard) that copies the 2D depiction of a molecule to the system clipboard in PNG format
  • Now, OS X users can view and copy molecule depictions. This is slower compared to the same operation on Windows or Linux since it involves shell’ing out via system. But it is better than not being able to view anything.

Written by Rajarshi Guha

June 18th, 2011 at 7:41 pm

Posted in cheminformatics,software

Tagged with ,

The CDK Volume Descriptor

with one comment

Sometime back Egon implemented a simple group contribution based volume calculator and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for the some time (though the NumericalSurface class did return a volume, but it’s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.

Written by Rajarshi Guha

June 17th, 2011 at 11:42 pm

Posted in cheminformatics,software

Tagged with , , ,

Accessing High Content Data from R

without comments

Over the last few months I’ve been getting involved in the informatics & data mining aspects of high content screening. While I haven’t gotten into image analysis itself (there’s a ton of good code and tools already out there), I’ve been focusing on managing image data and meta-data and asking interesting questions of the voluminuous, high-dimensional data that is generated by these techniques.

One of our platforms is ImageXpress from Molecular Devices, which stores images in a file-based image store and meta data and numerical image features in an Oracle database. While they do provide an API to interact with the database it’s a Windows only DLL. But since much of modeling requires I access the data from R, I needed a more flexible solution.

So, I’ve put together an R package that allows one to access numeric image data (i.e., descriptors) and images themselves. It depends on the ROracle package (which in turns requires an Oracle client installation).

Currently the functionality is relatively limited, focusing on my common tasks. Thus for example, given assay plate barcodes, we can retrieve the assay ids that the plate is associated with and then for a given assay, obtain the cell-level image parameter data (or optionally, aggregate it to well-level data). This task is easily parallelizable – in fact when processing a high content RNAi screen, I make use of snow to speed up the data access and processing of 50 plates.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(ncgchcs)
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)

## multiple analyses could be run on the same plate - we need
## to get the correct one (MX uses 'assay' to refer to an analysis run)
## so we first get details of analyses without retrieving the actual data
details <- get.assay.by.barcode(con, barcode=plate.barcode, dry=TRUE)
details <- subset(ret, PLATE_ID == plate.id & SETTINGS_NAME == assay.name)
assay.id <- details$ASSAY_ID

## finally, get the analysis data, using median to aggregate cell-level data
hcs.data <-  get.assay(con, assay.id, aggregate.func=median, verbose=FALSE, na.rm=TRUE)

Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use EBImage to perform image processing entirely in R.

1
2
3
4
5
6
library(ncgchcs)
## will want to set IMG.STORE.LOC to point to your image store
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)
get.image.path(con, plate.id, 4, 4) ## get images for all sites & wavelengths

Currently, you cannot get the internal plate id based on the user assigned plate name (which is usually different from the barcode). Also the documentation is non-existant, so you need to explore the package to learn the functions. If there’s interest I’ll put in Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface to make our imaging data accessible via the web.

Of course, this is all specific to the ImageXpress platform – we have others such as InCell and Acumen. To have a comprehensive solution for all our imaging, I’m looking at the OME infrastructure as a means of, at the very least, have a unified interface to the images and their meta data.

Written by Rajarshi Guha

May 27th, 2011 at 5:01 am

Posted in software,Uncategorized

Tagged with , , ,