Accessing High Content Data from R

Over the last few months I’ve been getting involved in the informatics & data mining aspects of high content screening. While I haven’t gotten into image analysis itself (there’s a ton of good code and tools already out there), I’ve been focusing on managing image data and metadata, and on asking interesting questions of the voluminous, high-dimensional data that these techniques generate.

One of our platforms is ImageXpress from Molecular Devices, which stores images in a file-based image store, and metadata and numerical image features in an Oracle database. While they do provide an API to interact with the database, it’s a Windows-only DLL. Since much of my modeling work requires that I access the data from R, I needed a more flexible solution.

So, I’ve put together an R package that allows one to access numeric image data (i.e., descriptors) and the images themselves. It depends on the ROracle package (which in turn requires an Oracle client installation).

Currently the functionality is relatively limited, focusing on my common tasks. Thus, for example, given an assay plate barcode, we can retrieve the assay ids that the plate is associated with and then, for a given assay, obtain the cell-level image parameter data (or, optionally, aggregate it to well-level data). This task is easily parallelizable – in fact, when processing a high content RNAi screen, I use snow to speed up the data access and processing of 50 plates (see the sketch after the code below).

library(ncgchcs)
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)

## multiple analyses could be run on the same plate - we need
## to get the correct one (MX uses 'assay' to refer to an analysis run)
## so we first get details of analyses without retrieving the actual data
assay.name <- 'Cell Scoring' ## hypothetical name of the analysis settings we want
details <- get.assay.by.barcode(con, barcode=plate.barcode, dry=TRUE)
details <- subset(details, PLATE_ID == plate.id & SETTINGS_NAME == assay.name)
assay.id <- details$ASSAY_ID

## finally, get the analysis data, using median to aggregate cell-level data
hcs.data <- get.assay(con, assay.id, aggregate.func=median, verbose=FALSE, na.rm=TRUE)
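
As for the snow-based parallelization mentioned above, here’s a minimal sketch of how that can look. The extra plate barcodes and the analysis name are hypothetical, and each worker opens its own connection, since Oracle connection objects can’t be shipped to workers.

library(ncgchcs)
library(snow)

barcodes <- c('XYZ1023', 'XYZ1024', 'XYZ1025') ## hypothetical barcodes
assay.name <- 'Cell Scoring' ## hypothetical analysis (settings) name

## process one plate end to end; each worker makes its own connection
process.plate <- function(barcode) {
  con <- get.connection(user='foo', passwd='bar', sid='baz')
  plate.id <- get.plates(con, barcode)
  details <- get.assay.by.barcode(con, barcode=barcode, dry=TRUE)
  details <- subset(details, PLATE_ID == plate.id & SETTINGS_NAME == assay.name)
  get.assay(con, details$ASSAY_ID, aggregate.func=median, verbose=FALSE, na.rm=TRUE)
}

cl <- makeCluster(4, type='SOCK')
clusterEvalQ(cl, library(ncgchcs))
clusterExport(cl, 'assay.name')
plate.data <- clusterApplyLB(cl, barcodes, process.plate)
stopCluster(cl)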

Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use EBImage to perform image processing entirely in R.

library(ncgchcs)
## you'll want to set IMG.STORE.LOC to point to your image store
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)
get.image.path(con, plate.id, 4, 4) ## images for the well at row 4, column 4 (all sites & wavelengths)
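
As a rough illustration of the EBImage route, here’s a sketch that segments objects in one of the returned images. The segmentation parameters are arbitrary, and I’m assuming get.image.path returns a character vector of paths that readImage can load directly.

library(EBImage)

paths <- get.image.path(con, plate.id, 4, 4)

## read one image and segment objects via adaptive thresholding
img <- readImage(paths[1])
mask <- thresh(img, w=10, h=10, offset=0.05)
mask <- opening(mask, makeBrush(5, shape='disc')) ## remove speckle
labels <- bwlabel(mask)
cat('Objects found:', max(labels), '\n')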

Currently, you cannot get the internal plate id based on the user-assigned plate name (which is usually different from the barcode). Also, the documentation is non-existent, so you need to explore the package to learn the functions. If there’s interest, I’ll add Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface, to make our imaging data accessible via the web.

Of course, this is all specific to the ImageXpress platform – we have others such as InCell and Acumen. To get a comprehensive solution for all our imaging, I’m looking at the OME infrastructure as a means of, at the very least, having a unified interface to the images and their metadata.

MIOSS Workshop Wrap Up

The last few days I’ve been at the EBI, attending the Molecular Informatics Open Source Software (MIOSS) workshop. As part of this trip to the UK, I’ve also had the opportunity to present some of the work my colleagues and I have done at the NCTT – thanks to Mark Forster for the invitation to speak at Syngenta and to John Chambers for having me speak to the ChEMBL group. At the workshop I presented my work on cheminformatics in R.

The focus of the workshop was to bring OSS developers and users from industry and academia/government together, to hear about a variety of projects and discuss the issues underlying their development and use. There were some very nice presentations – I won’t go into too much detail, but some highlights for me included:

  • Kevin Lawson (Syngenta) presented his work on LICSS – integrating the CDK with Excel. While I’m not a fan of Excel, it’s a necessary evil. I was quite surprised at the performance he achieved for substructure searches within Excel, and at the ability to access various functionalities of the CDK as Excel functions. While it probably won’t replace Accord or ChemOffice right now, it’s something to take a look at.
  • Mike Bodkin (Lilly) spoke about the use of KNIME at Lilly. They have built up an extensive collection of commercial and OSS nodes, and it’s clear that KNIME is capable of giving Pipeline Pilot a run for its money. Thorsten Meinl then spoke of the OSS development of KNIME, and mentioned that they now support a collection of HCS and image analysis nodes (courtesy of MPI Dresden). This is quite interesting, given that we’re ramping up our HCS capabilities at the NCTT.
  • Hans de Winter of Silicos spoke about the tools and services that their company has produced on top of OpenBabel (and contributed back to the community). It’s quite encouraging to see a cheminformatics company making money off the OSS stack.
  • Greg Landrum spoke about RDKit, presenting the RDKit-based cartridge for PostgreSQL. He showed some nice performance numbers, and it was nice to see that they had gotten the coders who implemented the GiST indexing mechanisms to implement a GiST index for binary fingerprints.

In addition to these, there were other talks on OpenBabel, Cinfony, Taverna, fpocket, and more. While I’ve known about many of these projects, it was useful to learn some of the details from the developers themselves.

A number of issues surrounding OSS development and use were discussed. For example, community development was regarded as a key factor in the success of OSS projects. Erik Lindahl, of GROMACS fame, spoke about the development model of GROMACS and how much of their success has been due to community involvement. Other issues included the importance of (and frequent lack of) good documentation, what makes people contribute to OSS, and so on.

The fact that industry participants made up about 50% of the group was nice, and a number of industry-related issues also arose. For example, there were several discussions of business models based around OSS and how they can feed back into OSS projects. A common thread seemed to be that service and customization of OSS are good approaches to building businesses around the OSS stack, Silicos and Eagle Genomics being two prime examples.

The fact that there are industry users of OSS, as well as industry members contributing back to OSS projects, was very encouraging. An idea supported by a number of participants was some form of website or wiki where such contributors and users could list themselves (IMO, the Blue Obelisk wiki could be a candidate for this type of thing). Sure, there’d usually be corporate and legal barriers to this, but if done it would have a number of benefits: encouragement for project developers, and an easily viewable precedent that would encourage other companies to use or participate in OSS projects, resulting in a positive feedback loop. With various pre-competitive collaboration efforts (e.g., the Pistoia Alliance) popping up in the pharma industry, this is certainly possible.

Finally, it’s always good to meet up with old friends and also meet people whom I’ve only known over email. The social aspects of the workshop were very nice – helped greatly by excellent food and drink! Thanks to Mark for putting together a great meeting.

ICCS 2011

A few openings are left for the International Conference on Chemical Structures (ICCS)

A little less than 40 days left until the 9th International Conference on Chemical Structures (ICCS) starts in Noordwijkerhout, The Netherlands. The conference will focus on the latest scientific and technological developments in cheminformatics and related areas in six plenary sessions:

  • Cheminformatics
  • Structure-Activity and Structure-Property Prediction
  • Structure-Based Drug Design and Virtual Screening
  • Analysis of Large Chemistry Spaces
  • Integrated Chemical Information
  • Dealing with Biological Complexity

34 scientific lectures and 80 posters in two poster sessions will present applications and case studies, as well as method development and algorithmic work, in these areas. The program will open with a presentation by Engelbert Zass (ETH Zürich), who has been awarded the CSA Trust Mike Lynch Award on the occasion of the 9th ICCS. We invite you to have a look at the scientific program, which is now available at the website www.int-conf-chem-structures.org.

In addition to the scientific program, there will be a commercial exhibition with 16 leading cheminformatics software suppliers. The participation of scientists from more than 20 countries will make this a truly international event, with ample opportunities to network and discuss science.

Free workshops will be offered before and after the official conference program by BioSolveIT (www.biosolveit.de), The Chemical Computing Group (www.chemcomp.com), Tripos (www.tripos.com), and Accelrys (www.accelrys.com).

On Wednesday afternoon there is a sailing cruise on the IJsselmeer on two traditional sailing boats. They will leave from the scenic Muiderslot castle and then sail to the picturesque fishing village of Volendam, where the old village can be explored. A banquet dinner will be served on the boats on the way back.

If you are planning to attend, we encourage you to register as soon as possible through the conference web site: www.int-conf-chem-structures.org.

We are looking forward to meeting with you all in Noordwijkerhout.

Keith T Taylor, ICCS Chair
Markus Wagener, ICCS Chair

Call for Papers: High Content Screening: Exploring Relationships Between Small Molecules and Phenotypic Results

242nd ACS National Meeting
Denver, Aug 28 – Sept 1, 2011
CINF Division

Dear Colleagues, we are organizing an ACS symposium focusing on the use of High Content Screening (HCS) for small molecule applications. High content screens, while resource intensive, are capable of providing a detailed view of the phenotypic effects of small molecules. Traditional reporter-based screens are characterized by a one-dimensional signal. In contrast, high content screens generate rich, multi-dimensional datasets that allow for wide-ranging and in-depth analysis of various aspects of chemical biology, including mechanisms of action, target identification, and so on. Recent developments in high-throughput HCS pose significant challenges throughout the screening pipeline, ranging from assay design and miniaturization to data management and analysis. Underlying all of this is the desire to connect chemical structure to phenotypic effects.

We invite you to submit contributions highlighting novel work and new developments in High Content Screening (HCS), High Content Analysis (HCA), and data exploration as they relate to the field of small molecules. Topics of interest include, but are not limited to:

  • Compound & in silico screening for drug discovery
  • Compound profiling by high content analysis
  • Chemistry & probes in imaging
  • Lead discovery strategies – one size fits all or horses for courses?
  • Application of HCA in discovery toxicology screening strategies
  • Novel data mining approaches for HCS data that link phenotypes to chemical structures
  • Software & informatics for HCS data management and integration

In addition to these topics, special consideration will be given to contributions that present in silico exploration based on HCS data. We would also like to point out that sponsorship opportunities are available. The deadline for abstract submissions is April 1, 2011. All abstracts should be submitted via PACS at http://abstracts.acs.org. If you have any questions, feel free to contact Tim or myself.

Tim Moran
Accelrys
tmoran@accelrys.com
+1 858 799 5609

Rajarshi Guha
NIH Chemical Genomics Center
guhar@mail.nih.gov
+1 814 404 5449

Drug-Target Networks & Polypharmacology

I came across Takigawa et al, who address polypharmacology by investigating drug-target pairs. Their approach is to simultaneously identify substructures from the ligand and subsequences from the target, and to combine this information to suggest drug-target pairs that represent some form of polypharmacology. More specifically, their hypothesis is that “polypharmacological principles” are embedded in a special set of paired fragments (substructures on the ligand side, subsequences on the target side). When you think about it, this is a more generalized (abstract?) version of a pharmacophore, one that makes the role of the target explicit.

Their approach originates from two assumptions:

These results suggest that targets of promiscuous drugs can be dissimilar, implying that only a small part of each target is related with the principle of polypharmacology.

and

Similarly, recent research shows that smaller drugs in molecular weight are likely to be more promiscuous, suggesting that small fragments in each ligand would be a key to drug promiscuity

These lead to their hypothesis:

… that paired fragments significantly shared in drug-target pairs could be crucial factors behind polypharmacology.

Based on this idea, they first apply a frequent itemset algorithm to identify pairs of subgraphs (SG) and subsequences (SS) that occur frequently (in more than 5% of the drug-target pairs). After identifying about 10,000 such SG-SS pairs, they define a sparse fingerprint, where each bit corresponds to one such pair. Using these fingerprints they then cluster the drug-target pairs, ending up with a selection of clusters. They then propose that individual clusters represent distinct polypharmacologies.
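
To make the first step concrete, here’s a toy sketch using the arules package, where each ‘transaction’ stands in for a drug-target pair and its items are that pair’s ligand subgraphs (SG) and target subsequences (SS). The fragment labels are invented, and this only conveys the flavor of the approach, not the authors’ actual implementation.

library(arules)

## each drug-target pair is a transaction whose items are its SG & SS
## fragments (the labels below are invented for illustration)
pairs <- list(c('SG:c1ccccc1', 'SS:GKST'),
              c('SG:c1ccccc1', 'SS:GKST', 'SS:HRD'),
              c('SG:CC(=O)N', 'SS:HRD'))
trans <- as(pairs, 'transactions')

## itemsets present in at least 5% of the drug-target pairs
freq <- apriori(trans, parameter=list(supp=0.05, target='frequent itemsets'))

## keep only itemsets that pair a ligand subgraph with a target subsequence
keep <- sapply(LIST(items(freq)),
               function(x) any(grepl('^SG:', x)) && any(grepl('^SS:', x)))
inspect(freq[keep])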

Our significant substructure pairs partitioned drug-target pairs covering most of approved drugs into clusters, which were clearly separated from each other, implying that each cluster corresponds to a unique polypharmacology type

While the algorithms underlying their results are nice, a lot of things weren’t clear.

Foremost, given the above quote, it’s not exactly clear from the paper what is meant by a “unique polypharmacology type”. Given that a cluster will consist of multiple drugs and multiple targets, it is not apparent from the text whether a cluster highlights promiscuity of compounds or ligand preferences for a small number of targets. While I think this is the major issue, there are some other, lesser problems:

  • I get the impression that they consider promiscuity and polypharmacology to be equivalent concepts. While there is a degree of similarity, I’d regard polypharmacology more as a rational, controlled type of promiscuity.
  • Most fragments they highlight in Figure 2 are relatively trivial paths. Certainly, reactive groups can lead to promiscuity; but none of the subgraphs listed exhibit reactive functionality, and their application of the frequent itemset method, using a support of 5%, could easily have filtered these out.
  • Given that they consider arbitrary subsequences of the target, the resulting associations could be meaningless. It’d be interesting to note, in cases where a crystal structure is available, how many of the subsequences in the list of significant SG-SS pairs lie in or around the binding site. A related question would be: of the SG-SS pairs associated with a cluster, how are the individual subsequences distributed? A small number of unique subsequences could point towards a common binding site or active domain.
  • Related to the previous point, it’d be interesting to see in how many of the SG-SS paired fragments the members correspond to actual interacting motifs (again, based on crystal structure data).
  • One could argue that just using string subsequences to characterize the target misses information on important ligand-target interactions.

And while they may be the first to consider an analysis of drug-target pairs specifically, the idea of considering ligand and target simultaneously is not new. For example, the SIFt approach is quite similar and was described in 2004.

So, even though the paper seems pretty fuzzy on the supposed polypharmacology types that it identifies, it is overall an interesting paper (and one of the more interesting cheminformatics applications of frequent itemset methods).