# So much to do, so little time

Trying to squeeze sense out of chemical data

## HCSConnect, SOAP and images – the sordid details

Over the past few months I’ve been trying to work with the HCSConnect web service API that Cellomics provides as a means to programmatically access the Arrayscan image and data store. While the API works for getting things such as plate names, I was unable to retrieve images. After much poking and some suggestions from Rossella Rispoli (ICR), I put together a solution, which (I think) is essentially a kludge made necessary by the fact that the WSDL generated by the HCSConnect API is buggy.

The API is based on SOAP (yuk!), so the first thing to do is to autogenerate a client. Since I work with Java, I used the AXIS libraries. Importantly, the only combination that seems to work is AXIS1 with the BasicHTTPBinding (SOAP 1.1) from the API. AXIS2 with the WSBinding (SOAP 1.2) does not work – I don’t know why.

Assuming you’ve got the AXIS1 libraries somewhere, here’s the sequence of steps to get a SOAP client generated:

```shell
export AXIS_HOME=/path/to/axis
export AXIS_LIB=$AXIS_HOME/lib
export AXISCLASSPATH=$AXIS_LIB/axis.jar:$AXIS_LIB/commons-discovery-0.2.jar:$AXIS_LIB/commons-logging-1.0.4.jar:$AXIS_LIB/jaxrpc.jar:$AXIS_LIB/saaj.jar:$AXIS_LIB/log4j-1.2.8.jar:$AXIS_LIB/wsdl4j-1.5.1.jar
java -cp "$AXISCLASSPATH" org.apache.axis.wsdl.WSDL2Java -p gov.nih.ncgc.arrayscan -B http://api.host.name:2020/?wsdl
mkdir arrayscan
mv build.xml gov arrayscan
cd arrayscan
ant
mv "?wsdl.jar" HCSConnect-client-axis1.jar
```

You’ll probably want to change the package name. Also note the use of port 2020 – this corresponds to the HTTP service. The resultant JAR file is what you should add to your project as a dependency. This library lets you connect to the API and start pulling data, but it will fail when you try to retrieve an image. This is because the SOAP response that is returned when an image is requested provides the binary image data as an attachment, and the autogenerated client code is unable to handle this (suggesting an issue in the WSDL specification for that method). To get around this, I needed to intercept the SOAP response and extract the image bytes myself. This can be achieved by creating an implementation of GenericHandler that only deals with the response from the GetImage method of the HCSConnect API. The code below does this: when it sees such a response, it extracts the image bytes and stores them in a static variable.
```java
import javax.xml.namespace.QName;
import javax.xml.rpc.JAXRPCException;
import javax.xml.rpc.handler.GenericHandler;
import javax.xml.rpc.handler.HandlerInfo;
import javax.xml.rpc.handler.MessageContext;
import javax.xml.soap.AttachmentPart;
import javax.xml.soap.SOAPMessage;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Iterator;

public class GetImageMessageHandler extends GenericHandler {

    HandlerInfo headerInfo;
    public static byte[] imageBytes;

    public void init(HandlerInfo info) {
        headerInfo = info;
    }

    public QName[] getHeaders() {
        return headerInfo.getHeaders();
    }

    public boolean handleResponse(MessageContext context) {
        try {
            // get the SOAP message and the name of the operation it responds to
            SOAPMessage message = ((org.apache.axis.MessageContext) context).getMessage();
            String opName = ((org.apache.axis.MessageContext) context).getOperation().getName();
            if (!opName.equals("GetImage")) return true;

            // a GetImage response should carry exactly one attachment
            int nattach = 0;
            Iterator iter = message.getAttachments();
            while (iter.hasNext()) {
                iter.next();
                nattach++;
            }
            if (nattach != 1)
                throw new RuntimeException("If operation is GetImage, we must have 1 attachment");

            // dump the SOAP part of the reply to stdout
            TransformerFactory tff = TransformerFactory.newInstance();
            Transformer tf = tff.newTransformer();
            Source sc = message.getSOAPPart().getContent();
            StreamResult result = new StreamResult(System.out);
            tf.transform(sc, result);
            System.out.println();

            // read the attachment bytes into the static variable
            iter = message.getAttachments();
            while (iter.hasNext()) {
                AttachmentPart att = (AttachmentPart) iter.next();
                BufferedInputStream bis = new BufferedInputStream(att.getDataHandler().getInputStream());
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                int c;
                while ((c = bis.read()) != -1) baos.write(c);
                bis.close();
                imageBytes = baos.toByteArray();
            }
        } catch (Exception e) {
            throw new JAXRPCException(e);
        }
        return true;
    }

    public boolean handleRequest(MessageContext context) {
        // return true to continue message processing
        return true;
    }
}
```

With this in hand, code that wants to retrieve an image can register the handler as follows:

```java
HCSConnectLocator csl = new HCSConnectLocator();
HandlerRegistry hr = csl.getHandlerRegistry();
List hc = hr.getHandlerChain(new QName("http://schemas.datacontract.org/2004/07/Thermo.Connect", "HTTPHCSConnect"));
HandlerInfo hi = new HandlerInfo();
hi.setHandlerClass(GetImageMessageHandler.class);
hc.add(hi);
```

and then, instead of doing

```java
Image image = csl.getHTTPHCSConnect().getImage(idr);
```

you’d do

```java
try {
    csl.getHTTPHCSConnect().getImage(idr);
} catch (AxisFault e) {
    FileOutputStream imageOutFile = new FileOutputStream("img.jpg");
    imageOutFile.write(GetImageMessageHandler.imageBytes);
    imageOutFile.close();
}
```

This is somewhat inelegant, as the use of the static variable in GetImageMessageHandler means that we can’t use this class in a multi-threaded environment. However, it appears that the AXIS API instantiates the handler class itself, rather than accepting an instance, so I don’t see an easy way around this.

Written by Rajarshi Guha

February 13th, 2013 at 11:30 pm

Posted in software

## High Content Screens and Multivariate Z’

While contributing to a book chapter on high content screening I came across the problem of characterizing screen quality. In a traditional assay development scenario the Z factor (or Z’) is used as one of the measures of assay performance, computed from the positive and negative control samples. The definition of Z’ is based on a 1-D readout, which is the case with most non-high content screens. But what happens when we have to deal with 10 or 20 readouts, as commonly occurs in a high content screen? Assuming one has identified a small set of biologically relevant phenotypic parameters (from the tens or hundreds spit out by HCA software), it makes sense to measure assay performance in terms of the overall biology, rather than one specific aspect of the biology. In other words, a useful performance measure should be able to take into account multiple (preferably orthogonal) readouts.
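For reference, the standard univariate Z’ is defined from the means and standard deviations of the positive and negative control readouts: Z’ = 1 − 3(σ<sub>p</sub> + σ<sub>n</sub>)/|μ<sub>p</sub> − μ<sub>n</sub>|. A minimal sketch of that calculation (class, method, and sample values are mine, purely for illustration):

```java
import java.util.Arrays;

public class ZPrime {
    // Z' = 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|
    static double zprime(double[] pos, double[] neg) {
        double mp = mean(pos), mn = mean(neg);
        double sp = sd(pos, mp), sn = sd(neg, mn);
        return 1.0 - 3.0 * (sp + sn) / Math.abs(mp - mn);
    }

    static double mean(double[] x) {
        return Arrays.stream(x).average().orElse(Double.NaN);
    }

    static double sd(double[] x, double mu) {
        double ss = 0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / (x.length - 1)); // sample standard deviation
    }

    public static void main(String[] args) {
        // hypothetical control readouts for a well-separated assay
        double[] pos = {0.95, 1.02, 0.98, 1.05, 0.99};
        double[] neg = {0.10, 0.12, 0.08, 0.11, 0.09};
        System.out.printf("Z' = %.3f%n", zprime(pos, neg));
    }
}
```

A Z’ above roughly 0.5 is conventionally taken to indicate a good assay; the point of what follows is that this single-readout view breaks down for multi-parameter data.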
In fact, in many high content screening assays, the use of the traditional Z’ with a single readout leads to very low values, suggesting a poor quality assay when that is not the case if one considers the overall biology. One approach that has been described in the literature is an extension of Z’, termed the multivariate Z’. The approach was first described by Kummel et al, and involves developing an LDA model trained on the positive and negative wells. Each well is described by N phenotypic parameters, and the assumption is that one has pre-selected these parameters to be meaningful and relevant. The key to using the model for a Z’ calculation is to replace the N-dimensional values for a given well by the 1-dimensional linear projection of that well:

$P_i = \sum_{j=1}^{N} w_j x_{ij}$

where $P_i$ is the 1-D projected value, $w_j$ is the weight for the $j$’th phenotypic parameter and $x_{ij}$ is the value of the $j$’th parameter for the $i$’th well. The projected value is then used in the Z’ calculation as usual. Kummel et al showed that this approach leads to better (i.e., higher) Z’ values compared to the univariate Z’. Subsequently, Kozak & Csucs extended this approach, using a kernel method to project the N-dimensional well values in a non-linear manner. Unsurprisingly, they show a better Z’ than what would be obtained via a linear projection. And this is where I have my beef with these methods. In fact, a number of beefs:

- These methods are model based and so can suffer from over-fitting. No checks were made, and if over-fitting were to occur one would obtain a falsely optimistic Z’.
- These methods assert success when they perform better than a univariate Z’, or when a non-linear projection does better than a linear projection. But neither comparison is a true indication that they have captured the assay performance in an absolute sense.
In other words, what is the “ground truth” that one should be aiming for when developing multivariate Z’ methods? Given that the upper bound of Z’ is 1.0, one can imagine developing methods that give increasing Z’ values – but does a method that gives a Z’ close to 1 really mean a better assay? It seems that published efforts are measured relative to other implementations and not necessarily to actual assay quality (however that is characterized).

- While the fundamental idea of separation of positive and negative control responses as a measure of assay performance is good, methods that are based on learning this separation are at risk of generating overly optimistic assessments of performance.

## A counter-example

As an example, I looked at a recent high content siRNA screen we ran that had 104 parameters associated with it. The first figure shows the Z’ calculated using each layer individually (excluding layers with abnormally low Z’). As you can see, the highest Z’ is about 0.2. After removing those with no variation, and members of correlated pairs, I ended up with a set of 15 phenotypic parameters. If we compare the per-parameter distributions of the positive and negative control responses, we see very poor separation in all layers but one, as shown in the density plots below (the scales are all independent). I then used these 15 parameters to build an LDA model and obtain a multivariate Z’ as described by Kummel et al. Now, the multivariate Z’ turns out to be 0.68, suggesting a well performing assay. I also performed MDS on the 15-parameter set to get lower dimensional (3D, 4D, 5D, 6D etc.) datasets and performed the same calculation, leading to similar Z’ values (0.41 – 0.58). But in fact, from the biological point of view, the assay performance was quite poor, due to poor performance of the positive control (we haven’t found a good one yet). In practice, then, the model based multivariate Z’ (at least as described by Kummel et al) can be misleading.
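Mechanically, the projection step at the heart of these methods is simple: collapse each control well to a scalar with the learned weights, then apply the usual Z’ formula to the projected values. A sketch, assuming the weight vector has already been obtained (e.g., from an LDA fit); all names here are mine, not from any published implementation:

```java
import java.util.Arrays;

public class ProjectedZPrime {
    // Project an N-dimensional well readout onto 1-D: P_i = sum_j w_j * x_ij
    static double project(double[] weights, double[] well) {
        double p = 0;
        for (int j = 0; j < weights.length; j++) p += weights[j] * well[j];
        return p;
    }

    // Z' computed on the projected values of the positive and negative control wells
    static double zprime(double[] w, double[][] posWells, double[][] negWells) {
        double[] pos = Arrays.stream(posWells).mapToDouble(x -> project(w, x)).toArray();
        double[] neg = Arrays.stream(negWells).mapToDouble(x -> project(w, x)).toArray();
        double mp = mean(pos), mn = mean(neg);
        return 1.0 - 3.0 * (sd(pos, mp) + sd(neg, mn)) / Math.abs(mp - mn);
    }

    static double mean(double[] x) { return Arrays.stream(x).average().orElse(Double.NaN); }

    static double sd(double[] x, double mu) {
        double ss = 0;
        for (double v : x) ss += (v - mu) * (v - mu);
        return Math.sqrt(ss / (x.length - 1)); // sample standard deviation
    }

    public static void main(String[] args) {
        // hypothetical weights and 2-parameter control wells
        double[] w = {0.7, 0.3};
        double[][] pos = {{1.0, 0.9}, {1.1, 1.0}, {0.9, 1.1}};
        double[][] neg = {{0.1, 0.0}, {0.0, 0.2}, {0.2, 0.1}};
        System.out.printf("multivariate Z' = %.3f%n", zprime(w, pos, neg));
    }
}
```

Note that the critique above is precisely that the weights are *fit* to separate the controls, so a high Z’ from this calculation partly reflects the fitting rather than the assay.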
One could argue that I had not chosen an appropriate set of phenotypic parameters – but I checked out a variety of other subsets (though not exhaustively) and got similar Z’ values.

## Alternatives

Of course, it’s easy to complain, and while I haven’t worked out a rigorous alternative, the idea of describing the distance between multivariate distributions as a measure of assay performance (as opposed to learning the separation) allows us to attack the problem in a variety of ways. There is a nice discussion on StackExchange regarding this exact question, with some possibilities. It might be useful to perform a more comprehensive investigation of these methods as a way to measure assay performance.

Written by Rajarshi Guha

September 9th, 2012 at 8:03 pm

## Accessing High Content Data from R

Over the last few months I’ve been getting involved in the informatics & data mining aspects of high content screening. While I haven’t gotten into image analysis itself (there’s a ton of good code and tools already out there), I’ve been focusing on managing image data and meta-data, and asking interesting questions of the voluminous, high-dimensional data that is generated by these techniques. One of our platforms is ImageXpress from Molecular Devices, which stores images in a file-based image store, and meta data and numerical image features in an Oracle database. While they do provide an API to interact with the database, it’s a Windows-only DLL. Since much of my modeling requires that I access the data from R, I needed a more flexible solution. So I’ve put together an R package that allows one to access numeric image data (i.e., descriptors) and the images themselves. It depends on the ROracle package (which in turn requires an Oracle client installation). Currently the functionality is relatively limited, focusing on my common tasks.
Thus, for example, given an assay plate barcode, we can retrieve the assay ids that the plate is associated with, and then for a given assay, obtain the cell-level image parameter data (or optionally, aggregate it to well-level data). This task is easily parallelizable – in fact, when processing a high content RNAi screen, I make use of snow to speed up the data access and processing of 50 plates.

```r
library(ncgchcs)
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)

## multiple analyses could be run on the same plate - we need
## to get the correct one (MX uses 'assay' to refer to an analysis run)
## so we first get details of analyses without retrieving the actual data
details <- get.assay.by.barcode(con, barcode=plate.barcode, dry=TRUE)
details <- subset(details, PLATE_ID == plate.id & SETTINGS_NAME == assay.name)
assay.id <- details$ASSAY_ID

## finally, get the analysis data, using median to aggregate cell-level data
hcs.data <- get.assay(con, assay.id, aggregate.func=median, verbose=FALSE, na.rm=TRUE)
```

Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use EBImage to perform image processing entirely in R.

```r
library(ncgchcs)
## will want to set IMG.STORE.LOC to point to your image store
con <- get.connection(user='foo', passwd='bar', sid='baz')
plate.barcode <- 'XYZ1023'
plate.id <- get.plates(con, plate.barcode)
get.image.path(con, plate.id, 4, 4) ## get images for all sites & wavelengths
```

Currently, you cannot get the internal plate id based on the user-assigned plate name (which is usually different from the barcode). Also, the documentation is non-existent, so you need to explore the package to learn the functions. If there’s interest I’ll put in Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface to make our imaging data accessible via the web.

Of course, this is all specific to the ImageXpress platform – we have others, such as InCell and Acumen. To have a comprehensive solution for all our imaging, I’m looking at the OME infrastructure as a means of, at the very least, having a unified interface to the images and their meta data.

Written by Rajarshi Guha

May 27th, 2011 at 5:01 am

Posted in software, Uncategorized


## Call for Papers: High Content Screening: Exploring Relationships Between Small Molecules and Phenotypic Results

242nd ACS National Meeting
Denver, Aug 28 – Sept 1, 2011
CINF Division

Dear Colleagues, we are organizing an ACS symposium, focusing on the use of High Content Screening (HCS) for small molecule applications. High content screens, while resource intensive, are capable of providing a detailed view of the phenotypic effects of small molecules. Traditional reporter based screens are characterized by a one-dimensional signal. In contrast, high content screens generate rich, multi-dimensional datasets that allow for wide-ranging and in-depth analysis of various aspects of chemical biology including mechanisms of action, target identification and so on. Recent developments in high-throughput HCS pose significant challenges throughout the screening pipeline ranging from assay design and miniaturization to data management and analysis. Underlying all of this is the desire to connect chemical structure to phenotypic effects.

We invite you to submit contributions highlighting novel work and new developments in High Content Screening (HCS), High Content Analysis (HCA), and data exploration as it relates to the field of small molecules. Topics of interest include but are not limited to:

• Compound & in silico screening for drug discovery
• Compound profiling by high content analysis
• Chemistry & probes in imaging
• Lead discovery strategies – one size fits all or horses for courses?
• Application of HCA in discovering toxicology screening strategies
• Novel data mining approaches for HCS data that link phenotypes to chemical structures
• Software & informatics for HCS data management and integration
In addition to these topics, special consideration will be given to contributions presenting in silico exploration based on HCS data. We would also like to point out that sponsorship opportunities are available. The deadline for abstract submissions is April 1, 2011. All abstracts should be submitted via PACS at http://abstracts.acs.org. If you have any questions feel free to contact Tim or myself.

Tim Moran
Accelrys
tmoran@accelrys.com
+1 858 799 5609

Rajarshi Guha
NIH Chemical Genomics Center
guhar@mail.nih.gov
+1 814 404 5449

Written by Rajarshi Guha

March 24th, 2011 at 12:26 pm

Posted in research,software


## Lots of Pretty Pictures

Yesterday I attended the High Content Analysis conference in San Francisco. Over the last few months I’ve been increasingly involved in the analysis of high content screens, both for small molecules and siRNA. This conference gave me the opportunity to meet people working in the field as well as present some of our recent work on an automated screening methodology integrating primary and secondary screens into a single workflow.

The panel discussion was interesting, though I was surprised that standards were such a major issue. Data management and access is certainly a major problem in this field, given that a single screen can generate TB’s of image data plus millions, or even billions, of rows of cell-level data. The cloud did come up, but I’m not sure how smooth a workflow involving cloud operations would be.

Some of the talks were interesting, such as the presentation on OME by Jason Swedlow. The talk that really caught my eye was by Ilya Goldberg on their work with WND-CHARM. In contrast to traditional analysis of high content screens, which involves cell segmentation and subsequent object identification, he tackles the problem by considering the image itself as the object. Thus, rather than evaluating phenotypic descriptors for individual cells, he evaluates descriptors such as textures, Haralick features etc. for an entire image of a well. With these descriptors he then develops classification models using LDA – which does surprisingly well (in that SVM’s don’t do a whole lot better!). The approach is certainly attractive, as image segmentation can get quite hairy. At the same time, the method requires pretty good performance on the control wells. Currently, I’ve been following the traditional HCA workflow – which has worked quite well in terms of classification performance. However, this technique is certainly one to look into, as it could avoid some of the subjectivity involved in segmentation based workflows.
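To give a flavor of the whole-image idea, one of the simplest texture descriptors is the contrast statistic of a gray-level co-occurrence matrix, computed here over an entire image rather than per segmented cell. This is only an illustrative sketch (it is not WND-CHARM’s actual feature set, and the class and method names are mine):

```java
public class GlcmContrast {
    // Mean squared gray-level difference between horizontally adjacent pixels,
    // i.e. the contrast statistic of a co-occurrence matrix with offset (0,1).
    static double contrast(int[][] img) {
        double sum = 0;
        int npairs = 0;
        for (int[] row : img) {
            for (int c = 0; c < row.length - 1; c++) {
                int d = row[c] - row[c + 1];
                sum += (double) d * d;
                npairs++;
            }
        }
        return sum / npairs;
    }

    public static void main(String[] args) {
        int[][] flat  = {{5, 5, 5}, {5, 5, 5}};   // no texture at all
        int[][] sharp = {{0, 9, 0}, {9, 0, 9}};   // alternating extremes
        System.out.println(contrast(flat));   // prints 0.0
        System.out.println(contrast(sharp));  // prints 81.0
    }
}
```

A vector of such whole-image statistics (over several offsets, scales and feature families) is what would then feed the LDA classifier, with no segmentation step at all.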

As always, San Francisco is a wonderful place – weather, food and feel. Due to my short stay I could only sample one restaurant – a tapas bar called Lalola. A fantastic place with a mind-blowing mushroom tapas and the best sangria I’ve had so far. Highly recommended.

Written by Rajarshi Guha

January 13th, 2011 at 5:51 pm

Posted in software,visualization
