So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘software’ Category

Lots of Pretty Pictures

without comments

Yesterday I attended the High Content Analysis conference in San Francisco. Over the last few months I’ve been increasingly involved in the analysis of high content screens, both for small molecules and siRNA. This conference gave me the opportunity to meet people working in the field as well as present some of our recent work on an automated screening methodology integrating primary and secondary screens into a single workflow.

The panel discussion was interesting, though I was surprised that standards was a major issue. Data management and access is certainly a major problem in this field, given that a single screen can generate TB’s of image data plus millions to even billions of rows of cell-level data. The cloud did come up, but I’m not sure how smooth a workflow would be involving cloud operations.

Some of the talks were interesting such as the presentation on OME by Jason Swedlow. The talk that really caught my eye was by Ilya Goldberg on their work with WND-CHARM. In contrast to traditional analysis of high content screens which involves cell segmentation and subsequent object identification, he tackles the problem by consider the image itself as the object. Thus rather than evaluate phenotypic descriptors for individual cells, he evaluates descrptors such as texttures, Haralick features etc., for an entire image of a well. With these descriptors he then develops classification models using LDA – which does surprisingly well (in that SVM’s don’t do a whole lot better!). The approach is certainly attractive as image segmentation can get qute hairy. At the same time, the method requires pretty good performance on the control wells. Currently, I’ve been following the traditional HCA workflow – which has worked quite well in terms of classification performance. However, this technique is certainly one to look into, as it could avoid some of the subjectivity involved in segmentation based workflows.

As always, San Francisco is a wonderful place – weather, food and feel. Due to my short stay I could only sample one resteraunt – a tapas bar called Lalola. A fantastic place with a mind blowing mushroom tapas and the best sangria I’ve had so far. Highly recommended.

Written by Rajarshi Guha

January 13th, 2011 at 5:51 pm

Posted in software,visualization

Tagged with , , ,

Wikipedia Category Hierarchy via N-triples

without comments

For a current project I needed to obtain a hierarchical representation of Wikipedia categories. (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the Mediawiki API. However, this was a little unweildy. Instead, I  came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-triple format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start of with some classes to handle triples.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
from ntriples import *
import sys

class Triple:
    """
     A simplistic representation of a triple
    """

    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o
    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
     This class stores the triples as they are parsed from the file
    """

    def __init__(self):
        self._triples = []

    def triple(self, s, p, o):
        self._triples.append( Triple(s,p,o) )
       
    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as

1
2
3
p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically those with a predicate of “#type” or “#prefLabel. This is relatively easy via filter

1
ts = filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts)

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first level children. We then iterate over each of them and recursively identify their children.

1
2
3
4
5
6
7
8
9
10
11
12
13
protein_children = filter(lambda x: x.object().endswith("Category:Proteins"), ts)

def recurseChildren(query):
    c = filter(lambda x: x.object() == query.subject(), ts)
    if len(c) == 0: return []
    else:
        ret = []
        for i in c: ret.append( (i, recurseChildren(i)) )
        return ret

root = []
for child in protein_children:
    root.append( (child, recurseChildren(child)) )

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below. With this code in hand, I can now build the full hierarchy using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
  Enzymes
      Viral_enzymes
  Receptors
      Transmembrane_receptors
          7TM_receptors
              G_protein_coupled_receptors
          Tyrosine_kinase_receptors
          Ionotropic_receptors
      Sensory_receptors
          Photoreceptor_cells
      Intracellular_receptors
  Membrane_proteins
      Integral_membrane_proteins
      Peripheral_membrane_proteins
          G_proteins
          Lantibiotics
  Protein_structure
      Protein_structural_motifs
      Protein_domains
  Heat_shock_proteins
  Glycoproteins
  Serine_protease_inhibitors
  Prions
  Growth_factors
  Lipoproteins
  Cytokines
  Protein_images
  Metalloproteins
      Iron-sulfur_proteins
      Hemoproteins
  Cytoskeleton
      Motor_proteins
      Structural_proteins
          Keratins
  Motor_proteins
  Protein_methods
  Structural_proteins
      Keratins
  Protein_domains
  Cell_adhesion_proteins
  Clusters_of_differentiation

Written by Rajarshi Guha

January 2nd, 2011 at 7:28 am

Posted in software

Tagged with , , , ,

Visualizing PAINS SMARTS

without comments

A few days ago I had made available a SMARTS version of the PAINS substructural filters, that were converted using CACTVS from the original SLN patterns. I had mentioned that the SMARTSViewer application was a handy way to visualize the complex SMARTS patterns. Matthias Rarey let me know that his student had converted all the SMARTS to SMARTSViewer depictions and made them available as a PDF. Given the complexity of many of the PAINS patterns, these depictions are a very nice way to get a quick idea of what is supposed to match.

(FWIW, the SMARTS don’t reproduce the matches obtained using the original SLN’s – but hopefully the depictions will help anybody who’d like to try and fix the SMARTS).

Written by Rajarshi Guha

December 2nd, 2010 at 1:18 am

Posted in cheminformatics,software

Tagged with , ,

Similarity Matrices in Parallel

without comments

Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections aren’t always feasible – in many cases, it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule with all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints. So again, for very large collections, this is not always practical. However, for small collections parallel evaluation can lead to speed ups.

The fingerprint package provides a method to directly get the similarity matrix for a set of fingerprints, but this is implemented in interpreted R so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply’s:

1
2
3
4
library(fingerprint)
sims <- lapply(fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

For 1012 fingerprints, this takes 286s on my Macbook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

1
2
3
4
5
6
7
8
library(fingerprint)
library(snow)
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(fingerprint))
clusterExport(cl, "fps")
sim <- parLapply(cl, fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

Written by Rajarshi Guha

December 2nd, 2010 at 1:03 am

Inserting 2D Depictions into R Plots

without comments

Recent versions of rcdk allow you to insert images of chemical structures into R plots, via the view.image.2d and rasterImage functions. One problem with the latter function is that the 2D structure image must be located in plot units, rather than pixel units. Paul Murrell suggested an easy way to insert the raster image into the plot region, maintaining the  native resolution of the image:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
library(rcdk)
m <- parse.smiles("O=C(C1=CC=CC=C1)C1=CC=CC=C1")[[1]]
img <- view.image.2d(m, 200,200)
plot(10:1, pch=19)

## Position the depiction at the lower left corner
dpi <- (par("cra")/par("cin"))[1]
usr <- par("usr")
xl <- usr[1]
yb <- usr[3]
xr <- xl + xinch(200/dpi)
yt <- yb + yinch(200/dpi)

rasterImage(img, xl,yb, xr,yt)

Written by Rajarshi Guha

November 20th, 2010 at 5:09 pm

Posted in cheminformatics,software

Tagged with , ,