So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘software’ Category

Call for Papers: High Content Screening: Exploring Relationships Between Small Molecules and Phenotypic Results

without comments

242nd ACS National Meeting
Denver, Aug 28 – Sept 1, 2011
CINF Division

Dear Colleagues, we are organizing an ACS symposium, focusing on the use of High Content Screening (HCS) for small molecule applications. High content screens, while resource intensive, are capable of providing a detailed view of the phenotypic effects of small molecules. Traditional reporter based screens are characterized by a one-dimensional signal. In contrast, high content screens generate rich, multi-dimensional datasets that allow for wide-ranging and in-depth analysis of various aspects of chemical biology including mechanisms of action, target identification and so on. Recent developments in high-throughput HCS pose significant challenges throughout the screening pipeline ranging from assay design and miniaturization to data management and analysis. Underlying all of this is the desire to connect chemical structure to phenotypic effects.

We invite you to submit contributions highlighting novel work and new developments in High Content Screening (HCS), High Content Analysis (HCA), and data exploration as it relates to the field of small molecules. Topics of interest include, but are not limited to:

  • Compound & in silico screening for drug discovery
  • Compound profiling by high content analysis
  • Chemistry & probes in imaging
  • Lead discovery strategies – one size fits all or horses for courses?
  • Application of HCA in discovery toxicology screening strategies
  • Novel data mining approaches for HCS data that link phenotypes to chemical structures
  • Software & informatics for HCS data management and integration
In addition to these topics, special consideration will be given to contributions presenting in silico exploration based on HCS data. We would also like to point out that sponsorship opportunities are available. The deadline for abstract submissions is April 1, 2011. All abstracts should be submitted via PACS at http://abstracts.acs.org. If you have any questions, feel free to contact Tim or me.

Tim Moran
Accelrys
tmoran@accelrys.com
+1 858 799 5609

Rajarshi Guha
NIH Chemical Genomics Center
guhar@mail.nih.gov
+1 814 404 5449

Written by Rajarshi Guha

March 24th, 2011 at 12:26 pm

Posted in research, software


CDK & logP Values

with 2 comments

Recently, Tony Williams enquired whether there had been any comparisons of the CDK with other tools for the calculation of polar surface area (PSA) and logP. Given that PSA calculations using the fragments defined by Ertl et al are pretty straightforward, it’s not surprising that the CDK implementation matches very well with the ACD Labs implementation (based on 57,000 molecules). More interesting however is the performance of different logP methods on experimental data. (Note that Mannhold et al performed a very comprehensive comparison of logP predictors. This post just focuses on the CDK).

To that end I evaluated logP values for ~10,000 molecules from the (proprietary) logPstar dataset, using the CDK’s XLogP implementation, ACD Labs (v12) and ChemAxon (c5.2.1_1). As can be seen from the plots, ACD performs best and the XLogP method fares quite poorly. In all cases, default settings were used. In addition, the CDK has an implementation of ALogP, but it performed so poorly that I don’t list it here.

Given that the ACD predictions are based on a neural network model, I was interested in how well a predictive model based on CDK descriptors would perform when trained on this dataset. Since this was just a quick exploration, I didn’t put too much effort into the model building process. So I evaluated a set of CDK topological and constitutional descriptors and performed minimal feature selection to remove those descriptors with undefined values – giving a final pool of 111 descriptors.

I split the dataset into a training and prediction set (60/40 split) and then threw the training set into a random forest model, which performs implicit feature selection and is relatively resistant to overfitting. As the plot shows, the performance is significantly better than XLogP (training set R2 = 0.87 and prediction set R2 = 0.86). Multiple training/prediction set splits gave similar results.
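Something along the following lines captures that workflow. This is just a rough sketch in Python with scikit-learn, not the actual code used for this post; the file name cdk_descriptors.csv and the logp column are hypothetical stand-ins for the CDK descriptor pool and the experimental values described above.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical input: one row per molecule, a 'logp' column with the
# experimental values and the remaining columns holding CDK descriptors
data = pd.read_csv("cdk_descriptors.csv")

# Minimal feature selection: drop descriptors with undefined (NA) values
data = data.dropna(axis=1)

y = data["logp"].values
X = data.drop(columns=["logp"]).values

# 60/40 training/prediction split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42)

model = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

print("training R2  :", r2_score(y_train, model.predict(X_train)))
print("prediction R2:", r2_score(y_test, model.predict(X_test)))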

While it’s not as good as the ACD model, it was obtained using about 20 minutes of effort. Certainly, moving to a neural network or SVM model coupled with an explicit feature selection approach should lead to further improvements in the performance of this model.

Written by Rajarshi Guha

February 3rd, 2011 at 2:53 am

Caching SMARTS Queries

with 3 comments

Andrew Dalke recently published a detailed write up on his implementation of the Pubchem fingerprints and provided a pretty thorough comparison with the CDK implementation. He pointed out a number of bugs in the CDK version, but he also noted that performance could be improved by caching parsed SMARTS queries, which are used extensively in this fingerprinter. So I wanted to see whether caching really helps.

The CDK SMARTS parser is a JJTree-based implementation and my anecdotal evidence suggested that SMARTS parsing was not a real bottleneck. But of course, nothing beats measurements. So I modified the SMARTSQueryTool class to cache parsed SMARTS queries using a simple LRU cache (patch):

// A simple LRU cache, keyed by the SMARTS string: an access-ordered
// LinkedHashMap (java.util.LinkedHashMap / java.util.Map) that evicts the
// least recently used entry once it holds more than MAX_ENTRIES queries.
final int MAX_ENTRIES = 20;
Map<String, QueryAtomContainer> cache = new LinkedHashMap<String, QueryAtomContainer>(MAX_ENTRIES + 1, .75F, true) {
    public boolean removeEldestEntry(Map.Entry eldest) {
        return size() > MAX_ENTRIES;
    }
};

The map is keyed on the SMARTS string. Then, when we attempt to parse a new SMARTS query, we check if it’s in the cache and if not, parse it and place it in the cache.

Now, the thing about this caching mechanism is that after 20 SMARTS queries have been cached, the least recently used one is replaced by each new one. As a result, if we perform matching with 40 unique SMARTS (in sequence) for a given molecule, only the last 20 remain cached. When we move on to a new molecule and start again from the first query, it is no longer in the cache, and making room for it evicts entries that will be needed later, so we get essentially no benefit from the caching scheme. In general, if the fingerprinter (or any caller of the parser) performs N unique SMARTS queries per molecule, the cache size must be at least N for the cache to pay off on subsequent molecules.
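To make this concrete, here is a small, purely illustrative Python sketch (not CDK code; the queries are just strings standing in for SMARTS) that counts how often a stand-in "parser" runs when 40 unique queries are matched against several molecules, for two LRU cache sizes.

from functools import lru_cache

def count_parses(cache_size, n_queries, n_molecules):
    # Count how often the 'parser' runs when each of n_molecules is matched
    # against the same n_queries unique SMARTS, with an LRU cache of the
    # given size.
    calls = {"n": 0}

    @lru_cache(maxsize=cache_size)
    def parse(smarts):
        calls["n"] += 1   # stands in for the expensive parse step
        return smarts     # a real parser would return a query object

    patterns = ["q%d" % i for i in range(n_queries)]
    for _ in range(n_molecules):
        for p in patterns:
            parse(p)
    return calls["n"]

# 40 unique queries per molecule, 5 molecules
print(count_parses(20, 40, 5))   # 200 parses: the undersized cache never helps
print(count_parses(40, 40, 5))   # 40 parses: each query is parsed exactly once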

I implemented a quick test harness, reading in 100 SMILES and then generating Pubchem fingerprints for each molecule. The fingerprint generation was repeated 5 times and the time reported for each round. The box plot shows the distribution of the timings. Now, the CDK implementation uses 621 SMARTS patterns, and as you can see, we only get a benefit from the caching when the cache size reaches 700. In fact, smaller cache sizes lead to a performance hit, which I assume is due to the time required to query the map.

While the performance improvement is not dramatic, it is close to 10% compared to no caching at all. In actuality, the major bottleneck is not the parsing but the actual graph isomorphism (substructure matching) step, which we hope to drastically improve by using the SMSD code. Maybe then, SMARTS parsing will take up a bigger fraction of the time. Also keep in mind that, due to the heavyweight nature of CDK molecule objects, very large caches could be a strain on memory. But to check that out, I should use a profiler.

Written by Rajarshi Guha

January 23rd, 2011 at 12:56 am

Lots of Pretty Pictures

without comments

Yesterday I attended the High Content Analysis conference in San Francisco. Over the last few months I’ve been increasingly involved in the analysis of high content screens, both for small molecules and siRNA. This conference gave me the opportunity to meet people working in the field as well as present some of our recent work on an automated screening methodology integrating primary and secondary screens into a single workflow.

The panel discussion was interesting, though I was surprised that standards were such a major issue. Data management and access is certainly a major problem in this field, given that a single screen can generate terabytes of image data plus millions or even billions of rows of cell-level data. The cloud did come up, but I'm not sure how smooth a workflow involving cloud operations would be.

Some of the talks were interesting, such as the presentation on OME by Jason Swedlow. The talk that really caught my eye was by Ilya Goldberg on their work with WND-CHARM. In contrast to traditional analysis of high content screens, which involves cell segmentation and subsequent object identification, he tackles the problem by considering the image itself as the object. Thus, rather than evaluate phenotypic descriptors for individual cells, he evaluates descriptors such as textures, Haralick features, etc., for an entire image of a well. With these descriptors he then develops classification models using LDA, which does surprisingly well (in that SVMs don't do a whole lot better!). The approach is certainly attractive, as image segmentation can get quite hairy. At the same time, the method requires pretty good performance on the control wells. Currently, I've been following the traditional HCA workflow, which has worked quite well in terms of classification performance. However, this technique is certainly one to look into, as it could avoid some of the subjectivity involved in segmentation-based workflows.

As always, San Francisco is a wonderful place: weather, food and feel. Due to my short stay I could only sample one restaurant, a tapas bar called Lalola. A fantastic place with mind-blowing mushroom tapas and the best sangria I've had so far. Highly recommended.

Written by Rajarshi Guha

January 13th, 2011 at 5:51 pm

Posted in software, visualization


Wikipedia Category Hierarchy via N-triples

without comments

For a current project I needed to obtain a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the Mediawiki API. However, this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-triple format. It's thus relatively easy to read in the triples and recursively determine the parent-child relationships.

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

from ntriples import *
import sys

class Triple:
    """
     A simplistic representation of a triple
    """

    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o
    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
     This class stores the triples as they are parsed from the file
    """

    def __init__(self):
        self._triples = []

    def triple(self, s, p, o):
        self._triples.append( Triple(s,p,o) )
       
    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as

p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically, those with a predicate ending in "#type" or "#prefLabel"). This is relatively easy via filter:

ts = filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts)

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and whose predicate is the "broader" relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>); these triples are the first-level children. We then iterate over each of them and recursively identify their children.

protein_children = filter(lambda x: x.object().endswith("Category:Proteins"), ts)

def recurseChildren(query):
    c = filter(lambda x: x.object() == query.subject(), ts)
    if len(c) == 0: return []
    else:
        ret = []
        for i in c: ret.append( (i, recurseChildren(i)) )
        return ret

root = []
for child in protein_children:
    root.append( (child, recurseChildren(child)) )

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I've shown below. With this code in hand, I can now build the full hierarchy (using all 2.2M triples) and identify the actual pages associated with each category (once again, using DBpedia).
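Something along these lines will print the nested (triple, children) tuples as an indented listing like the one below. This is just a sketch, assuming the root structure built above; category_name is a hypothetical helper that strips the URI prefix.

def category_name(uri):
    # e.g. http://dbpedia.org/resource/Category:Enzymes -> Enzymes
    return uri.split("Category:")[-1]

def printHierarchy(nodes, depth=1):
    # nodes is a list of (Triple, children) tuples as built by recurseChildren
    for triple, children in nodes:
        print("  " * depth + category_name(triple.subject()))
        printHierarchy(children, depth + 1)

printHierarchy(root)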

  Enzymes
      Viral_enzymes
  Receptors
      Transmembrane_receptors
          7TM_receptors
              G_protein_coupled_receptors
          Tyrosine_kinase_receptors
          Ionotropic_receptors
      Sensory_receptors
          Photoreceptor_cells
      Intracellular_receptors
  Membrane_proteins
      Integral_membrane_proteins
      Peripheral_membrane_proteins
          G_proteins
          Lantibiotics
  Protein_structure
      Protein_structural_motifs
      Protein_domains
  Heat_shock_proteins
  Glycoproteins
  Serine_protease_inhibitors
  Prions
  Growth_factors
  Lipoproteins
  Cytokines
  Protein_images
  Metalloproteins
      Iron-sulfur_proteins
      Hemoproteins
  Cytoskeleton
      Motor_proteins
      Structural_proteins
          Keratins
  Motor_proteins
  Protein_methods
  Structural_proteins
      Keratins
  Protein_domains
  Cell_adhesion_proteins
  Clusters_of_differentiation

Written by Rajarshi Guha

January 2nd, 2011 at 7:28 am

Posted in software
