Lots of Pretty Pictures

Yesterday I attended the High Content Analysis conference in San Francisco. Over the last few months I’ve been increasingly involved in the analysis of high content screens, both for small molecules and siRNA. This conference gave me the opportunity to meet people working in the field as well as present some of our recent work on an automated screening methodology integrating primary and secondary screens into a single workflow.

The panel discussion was interesting, though I was surprised that standards were such a major issue. Data management and access is certainly a major problem in this field, given that a single screen can generate terabytes of image data plus millions to billions of rows of cell-level data. The cloud did come up, but I’m not sure how smooth a workflow built around cloud operations would be.

Some of the talks were interesting, such as the presentation on OME by Jason Swedlow. The talk that really caught my eye was by Ilya Goldberg on their work with WND-CHARM. In contrast to traditional analysis of high content screens, which involves cell segmentation and subsequent object identification, he tackles the problem by considering the image itself as the object. Thus, rather than evaluating phenotypic descriptors for individual cells, he evaluates descriptors such as textures, Haralick features, etc. for an entire image of a well. With these descriptors he then develops classification models using LDA – which does surprisingly well (in that SVMs don’t do a whole lot better!). The approach is certainly attractive, as image segmentation can get quite hairy. At the same time, the method requires pretty good performance on the control wells. Currently, I’ve been following the traditional HCA workflow – which has worked quite well in terms of classification performance. However, this technique is certainly one to look into, as it could avoid some of the subjectivity involved in segmentation-based workflows.
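
To make the “image as the object” idea concrete, here is a minimal sketch – not WND-CHARM itself, just an illustration using whole-image Haralick textures and an LDA classifier. It assumes the well images are already available as 2D numpy arrays (control_images, control_labels and screen_images are placeholders), and uses mahotas and scikit-learn purely for convenience.

import numpy as np
import mahotas as mh
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def well_descriptor(img):
    # Haralick texture features, averaged over the four co-occurrence
    # directions, computed on the whole well image (no segmentation step)
    return mh.features.haralick(img.astype(np.uint8)).mean(axis=0)

# control_images / control_labels are placeholders for labeled control wells
X = np.array([well_descriptor(img) for img in control_images])
y = np.array(control_labels)

clf = LinearDiscriminantAnalysis().fit(X, y)

# classify the remaining (screening) wells from their whole-image descriptors
predictions = clf.predict(np.array([well_descriptor(img) for img in screen_images]))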

As always, San Francisco is a wonderful place – weather, food and feel. Due to my short stay I could only sample one restaurant – a tapas bar called Lalola. A fantastic place with a mind-blowing mushroom tapas and the best sangria I’ve had so far. Highly recommended.

Wikipedia Category Hierarchy via N-triples

For a current project I needed to obtain a hierarchical representation of Wikipedia categories (which can be explored here). Pierre Lindenbaum provided some useful pointers on using the MediaWiki API. However, this was a little unwieldy. Instead, I came across the DBpedia downloads. More specifically, the SKOS categories files provide the links between categories using the SKOS vocabulary in N-triple format. It’s thus relatively easy to read in the triples and recursively determine the parent-child relationships.

I put together some quick Python code to obtain the parent-child relationships for all categories starting from Category:Proteins. The code is based on ntriples.py. We start off with some classes to handle triples.

from ntriples import *
import sys

class Triple:
    """
     A simplistic representation of a triple
    """

    def __init__(self, s, p, o):
        self._s = s
        self._p = p
        self._o = o
    def __repr__(self): return '%s, %s, %s' % (self._s, self._p, self._o)
    def subject(self): return self._s
    def predicate(self): return self._p
    def object(self): return self._o

class MySink:
    """
     This class stores the triples as they are parsed from the file
    """

    def __init__(self):
        self._triples = []

    def triple(self, s, p, o):
        self._triples.append( Triple(s,p,o) )
       
    def __len__(self): return len(self._triples)

    def getTriples(self): return self._triples

Loading in the triples is then as simple as

p = NTriplesParser(sink=MySink())
sink = p.parse(open(sys.argv[1]))
ts = sink.getTriples()

This results in a list of Triple objects. Before building the hierarchy we remove triples that are not of interest (specifically, those with a predicate of “#type” or “#prefLabel”). This is relatively easy via filter:

ts = filter(lambda x: x.predicate().split("#")[1] not in ('type', "prefLabel"), ts)

With these triples in hand, we can start building the hierarchy. We first identify those triples whose object is the Proteins category (<http://dbpedia.org/resource/Category:Proteins>) and whose predicate is the “broader” relation from the SKOS vocabulary (<http://www.w3.org/2004/02/skos/core#broader>) – these triples are the first-level children. We then iterate over each of them and recursively identify their children.

protein_children = filter(lambda x: x.object().endswith("Category:Proteins"), ts)

def recurseChildren(query):
    c = filter(lambda x: x.object() == query.subject(), ts)
    if len(c) == 0: return []
    else:
        ret = []
        for i in c: ret.append( (i, recurseChildren(i)) )
        return ret

root = []
for child in protein_children:
    root.append( (child, recurseChildren(child)) )

Taking the first 300,000 triples from the SKOS categories file lets us build a partial hierarchy, which I’ve shown below (a simple recursive printer that produces this listing is sketched after it). With this code in hand, I can now build the full hierarchy using all 2.2M triples and identify the actual pages associated with each category (once again, using DBpedia).

  Enzymes
      Viral_enzymes
  Receptors
      Transmembrane_receptors
          7TM_receptors
              G_protein_coupled_receptors
          Tyrosine_kinase_receptors
          Ionotropic_receptors
      Sensory_receptors
          Photoreceptor_cells
      Intracellular_receptors
  Membrane_proteins
      Integral_membrane_proteins
      Peripheral_membrane_proteins
          G_proteins
          Lantibiotics
  Protein_structure
      Protein_structural_motifs
      Protein_domains
  Heat_shock_proteins
  Glycoproteins
  Serine_protease_inhibitors
  Prions
  Growth_factors
  Lipoproteins
  Cytokines
  Protein_images
  Metalloproteins
      Iron-sulfur_proteins
      Hemoproteins
  Cytoskeleton
      Motor_proteins
      Structural_proteins
          Keratins
  Motor_proteins
  Protein_methods
  Structural_proteins
      Keratins
  Protein_domains
  Cell_adhesion_proteins
  Clusters_of_differentiation
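
For reference, the indented listing above can be produced with a short recursive printer along these lines – a sketch that assumes each subject returned by the parser is the full DBpedia category URI, so the category name is whatever follows the last colon:

def printHierarchy(nodes, depth=0):
    # nodes is a list of (Triple, children) tuples as built by recurseChildren
    for triple, children in nodes:
        name = triple.subject().split(":")[-1]   # e.g. 'Enzymes' from the category URI
        print " " * (2 + 4 * depth) + name
        printHierarchy(children, depth + 1)

printHierarchy(root)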

2nd Call for Papers – ICCS, 2011

This has already been posted on some mailing lists, but one more place can’t hurt. The International Conference on Chemical Structures (ICCS) is coming up in June 2011 in Noordwijkerhout, The Netherlands. I’m on the scientific advisory board and am planning to attend this meeting, as the topics being covered look pretty interesting, especially those focusing on ‘systems’ aspects of cheminformatics and bioinformatics. The abstract submission deadline is January 31, 2011.

C A L L   F O R   P A P E R S
9th International Conference on Chemical Structures
NH Leeuwenhorst Conference Hotel,
Noordwijkerhout, The Netherlands

5-9 June 2011

Visit the conference website at www.int-conf-chem-structures.org for
more information.

The 9th International Conference on Chemical Structures (ICCS) is
seeking presentations of novel research and emerging technologies for
the following plenary sessions:

o Cheminformatics
> advances in structure representation
> reaction handling and electronic lab notebooks (ELNs)
> molecular similarity and diversity
> chemical information visualization

o Structure-Activity and Structure-Property Prediction
> graphical methods for SAR analysis
> industrialized and large-scale model building
> multi-property prediction and multi-objective optimization

o Structure-Based Drug Design and Virtual Screening
> new docking and scoring approaches
> improved understanding of protein-ligand interactions
> pharmacophore definition and search
> modeling of challenging targets

o Analysis of Large Chemistry Spaces
> mining of chemical literature and patents
> design, profiling and comparison of compound collections and screening sets
> machine learning and knowledge extraction from databases

o Integrated Chemical Information
> advances in chemogenomics
> integration of medical and biological information
> semantic technologies as a driver of integration
> translational informatics

o Dealing with Biological Complexity
> analysis and prediction of poly-pharmacology
> in-silico analysis of toxicology, drug safety, and adverse events
> pathways and biological networks
> druggability of targets

Before and after the official conference program free workshops will be
offered by several companies including BioSolveIT (www.biosolveit.de)
and the Chemical Computing Group (www.chemcomp.com).

Joint Organizers:
o Division of Chemical Information of the American Chemical Society
(CINF)
o Chemical Structure Association Trust (CSA Trust)
o Division of Chemical Information and Computer Science of the
Chemical Society of Japan (CSJ)
o Chemistry-Information-Computer Division of the Society of German
Chemists (GDCh)
o Royal Netherlands Chemical Society (KNCV)
o Chemical Information Group of the Royal Society of Chemistry (RSC)
o Swiss Chemical Society (SCS)

We encourage the submission of papers on both applications and case
studies as well as on method development and algorithmic work. The final
program will be a balance of these two aspects.

From the submissions the program committee and the scientific advisory
board will select about 30 papers for the plenary sessions. All submissions
that cannot be included in the plenary sessions will automatically be
considered for the poster session.

Contributions can be submitted for any of the above and related areas,
but we also welcome contributions in any aspect of the computer handling
of chemical structure information, such as:

o automatic structure elucidation
o combinatorial chemistry, diversity analysis
o web technology and its effect on chemical information
o electronic publishing
o MM or QM/MM simulations
o practical free energy calculations
o modeling of ADME properties
o material sciences
o analysis and prediction of crystal structures
o grid and cloud computing in cheminformatics

Visit the conference website at http://www.int-conf-chem-structures.org for
more information, including details on procedures for online abstract
submission and conference registration.

The deadline for the submission of abstracts is 31 January 2011.

We hope to see you in Noordwijkerhout.

Keith T Taylor, ICCS Chair
Markus Wagener, ICCS Co-Chair

Visualizing PAINS SMARTS

A few days ago I made available a SMARTS version of the PAINS substructural filters, converted from the original SLN patterns using CACTVS. I had mentioned that the SMARTSViewer application is a handy way to visualize the complex SMARTS patterns. Matthias Rarey let me know that his student had converted all the SMARTS to SMARTSViewer depictions and made them available as a PDF. Given the complexity of many of the PAINS patterns, these depictions are a very nice way to get a quick idea of what is supposed to match.

(FWIW, the SMARTS don’t reproduce the matches obtained using the original SLNs – but hopefully the depictions will help anybody who’d like to try and fix the SMARTS.)
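
For anyone wanting to poke at the patterns, here is a minimal sketch of loading the SMARTS and checking them against a molecule, using RDKit purely for illustration – the file name and the test SMILES are placeholders, and the exact format of the SMARTS file may differ:

from rdkit import Chem

# 'pains.smarts' is a placeholder for a file with one SMARTS pattern per line
patterns = []
for line in open('pains.smarts'):
    if not line.strip():
        continue
    smarts = line.strip().split()[0]
    patt = Chem.MolFromSmarts(smarts)
    if patt is not None:           # some converted patterns may fail to parse
        patterns.append((smarts, patt))

# a quinone-like test molecule (placeholder SMILES)
mol = Chem.MolFromSmiles('O=C1C=CC(=O)C=C1')
hits = [s for s, p in patterns if mol.HasSubstructMatch(p)]
print hits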

Similarity Matrices in Parallel

Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Pairwise similarity matrix calculations (whether they’re for molecules, sequences or anything else) are by definition quadratic in nature, so performing them for large collections isn’t always feasible – in many cases it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule against all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints, so again, for very large collections this is not always practical. For small collections, however, parallel evaluation can lead to useful speed-ups.

The fingerprint package provides a method to directly get the similarity matrix for a set of fingerprints, but this is implemented in interpreted R and so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply calls:

library(fingerprint)
# fps is a list of fingerprint objects; for each one, compute its similarity
# to every fingerprint in the list
sims <- lapply(fps, function(x) {
  unlist(lapply(fps, function(y) distance(x, y)))
})

For 1012 fingerprints, this takes 286s on my MacBook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

library(fingerprint)
library(snow)

# set up a socket cluster and make sure each worker has the fingerprint
# package loaded and a copy of the fingerprint list
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(fingerprint))
clusterExport(cl, "fps")

# each worker computes the similarities of its fingerprints against the full set
sim <- parLapply(cl, fps, function(x) {
  unlist(lapply(fps, function(y) distance(x, y)))
})

stopCluster(cl)