<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>So much to do, so little time</title>
	<atom:link href="http://blog.rguha.net/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.rguha.net</link>
	<description>Trying to squeeze sense out of chemical data</description>
	<lastBuildDate>Thu, 05 Jan 2012 04:47:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Words, Sentences, Fragments &amp; Molecules</title>
		<link>http://blog.rguha.net/?p=997</link>
		<comments>http://blog.rguha.net/?p=997#comments</comments>
		<pubDate>Thu, 05 Jan 2012 04:45:48 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[chembl]]></category>
		<category><![CDATA[cluster]]></category>
		<category><![CDATA[dirichlet]]></category>
		<category><![CDATA[fragment]]></category>
		<category><![CDATA[lda]]></category>
		<category><![CDATA[sar]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=997</guid>
		<description><![CDATA[For some time I have been thinking of the analogy between linguistics  (and text mining of language data) and chemistry, specifically from the point of view of fragments (though, the relationship between the two fields is actually quite long and deep, since many techniques from IR have been employed in cheminformatics). For example, atoms [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">For some time I have been thinking of the analogy between linguistics  (and text mining of language data) and chemistry, specifically from the point of view of fragments (though, the relationship between the two fields is actually quite long and deep, since many techniques from <a href="http://en.wikipedia.org/wiki/Information_retrieval">IR</a> have been employed in cheminformatics). For example, atoms and bonds can be considered an &#8220;alphabet&#8221; for chemical structures. Going one level up, one can consider fragments as words, which can be joined together to form larger structures (with the linguistic analog being sentences). In a <a href="http://www.slideshare.net/rguha/datadrivenlifesciences-thepyramidsmeetthetowerofbabel">talk</a> I gave at the ACS some time back I compared fragments with <a href="http://en.wikipedia.org/wiki/N-gram">n-grams</a> (though <a href="http://dx.doi.org/10.1021/ci0496797">LINGO</a>s are probably a more direct analog).</p>
<p style="text-align: justify;">On these lines I have been playing with text mining and modeling tools in R, mainly via the excellent <a href="http://cran.r-project.org/web/packages/tm/index.html">tm</a> package. One of the techniques I have been playing around with is <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a>. This is a generative modeling approach, allowing one to associate a document (composed of a set of words) with a &#8220;topic&#8221;. Here, a topic is a group of words that have a higher probability of being generated from that topic than another topic. The technique assumes that a document is comprised of a mixture of topics &#8211; as a result, one can assign a document to different topics with different probabilities. There have been a number of applications of LDA in bioinformatics with some applications focusing on topic models as a way to cluster objects such as genes [<a href="http://bioinformatics.oxfordjournals.org/content/21/15/3286.short">1</a>, <a href="http://www.springerlink.com/content/ut64836027727626/">2</a>], whereas others have used it in the more traditional document grouping context [<a href="http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0017243">3</a>].</p>
<p style="text-align: justify;">In a text mining scenario, developing an LDA model for a set of documents is relatively straightforward (in R) &#8211; perform a series of pre-processing steps (mainly to standardize the text) such as converting everything to lower case, removing <a href="http://en.wikipedia.org/wiki/Stop_words">stopwords</a> and so on. At the end of this one has a series of documents, each one being represented as a bag of words. The collection of words across all documents can be converted to a document-term matrix (documents in the rows, words in the columns) which is then used as input to the LDA routine.</p>
<p style="text-align: justify;">Those familiar with building predictive models with keyed fingerprints will find this quite familiar &#8211; the individual bit positions represent structural fragments, thus are the chemical analogs of words. Based on this observation I wondered what I would get (and what it would mean) by applying a technique like LDA to a collection of structures and their fragments.</p>
<p style="text-align: justify;">My initial thought is that the use of LDA to determine a set of topics for a collection of chemical structures is essentially a clustering of the molecules, with the terms associated with the topics being representative substructures for that &#8220;cluster&#8221;. With these topics in hand, it will be interesting to see what (or whether) properties (physical, chemical, biological) may be correlated with the clusters/topics identified. The rest of this post describes a quick first look at this, using <a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a> as the source of structures and R for performing pre-processing and modeling.</p>
<h2>Structures &amp; fragments</h2>
<p style="text-align: justify;">We had previously fragmented ChEMBL (v8) in house, so obtaining the data was just a matter of running an SQL query to identify all fragments that occurred in 50 or more molecules and retrieving their structures and the molecules they were associated with. This gives us 190,252 molecules covered by 6,110 fragments. While a traditional text document-based modeling project would involve a series of pre-processing steps, the only one I need to perform in this scenario is the removal of small (and thus likely very common) fragments such as benzene &#8211; the cheminformatics equivalent of removing stopwords. (Ideally I would also remove fragments that already occur in other fragments &#8211; the cheminformatics equivalent of <a href="http://en.wikipedia.org/wiki/Stemming">stemming</a>.)</p>
<p style="text-align: justify;">The data file I have is of the form</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br /></div></td><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">fragment_id, molregno, smiles, natom</div></td></tr></tbody></table></div>
<p>where natom is the number of atoms in the fragment. The R code to generate (relatively) clean data, ready to feed to the LDA function, looks like:</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br /></div></td><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">frags <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">read.<span style="">table</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'chembl.data'</span>, header<span style="color: #080;">=</span>TRUE, as.<span style="">is</span><span style="color: #080;">=</span>TRUE, <span style="color: #0000FF; font-weight: bold;">comment</span><span style="color: #080;">=</span><span style="color: #ff0000;">''</span>, sep<span style="color: #080;">=</span><span style="color: #ff0000;">','</span><span style="color: #080;">&#41;</span><br />
<span style="color: #0000FF; font-weight: bold;">names</span><span style="color: #080;">&#40;</span>frags<span style="color: #080;">&#41;</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">c</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'fid'</span>, <span style="color: #ff0000;">'molid'</span>, <span style="color: #ff0000;">'smiles'</span>, <span style="color: #ff0000;">'natom'</span><span style="color: #080;">&#41;</span><br />
frags <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">subset</span><span style="color: #080;">&#40;</span>frags, natom <span style="color: #080;">&gt;=</span> <span style="color: #ff0000;">8</span><span style="color: #080;">&#41;</span><br />
<span style="color: #228B22;">## now we create the &quot;documents&quot;</span><br />
tmp <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">by</span><span style="color: #080;">&#40;</span>frags, frags$molid, <span style="color: #0000FF; font-weight: bold;">function</span><span style="color: #080;">&#40;</span>x<span style="color: #080;">&#41;</span> <span style="color: #0000FF; font-weight: bold;">return</span><span style="color: #080;">&#40;</span> <span style="color: #0000FF; font-weight: bold;">c</span><span style="color: #080;">&#40;</span>x$molid<span style="color: #080;">&#91;</span><span style="color: #ff0000;">1</span><span style="color: #080;">&#93;</span>, <span style="color: #0000FF; font-weight: bold;">paste</span><span style="color: #080;">&#40;</span>x$smiles, collapse<span style="color: #080;">=</span><span style="color: #ff0000;">' '</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><br />
tmp <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">data.<span style="">frame</span></span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">do.<span style="">call</span></span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'rbind'</span>, tmp<span style="color: #080;">&#41;</span>, stringsAsFactors<span style="color: #080;">=</span>FALSE<span style="color: #080;">&#41;</span><br />
<span style="color: #0000FF; font-weight: bold;">names</span><span style="color: #080;">&#40;</span>tmp<span style="color: #080;">&#41;</span> <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">c</span><span style="color: #080;">&#40;</span><span style="color: #ff0000;">'title'</span>, <span style="color: #ff0000;">'text'</span><span style="color: #080;">&#41;</span></div></td></tr></tbody></table></div>
<p>In the code above, we rearrange the data to create &#8220;documents&#8221; &#8211; identified by a title (the molecule identifier) with the body of the document being the space concatenated SMILES for the fragments associated with that molecule. In other words, a molecule (document) is constructed from a set of fragments (words). With the data arranged in this form we can go ahead and reuse code from the <a href="http://cran.r-project.org/web/packages/tm/index.html">tm</a> and <a href="http://cran.r-project.org/web/packages/topicmodels/">topicmodels</a> packages.</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br />4<br /></div></td><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #228B22;">## Get a document-term matrix</span><br />
<span style="color: #0000FF; font-weight: bold;">library</span><span style="color: #080;">&#40;</span>tm<span style="color: #080;">&#41;</span><br />
corpus <span style="color: #080;">&lt;-</span> Corpus<span style="color: #080;">&#40;</span>VectorSource<span style="color: #080;">&#40;</span>tmp$text<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span><br />
dtm <span style="color: #080;">&lt;-</span> DocumentTermMatrix<span style="color: #080;">&#40;</span>corpus, control <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">list</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">tolower</span><span style="color: #080;">=</span>FALSE<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span></div></td></tr></tbody></table></div>
<p>Finally, we&#8217;re ready to develop some models, starting off with 6 topics.</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td style="padding:5px;text-align:center;color:#888888;background-color:#EEEEEE;border-right: 1px solid #9F9F9F;font: normal 12px/1.4em Monaco, Lucida Console, monospace;"><div>1<br />2<br />3<br /></div></td><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000FF; font-weight: bold;">library</span><span style="color: #080;">&#40;</span>topicmodels<span style="color: #080;">&#41;</span><br />
SEED <span style="color: #080;">&lt;-</span> <span style="color: #ff0000;">1234</span><br />
lda.<span style="">model</span> <span style="color: #080;">&lt;-</span> LDA<span style="color: #080;">&#40;</span>dtm, k<span style="color: #080;">=</span><span style="color: #ff0000;">6</span>, control<span style="color: #080;">=</span><span style="color: #0000FF; font-weight: bold;">list</span><span style="color: #080;">&#40;</span>seed<span style="color: #080;">=</span>SEED<span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span></div></td></tr></tbody></table></div>
<p>So, what are the topics that have been identified? As I noted above, each topic is really a set of &#8220;words&#8221; that have a higher probability of being generated by that topic. In the case of this model we obtain the following top 4 fragments associated with each topic (most likely fragments are at the top of the table):</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2012/01/terms-6.png"><img class="aligncenter size-medium wp-image-1001" title="terms-6" src="http://blog.rguha.net/wp-content/uploads/2012/01/terms-6.png" alt="" width="300" height="209" /></a></p>
<p style="text-align: justify;">Visual inspection clearly suggests distinct differences in the topics &#8211; topic 1 appears to be characterized primarily by the lack of aromaticity, whereas topic 2 appears to be characterized by quinoline and indole type structures. This is just a rough inspection of the most likely &#8220;terms&#8221; for each topic. It&#8217;s also interesting to look at how the molecules (a.k.a., documents) are assigned to the topics. The barchart indicates the distribution of molecules amongst the 6 topics. <a href="http://blog.rguha.net/wp-content/uploads/2012/01/m6-counts.png"><img class="alignright size-full wp-image-1000" style="margin-left: 5px; margin-right: 5px;" title="m6-counts" src="http://blog.rguha.net/wp-content/uploads/2012/01/m6-counts.png" alt="" width="300" height="300" /></a></p>
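<p style="text-align: justify;">As a rough sketch of how these two views can be pulled out of a fitted model: the <code>terms()</code> and <code>topics()</code> accessors in topicmodels return the most likely terms for each topic and the most likely topic for each document. The toy corpus below uses made-up fragment labels (<code>fragA</code>, <code>fragB</code>, and so on) in place of SMILES so that it runs on its own; the object names are illustrative, and the same calls should apply to the <code>lda.model</code> object fitted above.</p>
<pre>
library(tm)
library(topicmodels)

## Toy "molecules": each string lists hypothetical fragment labels
docs &lt;- c("fragA fragB fragC", "fragA fragB", "fragB fragC fragA",
          "fragX fragY fragZ", "fragX fragY", "fragZ fragX fragY")
corpus &lt;- Corpus(VectorSource(docs))
dtm &lt;- DocumentTermMatrix(corpus, control = list(tolower = FALSE))

toy.model &lt;- LDA(dtm, k = 2, control = list(seed = 1234))

## Top 3 fragments per topic (rows are ranks, columns are topics)
top.terms &lt;- terms(toy.model, 3)

## Most likely topic for each molecule
assignments &lt;- topics(toy.model)

## Molecules per topic, i.e. the counts behind a barchart of assignments
counts &lt;- table(assignments)

print(top.terms)
print(counts)
</pre>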
<p style="text-align: justify;">As with other unsupervised clustering methods, the choice of k (i.e., the number of topics) is tricky. <em>A priori</em> there is no reason to choose one over the other. Blei in his <a href="http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf">original paper</a> used &#8220;perplexity&#8221; as a measure of the model&#8217;s generalizability (smaller values are better). In this case, we can vary k and evaluate the perplexity: with 6 topics the perplexity is 1122, with 12 topics it drops to 786 and with 100 topics it drops to 308 &#8211; you can see that it seems to decrease continuously as the number of topics increases (which has been <a href="https://lists.cs.princeton.edu/pipermail/topic-models/2008-February/000174.html">observed elsewhere</a>, though in my case, the hyperparameters are kept constant). <a href="http://www.cs.umass.edu/~mimno/papers/wallach09evaluation.pdf">Wallach et al</a> have discussed various approaches to evaluating topic models.</p>
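<p style="text-align: justify;">A minimal sketch of such a scan over k, assuming models fitted with the default VEM method, for which <code>perplexity()</code> in topicmodels can be called on the fitted object directly. The corpus is a toy one with hypothetical fragment labels, so the values it produces say nothing about ChEMBL; for the real data one would substitute the <code>dtm</code> built earlier and values of k such as 6, 12 and 100.</p>
<pre>
library(tm)
library(topicmodels)

## Toy corpus of hypothetical fragment labels, standing in for the real data
docs &lt;- c("fragA fragB fragC", "fragA fragB", "fragB fragC fragA",
          "fragX fragY fragZ", "fragX fragY", "fragZ fragX fragY")
dtm &lt;- DocumentTermMatrix(Corpus(VectorSource(docs)),
                          control = list(tolower = FALSE))

## Fit one model per candidate k and record its in-sample perplexity
ks &lt;- c(2, 3, 4)
perp &lt;- sapply(ks, function(k) {
  fit &lt;- LDA(dtm, k = k, control = list(seed = 1234))
  perplexity(fit)
})
names(perp) &lt;- ks
print(perp)
</pre>
<p style="text-align: justify;">Note that this is perplexity on the data used for fitting; a held-out set would give a fairer picture of generalizability.</p>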
<p style="text-align: justify;">Numerical evaluation of these models is useful, but we&#8217;re more interested in how these assignments correlate with chemical or biological features. First, one could look at the structural homogeneity of the molecules assigned to topics. For k = 6, this is probably not useful, as the individual groups are very large. With k = 100 one obtains a much more sensible estimate of homogeneity (but this is to be expected). Another way to evaluate the topics from a chemical point of view is to look at some property or activity. Given that ChEMBL provides assay and target information for the molecules, we have many ways to perform this evaluation. As a brief example, we can consider activity distributions derived from the molecules associated with each topic. Most ChEMBL molecules have multiple activities associated with them as many are tested in multiple assays. To allow comparison of activities across assays, we converted the activities in a given assay to Z-scores. Then for each molecule, we identified the minimum activity, only considering those activities that were annotated as IC<sub>50</sub> and as exact (i.e., not &lt; or &gt;). After removal of a few extreme outliers we obtain:</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2012/01/topic-activity.png"><img class="alignleft size-medium wp-image-1002" style="border-image: initial; margin: 5px;" title="topic-activity" src="http://blog.rguha.net/wp-content/uploads/2012/01/topic-activity-300x200.png" alt="" width="300" height="200" /></a></p>
<p style="text-align: justify;">Clearly, within each group, the Z-scores cluster tightly around 0. It appears that the groups differentiate from each other in terms of the extreme values. Indeed plotting summary statistics for each group confirms this &#8211; in fact the median Z-score has a range of 0.05 and the mean Z-score a range of 0.11 across the six groups. In other words, the bulk of the groups are quite similar.</p>
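<p style="text-align: justify;">The activity processing described above can be sketched in base R as follows. The column names and values are hypothetical (this is not the ChEMBL schema), and filtering before standardizing is my assumption for the sketch, since the order matters only when an assay mixes annotation types.</p>
<pre>
## Toy activity table: one row per measurement (all values are made up)
acts &lt;- data.frame(
  molid    = c(1, 1, 2, 2, 3, 3, 3),
  topic    = c(1, 1, 1, 1, 2, 2, 2),
  assay    = c("a1", "a2", "a1", "a2", "a1", "a2", "a2"),
  type     = c("IC50", "IC50", "IC50", "IC50", "IC50", "IC50", "IC50"),
  relation = c("=", "=", "=", "=", "=", "=", "&gt;"),
  value    = c(10, 200, 20, 400, 30, 600, 1000)
)

## Keep exact IC50 values only
exact &lt;- subset(acts, type == "IC50" &amp; relation == "=")

## Convert to Z-scores within each assay
exact$z &lt;- ave(exact$value, exact$assay,
               FUN = function(v) (v - mean(v)) / sd(v))

## Minimum Z-score per molecule, carrying the topic assignment along
min.z &lt;- aggregate(z ~ molid + topic, data = exact, FUN = min)

## Summary statistic per topic
topic.median &lt;- tapply(min.z$z, min.z$topic, median)
print(min.z)
print(topic.median)
</pre>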
<h2>Other possibilities</h2>
<p style="text-align: justify;">The example shown here is rather simplistic and is the equivalent of unsupervised clustering. One obvious next step is to search the parameter space of the LDA model, evaluate different approaches to estimating the posterior distribution (<a href="http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm">EM</a> or <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampling</a>) and so on. A number of extensions to the basic LDA technique have been proposed, one of them being a supervised form of LDA.</p>
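<p style="text-align: justify;">For instance, topicmodels exposes both estimation approaches through the <code>method</code> argument of <code>LDA()</code>. The sketch below fits the same toy corpus (hypothetical fragment labels again) with variational EM and with Gibbs sampling; the <code>burnin</code> and <code>iter</code> values are arbitrary choices for illustration, not tuned settings.</p>
<pre>
library(tm)
library(topicmodels)

docs &lt;- c("fragA fragB fragC", "fragA fragB", "fragB fragC fragA",
          "fragX fragY fragZ", "fragX fragY", "fragZ fragX fragY")
dtm &lt;- DocumentTermMatrix(Corpus(VectorSource(docs)),
                          control = list(tolower = FALSE))

## Variational EM (the default) versus collapsed Gibbs sampling
vem.fit   &lt;- LDA(dtm, k = 2, method = "VEM",
                 control = list(seed = 1234))
gibbs.fit &lt;- LDA(dtm, k = 2, method = "Gibbs",
                 control = list(seed = 1234, burnin = 100, iter = 500))

## Topic labels are arbitrary, so compare assignments with a cross-tabulation
agreement &lt;- table(VEM = topics(vem.fit), Gibbs = topics(gibbs.fit))
print(agreement)
</pre>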
<p style="text-align: justify;">It&#8217;d also be useful to look at this method on a slightly smaller, labeled dataset &#8211; I&#8217;ve run some preliminary experiments on the <a href="http://www.cheminformatics.org/datasets/bursi/">Bursi AMES</a> dataset but those results need a little more work. More generally, smaller datasets can be problematic as the number of unique fragments can be low. In addition, fewer observations mean that the estimates of the posterior distribution become fuzzier. One way around this is to develop a model on something like the ChEMBL dataset I used here and then apply that to smaller datasets. Obviously, this goes towards ideas of <a href="http://en.wikipedia.org/wiki/Applicability_Domain">applicability</a> &#8211; but given the size of ChEMBL, it may indeed &#8220;cover&#8221; many smaller datasets.</p>
<h2>Is this useful?</h2>
<p style="text-align: justify;">At first sight, it&#8217;s an interesting method that identifies groupings in an unsupervised manner. Of course, one could easily run <a href="http://en.wikipedia.org/wiki/K-means_clustering">k-means</a> or any of the hierarchical clustering methods to achieve the same result. However, the generative aspect of LDA models is what is of interest to me, but it also seems to be the part that is difficult to map to a chemical setting &#8211; unlike topics in a document, which one can (usually) understand based on the likely terms for that topic, it&#8217;s not clear what a topic is for a collection of molecules in an unsupervised setting. And then, how does one infer the meaning of a topic from fragments? While it&#8217;s true that certain fragments are associated with specific properties/activities, this is not a given (unlike words, where each one does have an individual meaning). Furthermore, in an unsupervised setting like the one I&#8217;ve described here, fishing for a correlation between (some set of) properties and groupings of molecules is probably not the way to go.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=997</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8220;Type-ahead&#8221; substructure searches</title>
		<link>http://blog.rguha.net/?p=993</link>
		<comments>http://blog.rguha.net/?p=993#comments</comments>
		<pubDate>Mon, 28 Nov 2011 16:07:48 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[fingerprint]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[jquery]]></category>
		<category><![CDATA[screen]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[substructure]]></category>
		<category><![CDATA[typeahead]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=993</guid>
		<description><![CDATA[The other day I was exchanging emails with John Van Drie regarding open challenges in cheminformatics (which I&#8217;ll say more about later). One of his comments concerned the slow speed of chemical searches
Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in.  Why are all chemical searches [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">The other day I was exchanging emails with <a href="http://www.vandrieresearch.com/">John Van Drie</a> regarding open challenges in cheminformatics (which I&#8217;ll say more about later). One of his comments concerned the slow speed of chemical searches:</p>
<blockquote><p>Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in.  Why are all chemical searches so sloooow? &#8230; Ideally, as you sketch your mol in, the searches should be happening at the same pace, like the typeahead feature.</p></blockquote>
<p style="text-align: justify;">Now, he doesn&#8217;t specifically mention what type of chemical search &#8211; it could be exact matches, similarity searches, substructure or pharmacophore searches. The first two can be done very quickly and lend themselves easily to type-ahead search interfaces. In light of the work my colleague <a href="http://tripod.nih.gov:8207/pcs/">has been doing</a>, the substructure searches are now also amenable to a type-ahead interface.</p>
<p style="text-align: justify;">So I quickly put together a simple <a href="http://rguha.net/code/java/otf.html">web page</a> that lets you type in a SMILES (or SMARTS) and as you type it retrieves the results of a substructure search via the <a href="http://tripod.nih.gov:8207/pcs/">NCTT Search Server</a> REST API. (In some cases the depiction is broken &#8211; that&#8217;s a bug on my side.) Of course, typing in SMILES is not the most intuitive of interfaces. Since Trung employs the <a href="http://web.chemdoodle.com/">ChemDoodle</a> <a href="http://web.chemdoodle.com/demos/sketcher">sketcher</a>, an ideal interface would respond to drawing events (say drawing a bond or adding atoms etc.) and pull up matches on the fly. Another obvious extension is to rank (or filter) the results &#8211; all the while, maintaining the near real time speed of the application.</p>
<p style="text-align: justify;">As I said <a href="http://blog.rguha.net/?p=983">before</a>, seriously fast substructure searches. It also helps that I can build these examples via a public REST API. I&#8217;m sure there are reasons for <a href="http://www.chemspider.com/blog/how-to-use-chemspider-webservices-when-programming-with-java.html">SOAP</a>, <a href="http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html">XML</a> and so on. But it&#8217;s 2011. So let&#8217;s help make extensions and mashups easier.</p>
<p style="text-align: justify;"><strong>UPDATE</strong>: Yes, it&#8217;s easy to create patterns (especially with SMARTS) that DoS the server. We have some filters for excessively generic patterns, so some queries may not behave in the expected manner.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=993</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Substructure Searches &#8211; High Speed, Large Scale</title>
		<link>http://blog.rguha.net/?p=983</link>
		<comments>http://blog.rguha.net/?p=983#comments</comments>
		<pubDate>Wed, 23 Nov 2011 01:09:15 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[bloom]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[fingerprint]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[substructure]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=983</guid>
		<description><![CDATA[My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won&#8217;t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available here). He&#8217;s provided a web interface allowing one to draw in substructure queries or [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">My <a href="http://nctt.nih.gov">NCTT</a> colleague, Trung Nguyen, recently <a href="http://tripod.nih.gov/?p=361">announced</a> a <a href="http://tripod.nih.gov/pcs/">prototype</a> chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won&#8217;t go into the detail of the underlying pre-screen and indexing methodology (though the sources are available <a href="http://tripod.nih.gov/files/search-server.zip">here</a>). He&#8217;s provided a <a href="http://tripod.nih.gov/pcs/">web interface</a> allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).</p>
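<p style="text-align: justify;">For readers unfamiliar with fingerprint pre-screening, the general principle (not Trung&#8217;s actual implementation, which I am not reproducing here) is that a database molecule can only contain the query as a substructure if every bit set in the query fingerprint is also set in the molecule&#8217;s fingerprint. The expensive exact match is then run only on the molecules that pass. A minimal sketch in R, with made-up 8-bit fingerprints:</p>
<pre>
## Hypothetical 8-bit fingerprints, one row per database molecule
db &lt;- rbind(
  m1 = c(1, 1, 0, 1, 0, 0, 1, 0),
  m2 = c(0, 1, 1, 0, 0, 1, 0, 0),
  m3 = c(1, 0, 0, 1, 1, 0, 0, 1)
)

## Hypothetical query fingerprint: bits 1 and 4 are set
query &lt;- c(1, 0, 0, 1, 0, 0, 0, 0)

## A molecule passes the screen if it has every bit that the query has
passes &lt;- apply(db, 1, function(fp) all(fp[query == 1] == 1))

## Only these candidates would go on to the exact substructure match
candidates &lt;- names(passes)[passes]
print(candidates)
</pre>
<p style="text-align: justify;">Here <code>m1</code> and <code>m3</code> survive the screen and <code>m2</code> is discarded without any graph matching.</p>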
<p style="text-align: justify;"><em>It is blazingly fast.</em></p>
<p style="text-align: justify;">I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the <a href="http://mlsmr.glpg.com/MLSMR_HomePage/">MLSMR</a>. The 1000 structure subset is available <a href="http://blog.rguha.net/wp-content/uploads/2011/11/frags1k.txt">here</a>. For each query structure I record the number of hits, time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.</p>
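<p style="text-align: justify;">The shape of such a benchmark loop can be sketched as below. The <code>run_query()</code> function is a stand-in that I made up for illustration: it does not call the search server, and it returns a fake hit count. The sketch also measures client-side wall-clock time, whereas the times reported in this post were measured on the server.</p>
<pre>
## Hypothetical stand-in for a call to a search service
run_query &lt;- function(smiles) {
  Sys.sleep(0.01)          # pretend the request takes some time
  nchar(smiles) * 10       # fake hit count, for illustration only
}

queries &lt;- c("c1ccccc1C(=O)N", "C1CCNCC1", "c1ccc2[nH]ccc2c1")

## Record the number of hits and the elapsed time for each query
timings &lt;- do.call(rbind, lapply(queries, function(q) {
  elapsed &lt;- system.time(nhit &lt;- run_query(q))["elapsed"]
  data.frame(smiles = q, nhit = nhit, seconds = unname(elapsed))
}))

print(timings)
print(summary(timings$seconds))
</pre>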
<p style="text-align: justify;">The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2011/11/nhit-time-hist.png"><img class="aligncenter size-full wp-image-986" title="nhit-time-hist" src="http://blog.rguha.net/wp-content/uploads/2011/11/nhit-time-hist.png" alt="" width="700" height="350" /></a></p>
<p style="text-align: justify;">The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops off with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) and network transfer times.</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2011/11/nhit-time-natom-xyplot.png"><img class="aligncenter size-full wp-image-987" title="nhit-time-natom-xyplot" src="http://blog.rguha.net/wp-content/uploads/2011/11/nhit-time-natom-xyplot.png" alt="" width="700" height="350" /></a></p>
<p style="text-align: justify;">Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, &#8230; atoms) as there was just a single query structure with that number of atoms.</p>
<p style="text-align: center;"><a href="http://blog.rguha.net/wp-content/uploads/2011/11/natom-nhit-bwplot.png"><img class="aligncenter size-full wp-image-989" title="natom-nhit-bwplot" src="http://blog.rguha.net/wp-content/uploads/2011/11/natom-nhit-bwplot.png" alt="" width="700" height="262" /></a></p>
<p style="text-align: justify;">Overall, very impressive. And it&#8217;s something you can play with yourself.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=983</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Another Oracle Structure Search Cartridge</title>
		<link>http://blog.rguha.net/?p=981</link>
		<comments>http://blog.rguha.net/?p=981#comments</comments>
		<pubDate>Thu, 10 Nov 2011 23:00:22 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[Literature]]></category>
		<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[abcd]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[smarts]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=981</guid>
		<description><![CDATA[I came across an ASAP paper today describing substructure searching in Oracle databases. The paper comes from the folks at J &#38; J and is part of their series of papers on the ABCD platform. Performing substructure searches in databases is certainly not a new topic and  various products are out there that support [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I came across an <a href="http://pubs.acs.org/doi/pdf/10.1021/ci200413e">ASAP paper</a> today describing substructure searching in Oracle databases. The paper comes from the folks at J &amp; J and is part of their series of papers on the <a href="http://dx.doi.org/10.1021/ci700267w">ABCD</a> platform. Performing substructure searches in databases is certainly not a new topic and <a href="http://www.chemaxon.com/products/jchem-cartridge/">various</a> <a href="http://www.daylight.com/meetings/mug00/Delany/cartridge.html">products</a> are out there that support this in Oracle (as well as other RDBMSs). The paper describes how the ABCD system does this using a combination of structure-derived hash keys and an inverted bitset-based index, and discusses the implementation as an Oracle cartridge. The authors provide an interesting discussion of how their implementation supports <a href="http://download.oracle.com/docs/cd/B10501_01/server.920/a96533/optimops.htm">Cost Based Optimization</a> of SQL queries involving substructure search. The authors run a number of benchmarks. In terms of comparative benchmarks, they compare the performance (i.e., screening efficiency) of their hashed keys versus MACCS keys, CACTVS keys and OpenBabel FP2 fingerprints. Their results indicate that the screening step is a key bottleneck in the query process and that their hash key is generally more selective than the others.</p>
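<p style="text-align: justify;">To make the screening idea concrete, here is a minimal sketch in base R. It is not the ABCD implementation: the integer keys, the toy database and the function names <strong>build.inverted.index</strong> and <strong>screen.candidates</strong> are all made up for illustration. The index maps each key to the molecules that contain it, and a query is screened by intersecting the lists for its keys.</p>

```r
## Sketch of key-based screening with an inverted index (base R only).
## The keys are hypothetical integer feature ids, not the ABCD hash keys.
build.inverted.index <- function(keys.by.mol) {
  index <- list()
  for (mol.id in names(keys.by.mol)) {
    for (key in unique(keys.by.mol[[mol.id]])) {
      k <- as.character(key)
      ## append this molecule to the list of molecules containing the key
      index[[k]] <- c(index[[k]], mol.id)
    }
  }
  index
}

screen.candidates <- function(index, query.keys) {
  postings <- lapply(as.character(unique(query.keys)), function(k) index[[k]])
  ## a query key that is absent from the index means nothing can match
  if (any(vapply(postings, is.null, logical(1)))) return(character(0))
  ## a candidate must contain every key present in the query
  Reduce(intersect, postings)
}

## toy database: molecule id -> keys derived from its structure
db <- list(mol1 = c(3, 7, 12), mol2 = c(3, 12, 20), mol3 = c(7, 20))
idx <- build.inverted.index(db)
candidates <- screen.candidates(idx, c(3, 12))
print(candidates)
```

<p style="text-align: justify;">The molecules that survive are only candidates. Each one still has to go through a full substructure match, which is why a more selective key pays off: fewer candidates reach the expensive step.</p>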
<p style="text-align: justify;">Unfortunately, the paper does not provide what would have been most interesting: a comparison of the performance at the Oracle query level with other products such as <a href="http://www.chemaxon.com/products/jchem-cartridge/">JChem Cartridge</a> and <a href="http://orchem.sourceforge.net/">OrChem</a>. Furthermore, the test case is just under a million molecules from <a href="http://dx.doi.org/10.1021/ci8003013">Golovin &amp; Henrick</a> &#8211; the entire dataset (not just the keys) could probably reside in-memory on today&#8217;s servers. How does the system perform when faced with, say, PubChem (34 million molecules)? The paper mentions a command line implementation of their search procedure, but as far as I can tell, the Oracle cartridge is not available.</p>
<p style="text-align: justify;">The ABCD system has many useful and interesting features. But as with the other publications on this system, this paper is one more in the line of &#8220;<a href="http://rguha.wordpress.com/2009/02/01/papers-about-systems-you-cant-use-or-buy/">Papers About Systems You Can’t Use or Buy</a>&#8221;. Unfortunate.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=981</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cheminformatics and Clam Chowder</title>
		<link>http://blog.rguha.net/?p=978</link>
		<comments>http://blog.rguha.net/?p=978#comments</comments>
		<pubDate>Mon, 25 Jul 2011 02:35:06 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=978</guid>
		<description><![CDATA[The time has come to move again &#8211; though, in this case, it&#8217;s just a geographic move. From August I&#8217;ll be living in Manchester, CT (great cheeseburgers and lovely cycle routes) and will continue to work remotely for NCGC. I&#8217;ll be travelling to DC every month or so. The rest of the time I&#8217;ll be [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">The time has come to move again &#8211; though, in this case, it&#8217;s just a geographic move. From August I&#8217;ll be living in <a href="http://www.townofmanchester.org/">Manchester, CT</a> (great <a href="http://www.yelp.com/biz/shady-glen-dairy-stores-manchester-2">cheeseburgers</a> and lovely cycle routes) and will continue to work remotely for <a href="http://ncgc.nih.gov/">NCGC</a>. I&#8217;ll be travelling to DC every month or so. The rest of the time I&#8217;ll be working from Connecticut.</p>
<p style="text-align: justify;">Being new to the area, it&#8217;d be great to meet up over a beer, with people in the surrounding areas (NY/CT/RI) doing cheminformatics, predictive modeling and other life science related topics (any R user groups in the area?). If anybody&#8217;s interested, drop me a line (comment, <a href="mailto:rajarshi.guha@gmail.com">mail</a> or <a href="http://twitter.com/#!/rguha">@rguha</a>).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=978</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A New Round of Lightning Talks</title>
		<link>http://blog.rguha.net/?p=975</link>
		<comments>http://blog.rguha.net/?p=975#comments</comments>
		<pubDate>Fri, 22 Jul 2011 04:28:56 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[acs]]></category>
		<category><![CDATA[cinf]]></category>
		<category><![CDATA[lightning]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=975</guid>
		<description><![CDATA[With the 2011 Fall ACS meeting coming up in Denver next month, CINF will be hosting another round of lightning talks &#8211; 8 minutes to talk about anything related to cheminformatics and chemical information. As before, these talks won&#8217;t be managed via PACS, as a result of which we are taking short abstracts between July [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">With the 2011 Fall ACS meeting coming up in Denver next month, CINF will be hosting another round of lightning talks &#8211; 8 minutes to talk about anything related to cheminformatics and chemical information. As before, these talks won&#8217;t be managed via PACS, as a result of which we are taking short abstracts between July 14 and Aug 14. We hope that we&#8217;ll get to hear about interesting and recent stuff. Remember, this is meant to be a fun event so be creative! (You can see <a href="http://rguha.net/cinftmp/flash.html">slides</a> from the first run of this session last year).</p>
<p style="text-align: justify;">The full announcement is below:</p>
<blockquote><p>For the 2011 <a href="http://portal.acs.org/portal/acs/corg/content?_nfpb=true&amp;_pageLabel=PP_SUPERARTICLE&amp;node_id=516&amp;use_sec=false&amp;sec_url_var=region1&amp;__uuid=4fba804a-fdf3-49d8-a891-5c848c3b1339">Fall meeting</a> in Denver (Aug 28 &#8211; Sep 1), CINF will be running an experimental session of lightning talks &#8211; short, strictly timed talks. The session does not have a specific topic, however, all talks should be related to cheminformatics and chemical information. One of the key features of this session is that we will not be using the traditional <a href="http://abstracts.acs.org/">ACS abstract submission system</a>, since that system precludes the inclusion of recent work in the program.</p>
<p>So, since we will be accepting abstracts directly, the expectation is that they be about recent work and developments, rather than rehashes of year-old work. In addition, talks should not be verbal versions of posters submitted for this meeting. Given the short time limits we don&#8217;t expect great detail &#8211; but we are expecting compact and informative presentations.</p>
<p>That&#8217;s the challenge.</p>
<h3>What</h3>
<ul>
<li>Talks should be no longer than 8 minutes in length. At 8 minutes, you will be asked to stop.</li>
<li>Use as many slides as you want, as long as you can finish in 8 minutes</li>
<li>Talks should not be rehashes of poster presentations</li>
<li>Talks will run back to back, and questions &amp; discussion will be held off until the end</li>
</ul>
<p>If you haven&#8217;t participated in these types of talks before here are some suggestions:</p>
<ul>
<li>No more than three slides for a 5-minute talk (but if you can pull off 20 slides in 8 minutes, more power to you)</li>
<li>Avoid slides with too much text (and don&#8217;t paste PDFs of papers!)</li>
<li>A single chart per slide and make sure labels are readable at a distance</li>
</ul>
<h3>When</h3>
<p><strong>1:30pm, Wednesday, August 31st, 2011</strong></p>
<p>Submissions run from <strong>July 14 to Aug 14</strong></p>
<h3>Where</h3>
<p><strong>Room 112, Colorado Convention Center</strong></p>
<h3>How</h3>
<ul>
<li>Send in an abstract of about 100 &#8211; 120 words to cinf.flash@gmail.com</li>
<li>We will let you know if you will be speaking by <strong>Aug 21</strong> and we will need slide decks by <strong>Aug 24</strong></li>
<li>You must be registered for the meeting</li>
<li>Note that the usual publication/copyright rules apply</li>
<li>We will encourage live blogging and tweets (if we have net access)</li>
</ul>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=975</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New Versions of rcdk &amp; rcdklibs</title>
		<link>http://blog.rguha.net/?p=970</link>
		<comments>http://blog.rguha.net/?p=970#comments</comments>
		<pubDate>Sat, 18 Jun 2011 19:41:51 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[cdk]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=970</guid>
		<description><![CDATA[With the recent stable release of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the rcdk (3.1.1) and rcdklibs (1.3.11) packages that support cheminformatics in R. They&#8217;ve been pushed to CRAN and should be visible in a day or two. The new features [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">With the recent <a href="http://chem-bla-ics.blogspot.com/2011/06/httpchem-bla-icsblogspotcom201106cdk.html">stable release</a> of the CDK (1.3.12) and the inclusion of the new rendering classes, I was able to make a new release of the <a href="http://cran.r-project.org/web/packages/rcdk/index.html">rcdk</a> (3.1.1) and <a href="http://cran.r-project.org/web/packages/rcdklibs/index.html">rcdklibs</a> (1.3.11) packages that support cheminformatics in R. They&#8217;ve been pushed to CRAN and should be visible in a day or two. The new features in the latest version of rcdk include</p>
<ul>
<li>Directly evaluate molecular volume (based on <a href="http://dx.doi.org/10.1021/jo034808o">group contributions</a>) using <strong>get.volume</strong></li>
<li>Generate fingerprints using the hybridization state</li>
<li><strong>get.total.charge</strong> and <strong>get.total.formal.charge</strong> work sensibly</li>
<li>Added a function (<strong>copy.image.to.clipboard</strong>) that copies the 2D depiction of a molecule to the system clipboard in PNG format</li>
<li>Now, OS X users can view and copy molecule depictions. This is slower compared to the same operation on Windows or Linux since it involves shell&#8217;ing out via <strong>system</strong>. But it is better than not being able to view anything.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=970</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The CDK Volume Descriptor</title>
		<link>http://blog.rguha.net/?p=966</link>
		<comments>http://blog.rguha.net/?p=966#comments</comments>
		<pubDate>Fri, 17 Jun 2011 23:42:30 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[cdk]]></category>
		<category><![CDATA[descriptor]]></category>
		<category><![CDATA[qsar]]></category>
		<category><![CDATA[volume]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=966</guid>
		<description><![CDATA[Sometime back Egon implemented a simple group contribution based volume calculator and it made its way into the stable branch (1.4.x) today. As a result I put out a new version of the CDKDescUI which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Sometime back <a href="http://chem-bla-ics.blogspot.com/">Egon</a> implemented a simple <a href="http://dx.doi.org/10.1021/jo034808o">group contribution based volume</a> calculator and it <a href="http://chem-bla-ics.blogspot.com/2011/06/fast-calculation-of-van-der-waals.html?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+blogspot%2FmpIP+%28chem-bla-ics%29">made its way into</a> the stable branch (1.4.x) today. As a result I put out a new version of the <a href="http://rguha.net/code/java/cdkdesc.html">CDKDescUI</a> which includes a descriptor that wraps the new volume calculator as well as the hybridization fingerprinter that Egon also implemented recently. The volume descriptor (based on the VABCVolume class) is one that has been missing for some time (the NumericalSurface class does return a volume, but it&#8217;s slow). This class is reasonably fast (10,000 molecules processed in 32 sec) and correlates well with the 2D and pseudo-3D volume descriptors from MOE (2008.10) as shown below. As expected, the correlation is better with the 2D version of the descriptor (which is similar in nature to the lookup method used in the CDK version). The X-axis represents the CDK descriptor values.</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2011/06/vol-moe-cdk.png"><img class="aligncenter size-full wp-image-967" title="vol-moe-cdk" src="http://blog.rguha.net/wp-content/uploads/2011/06/vol-moe-cdk.png" alt="" width="500" height="300" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=966</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>New Version of fingerprint</title>
		<link>http://blog.rguha.net/?p=962</link>
		<comments>http://blog.rguha.net/?p=962#comments</comments>
		<pubDate>Fri, 03 Jun 2011 00:13:21 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[cheminformatics]]></category>
		<category><![CDATA[chemfp]]></category>
		<category><![CDATA[fingerprint]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=962</guid>
		<description><![CDATA[I&#8217;ve submitted version 3.4.3 of the fingerprint package to CRAN, so it should be available in a day or two. It&#8217;s an R package that lets you read in (chemical structure) fingerprint data from a variety of sources (CDK, MOE, BCI etc) and perform a variety of operations (bitwise, similarity, etc.) and visualizations on them.
The two [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I&#8217;ve submitted version 3.4.3 of the <a href="http://cran.r-project.org/web/packages/fingerprint/index.html">fingerprint</a> package to CRAN, so it should be available in a day or two. It&#8217;s an R package that lets you read in (chemical structure) fingerprint data from a variety of sources (CDK, MOE, BCI etc) and perform a variety of operations (bitwise, similarity, etc.) and visualizations on them.</p>
<p style="text-align: justify;">The two main additions to this version are</p>
<ul style="text-align: justify;">
<li>Read support for the new <a href="http://code.google.com/p/chem-fingerprints/wiki/FPS">FPS</a> fingerprint format described by <a href="http://www.dalkescientific.com/writings/diary/">Andrew Dalke</a> at the <a href="http://code.google.com/p/chem-fingerprints/">chemfp</a> project. Note that it currently discards some of the header information.</li>
<li>The fingerprint class now has a field, <em>misc</em> (a <a href="http://cran.r-project.org/doc/manuals/R-lang.html#List-objects">list</a>), that allows one to read in extra, arbitrary data that might be provided along with a fingerprint. Exactly what gets stored in this field depends on the line function used to read in the fingerprint data. Currently only the FPS parser returns extra data (when available) in this field.</li>
</ul>
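<p style="text-align: justify;">For a feel of what reading FPS data involves, here is a rough sketch in base R. It is not the parser used in the fingerprint package: the function names (<strong>parse.fps.lines</strong>, <strong>popcount</strong>, <strong>tanimoto</strong>) and the example records are invented for illustration. It assumes the basic FPS layout of header lines starting with # followed by tab-separated records (hex-encoded fingerprint, identifier, optional extra fields), and it keeps the extra fields in a <em>misc</em> element to mirror the idea described above.</p>

```r
## Minimal FPS reader sketch (base R); not the fingerprint package parser.
parse.fps.lines <- function(lines) {
  lines <- lines[nzchar(lines)]
  is.header <- grepl("^#", lines)
  fields <- strsplit(lines[!is.header], "\t", fixed = TRUE)
  fps <- lapply(fields, function(f) {
    hex <- f[1]
    ## decode the hex string two characters (one byte) at a time
    starts <- seq(1, nchar(hex), by = 2)
    bytes <- strtoi(substring(hex, starts, starts + 1), base = 16L)
    ## anything after the identifier is kept as extra data
    list(id = f[2], bytes = bytes, misc = f[-(1:2)])
  })
  list(header = lines[is.header], fps = fps)
}

## number of bits set across a vector of byte values (0 to 255)
popcount <- function(bytes) sum((rep(bytes, each = 8) %/% 2^(0:7)) %% 2)

## Tanimoto similarity from bit counts, so bit ordering does not matter
tanimoto <- function(a, b) {
  either <- popcount(bitwOr(a, b))
  if (either == 0) return(0)
  popcount(bitwAnd(a, b)) / either
}

## hypothetical 16-bit records, purely for illustration
example.lines <- c("#FPS1", "#num_bits=16", "ff00\tmolA", "0f00\tmolB\textra")
parsed <- parse.fps.lines(example.lines)
sim <- tanimoto(parsed$fps[[1]]$bytes, parsed$fps[[2]]$bytes)
print(sim)
```

<p style="text-align: justify;">With the package itself you would read the file through the package&#8217;s own reader and line function rather than anything like the above.</p>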
<p style="text-align: justify;">As always, you can get the package source directly from the GitHub <a href="https://github.com/rajarshi/cdkr">repository</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=962</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Accessing High Content Data from R</title>
		<link>http://blog.rguha.net/?p=951</link>
		<comments>http://blog.rguha.net/?p=951#comments</comments>
		<pubDate>Fri, 27 May 2011 05:01:58 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[hci]]></category>
		<category><![CDATA[hcs]]></category>
		<category><![CDATA[metaxpress]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=951</guid>
		<description><![CDATA[Over the last few months I&#8217;ve been getting involved in the informatics &#38; data mining aspects of high content screening. While I haven&#8217;t gotten into image analysis itself (there&#8217;s a ton of good code and tools already out there), I&#8217;ve been focusing on managing image data and meta-data and asking interesting questions of the voluminous, [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Over the last few months I&#8217;ve been getting involved in the informatics &amp; data mining aspects of high content screening. While I haven&#8217;t gotten into image analysis itself (there&#8217;s a ton of good code and tools already out there), I&#8217;ve been focusing on managing image data and meta-data and asking interesting questions of the voluminous, high-dimensional data that is generated by these techniques.</p>
<p style="text-align: justify;">One of our platforms is <a href="http://www.moleculardevices.com/Products/Instruments/High-Content-Screening/ImageXpress-Micro.html">ImageXpress</a> from <a href="http://www.moleculardevices.com/">Molecular Devices</a>, which stores images in a file-based image store and meta data and numerical image features in an Oracle database. While they do provide an API to interact with the database, it&#8217;s a Windows-only DLL. Since much of my modeling requires that I access the data from <a href="http://www.r-project.org/">R</a>, I needed a more flexible solution.</p>
<p style="text-align: justify;">So, I&#8217;ve put together an R <a href="http://blog.rguha.net/wp-content/uploads/2011/05/ncgchcs_0.93.tar.gz">package</a> that allows one to access numeric image data (i.e., descriptors) and images themselves. It depends on the <a href="http://cran.r-project.org/web/packages/ROracle/index.html">ROracle</a> package (which in turn requires an Oracle client installation).</p>
<p style="text-align: justify;">Currently the functionality is relatively limited, focusing on my common tasks. Thus for example, given assay plate barcodes, we can retrieve the assay ids that the plate is associated with and then for a given assay, obtain the cell-level image parameter data (or optionally, aggregate it to well-level data). This task is easily parallelizable &#8211; in fact when processing a high content RNAi screen, I make use of <a href="http://cran.r-project.org/web/packages/snow/index.html">snow</a> to speed up the data access and processing of 50 plates.</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000FF; font-weight: bold;">library</span><span style="color: #080;">&#40;</span>ncgchcs<span style="color: #080;">&#41;</span><br />
con <span style="color: #080;">&lt;-</span> get.<span style="">connection</span><span style="color: #080;">&#40;</span>user<span style="color: #080;">=</span><span style="color: #ff0000;">'foo'</span>, passwd<span style="color: #080;">=</span><span style="color: #ff0000;">'bar'</span>, sid<span style="color: #080;">=</span><span style="color: #ff0000;">'baz'</span><span style="color: #080;">&#41;</span><br />
plate.<span style="">barcode</span> <span style="color: #080;">&lt;-</span> <span style="color: #ff0000;">'XYZ1023'</span><br />
plate.<span style="">id</span> <span style="color: #080;">&lt;-</span> get.<span style="">plates</span><span style="color: #080;">&#40;</span>con, plate.<span style="">barcode</span><span style="color: #080;">&#41;</span><br />
<br />
<span style="color: #228B22;">## multiple analyses could be run on the same plate - we need</span><br />
<span style="color: #228B22;">## to get the correct one (MX uses 'assay' to refer to an analysis run)</span><br />
<span style="color: #228B22;">## so we first get details of analyses without retrieving the actual data</span><br />
details <span style="color: #080;">&lt;-</span> get.<span style="">assay</span>.<span style="">by</span>.<span style="">barcode</span><span style="color: #080;">&#40;</span>con, barcode<span style="color: #080;">=</span>plate.<span style="">barcode</span>, dry<span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span><br />
details <span style="color: #080;">&lt;-</span> <span style="color: #0000FF; font-weight: bold;">subset</span><span style="color: #080;">&#40;</span>details, PLATE_ID <span style="color: #080;">==</span> plate.<span style="">id</span> <span style="color: #080;">&amp;</span> SETTINGS_NAME <span style="color: #080;">==</span> assay.<span style="">name</span><span style="color: #080;">&#41;</span> <span style="color: #228B22;">## assay.name must hold the settings name of the analysis you want</span><br />
assay.<span style="">id</span> <span style="color: #080;">&lt;-</span> details$ASSAY_ID<br />
<br />
<span style="color: #228B22;">## finally, get the analysis data, using median to aggregate cell-level data</span><br />
hcs.<span style="">data</span> <span style="color: #080;">&lt;-</span> &nbsp;get.<span style="">assay</span><span style="color: #080;">&#40;</span>con, assay.<span style="">id</span>, aggregate.<span style="">func</span><span style="color: #080;">=</span><span style="color: #0000FF; font-weight: bold;">median</span>, verbose<span style="color: #080;">=</span>FALSE, na.<span style="">rm</span><span style="color: #080;">=</span>TRUE<span style="color: #080;">&#41;</span></div></td></tr></tbody></table></div>
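<p style="text-align: justify;">To show what the cell-level to well-level aggregation amounts to, here is a small base R sketch. It does not use ncgchcs or a database and says nothing about how the package does this internally: the data frame and its column names (WELL, AREA, INTENSITY) are made up, and only the idea of summarizing each feature per well with the median, ignoring missing values, is taken from the call above.</p>

```r
## Hypothetical cell-level data: one row per cell, labelled by well
cells <- data.frame(
  WELL = c("A01", "A01", "A01", "B02", "B02"),
  AREA = c(10, 12, NA, 30, 34),
  INTENSITY = c(1, 3, 2, 5, 7)
)

## One row per well; na.rm = TRUE is passed through to median()
wells <- aggregate(cells[, c("AREA", "INTENSITY")],
                   by = list(WELL = cells$WELL),
                   FUN = median, na.rm = TRUE)
print(wells)
```

<p style="text-align: justify;">Swapping <strong>median</strong> for another summary function changes the aggregation in the same way as changing the aggregating function in the call above.</p>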
<p style="text-align: justify;">Alternatively, given a plate id (this is the internal MetaXpress plate id) and a well location, one can obtain the path to the relevant image(s). With the images in hand, you could use <a href="http://www.bioconductor.org/packages/release/bioc/html/EBImage.html">EBImage</a> to perform image processing entirely in R.</p>
<div class="codecolorer-container rsplus twitlight" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="rsplus codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap"><span style="color: #0000FF; font-weight: bold;">library</span><span style="color: #080;">&#40;</span>ncgchcs<span style="color: #080;">&#41;</span><br />
<span style="color: #228B22;">## will want to set IMG.STORE.LOC to point to your image store</span><br />
con <span style="color: #080;">&lt;-</span> get.<span style="">connection</span><span style="color: #080;">&#40;</span>user<span style="color: #080;">=</span><span style="color: #ff0000;">'foo'</span>, passwd<span style="color: #080;">=</span><span style="color: #ff0000;">'bar'</span>, sid<span style="color: #080;">=</span><span style="color: #ff0000;">'baz'</span><span style="color: #080;">&#41;</span><br />
plate.<span style="">barcode</span> <span style="color: #080;">&lt;-</span> <span style="color: #ff0000;">'XYZ1023'</span><br />
plate.<span style="">id</span> <span style="color: #080;">&lt;-</span> get.<span style="">plates</span><span style="color: #080;">&#40;</span>con, plate.<span style="">barcode</span><span style="color: #080;">&#41;</span><br />
get.<span style="">image</span>.<span style="">path</span><span style="color: #080;">&#40;</span>con, plate.<span style="">id</span>, <span style="color: #ff0000;">4</span>, <span style="color: #ff0000;">4</span><span style="color: #080;">&#41;</span> <span style="color: #228B22;">## get images for all sites &amp; wavelengths</span></div></td></tr></tbody></table></div>
<p style="text-align: justify;">Currently, you cannot get the internal plate id based on the user-assigned plate name (which is usually different from the barcode). Also, the documentation is nonexistent, so you need to explore the package to learn the functions. If there&#8217;s interest I&#8217;ll put in Rd pages down the line. As a side note, we also have a Java interface to the MetaXpress database that is being used to drive a REST interface to make our imaging data accessible via the web.
</p>
<p style="text-align: justify;">Of course, this is all specific to the ImageXpress platform &#8211; we have others such as InCell and Acumen. To have a comprehensive solution for all our imaging, I&#8217;m looking at the <a href="http://www.openmicroscopy.org/site">OME</a> infrastructure as a means of, at the very least, having a unified interface to the images and their meta data.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=951</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

