<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>So much to do, so little time</title>
	<atom:link href="http://blog.rguha.net/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.rguha.net</link>
	<description>Trying to squeeze sense out of chemical data</description>
	<pubDate>Mon, 08 Feb 2010 02:18:31 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
	<language>en</language>
			<item>
		<title>Molecules &#038; MongoDB - Numbers and Thoughts</title>
		<link>http://blog.rguha.net/?p=472</link>
		<comments>http://blog.rguha.net/?p=472#comments</comments>
		<pubDate>Mon, 08 Feb 2010 02:18:31 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[software]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[mongodb]]></category>

		<category><![CDATA[nosql]]></category>

		<category><![CDATA[openbabel]]></category>

		<category><![CDATA[python]]></category>

		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=472</guid>
		<description><![CDATA[In my previous post I had mentioned that key/value or non-relational data stores could be useful in certain cheminformatics applications. I had started playing around with MongoDB and following Rich&#8217;s example, I thought I&#8217;d put it through its paces using data from PubChem.
Installing MongoDB was pretty trivial. I downloaded the 64 bit version for OS [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">In my <a href="http://blog.rguha.net/?p=470">previous</a> post I had mentioned that key/value or non-relational data stores could be useful in certain cheminformatics applications. I had started playing around with <a href="http://www.mongodb.org/display/DOCS/Home">MongoDB</a> and following Rich&#8217;s <a href="http://depth-first.com/articles/2010/01/20/pubcouch-a-couchdb-interface-to-pubchem">example</a>, I thought I&#8217;d put it through its paces using data from <a href="http://pubchem.ncbi.nlm.nih.gov/">PubChem</a>.</p>
<p style="text-align: justify;">Installing MongoDB was pretty trivial. I downloaded the <a href="http://downloads.mongodb.org/osx/mongodb-osx-x86_64-1.2.2.tgz">64 bit version for OS X</a>, unpacked it and then simply started the server process:</p>
<div class="codecolorer-container bash " style="overflow:auto;white-space:nowrap;width:685px"><table cellspacing="0" cellpadding="0"><tbody><tr><td class="line-numbers"><div>1<br /></div></td><td><div class="bash codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="re1">$MONGO_HOME</span><span class="sy0">/</span>bin<span class="sy0">/</span>mongod <span class="re5">--dbpath</span>=<span class="re1">$HOME</span><span class="sy0">/</span>src<span class="sy0">/</span>mdb<span class="sy0">/</span>db</div></td></tr></tbody></table></div>
<p style="text-align: justify;">where <i>$HOME/src/mdb/db</i> is the directory in which the database will store the actual data. The simplicity is certainly nice. Next, I needed the <a href="http://pypi.python.org/pypi/pymongo/">Python bindings</a>. With <a href="http://pypi.python.org/pypi/setuptools">easy_install</a>, this was quite painless. At this point I had everything in hand to start playing with MongoDB.</p>
<h3><strong>Getting data</strong></h3>
<p style="text-align: justify;">The first step was to get some data from PubChem. This is pretty easy via their FTP site. I was a bit lazy, so I just made calls to wget, rather than use <a href="http://docs.python.org/library/ftplib.html">ftplib</a>. The code below will retrieve the first 80 PubChem SD files and uncompress them into the current directory.</p>
<div class="codecolorer-container python " style="overflow:auto;white-space:nowrap;width:685px"><table cellspacing="0" cellpadding="0"><tbody><tr><td class="line-numbers"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br /></div></td><td><div class="python codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="kw1">import</span> <span class="kw3">glob</span>, <span class="kw3">sys</span>, <span class="kw3">os</span>, <span class="kw3">time</span>, <span class="kw3">random</span>, <span class="kw3">urllib</span><br />
<br />
<span class="kw1">def</span> getfiles<span class="br0">&#40;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; n = <span class="nu0">0</span><br />
&nbsp; &nbsp; nmax = <span class="nu0">80</span><br />
&nbsp; &nbsp; <span class="kw1">for</span> o <span class="kw1">in</span> <span class="kw3">urllib</span>.<span class="me1">urlopen</span><span class="br0">&#40;</span><span class="st0">'ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/'</span><span class="br0">&#41;</span>.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; o = o.<span class="me1">strip</span><span class="br0">&#40;</span><span class="br0">&#41;</span>.<span class="me1">split</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="br0">&#91;</span><span class="nu0">5</span><span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">os</span>.<span class="me1">system</span><span class="br0">&#40;</span><span class="st0">'wget %s/%s'</span> <span class="sy0">%</span> <span class="br0">&#40;</span><span class="st0">'ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/'</span>, o<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">os</span>.<span class="me1">system</span><span class="br0">&#40;</span><span class="st0">'gzip -d %s'</span> <span class="sy0">%</span> <span class="br0">&#40;</span>o<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; n += <span class="nu0">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">sys</span>.<span class="me1">stdout</span>.<span class="me1">write</span><span class="br0">&#40;</span><span class="st0">'Got n = %d, %s<span class="es0">\r</span>'</span> <span class="sy0">%</span> <span class="br0">&#40;</span>n,o<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">sys</span>.<span class="me1">stdout</span>.<span class="me1">flush</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> n == nmax: <span class="kw1">return</span></div></td></tr></tbody></table></div>
<p style="text-align: justify;">This gives us a total of 1,641,250 molecules.</p>
<h3><strong>Loading data</strong></h3>
<p style="text-align: justify;">With the MongoDB instance running, we&#8217;re ready to connect and insert records into it. For this test, I simply loop over each molecule in each SD file and create a record consisting of the PubChem CID and all the SD tags for that molecule. In this context a record is simply a Python dict, with the SD tags being the keys and the tag values being the values. Since I know the PubChem CID is unique in this collection, I set the special document key &#8220;_id&#8221; (essentially, the primary key) to the CID. The code to perform this uses the Python bindings to <a href="http://openbabel.org/wiki/Main_Page">OpenBabel</a>:</p>
<div class="codecolorer-container python " style="overflow:auto;white-space:nowrap;width:685px;height:300px"><table cellspacing="0" cellpadding="0"><tbody><tr><td class="line-numbers"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br />14<br />15<br />16<br />17<br />18<br />19<br />20<br />21<br />22<br />23<br />24<br />25<br />26<br />27<br />28<br />29<br />30<br />31<br />32<br />33<br />34<br />35<br />36<br />37<br />38<br />39<br />40<br />41<br />42<br />43<br />44<br />45<br />46<br />47<br /></div></td><td><div class="python codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="kw1">from</span> openbabel <span class="kw1">import</span> <span class="sy0">*</span><br />
<span class="kw1">import</span> <span class="kw3">glob</span>, <span class="kw3">sys</span>, <span class="kw3">os</span><br />
<span class="kw1">from</span> pymongo <span class="kw1">import</span> Connection<br />
<span class="kw1">from</span> pymongo <span class="kw1">import</span> DESCENDING<br />
<br />
<span class="kw1">def</span> loadDB<span class="br0">&#40;</span>recreate = <span class="kw2">True</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; conn = Connection<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; db = conn.<span class="me1">chem</span><br />
&nbsp; &nbsp; <span class="kw1">if</span> <span class="st0">'mol2d'</span> <span class="kw1">in</span> db.<span class="me1">collection_names</span><span class="br0">&#40;</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> recreate:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">print</span> <span class="st0">'Deleting mol2d collection'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; db.<span class="me1">drop_collection</span><span class="br0">&#40;</span><span class="st0">'mol2d'</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">else</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">print</span> <span class="st0">'mol2d exists. Will not reload data'</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">return</span><br />
&nbsp; &nbsp; coll = db.<span class="me1">mol2d</span><br />
<br />
&nbsp; &nbsp; obconversion = OBConversion<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; obconversion.<span class="me1">SetInFormat</span><span class="br0">&#40;</span><span class="st0">&quot;sdf&quot;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; obmol = OBMol<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
<br />
&nbsp; &nbsp; n = <span class="nu0">0</span><br />
&nbsp; &nbsp; files = <span class="kw3">glob</span>.<span class="kw3">glob</span><span class="br0">&#40;</span><span class="st0">&quot;*.sdf&quot;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">for</span> f <span class="kw1">in</span> files:<br />
&nbsp; &nbsp; &nbsp; &nbsp; notatend = obconversion.<span class="me1">ReadFile</span><span class="br0">&#40;</span>obmol,f<span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">while</span> notatend:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; doc = <span class="br0">&#123;</span><span class="br0">&#125;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; sdd = <span class="br0">&#91;</span>toPairData<span class="br0">&#40;</span>x<span class="br0">&#41;</span> <span class="kw1">for</span> x <span class="kw1">in</span> obmol.<span class="me1">GetData</span><span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="kw1">if</span> x.<span class="me1">GetDataType</span><span class="br0">&#40;</span><span class="br0">&#41;</span>==PairData<span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">for</span> entry <span class="kw1">in</span> sdd:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; key = entry.<span class="me1">GetAttribute</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; value = entry.<span class="me1">GetValue</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; doc<span class="br0">&#91;</span>key<span class="br0">&#93;</span> = value<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; doc<span class="br0">&#91;</span><span class="st0">'_id'</span><span class="br0">&#93;</span> = obmol.<span class="me1">GetTitle</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; coll.<span class="me1">insert</span><span class="br0">&#40;</span>doc<span class="br0">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; obmol = OBMol<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; notatend = obconversion.<span class="me1">Read</span><span class="br0">&#40;</span>obmol<span class="br0">&#41;</span><br />
<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; n += <span class="nu0">1</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> n <span class="sy0">%</span> <span class="nu0">100</span> == <span class="nu0">0</span>:<br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">sys</span>.<span class="me1">stdout</span>.<span class="me1">write</span><span class="br0">&#40;</span><span class="st0">'Processed %d<span class="es0">\r</span>'</span> <span class="sy0">%</span> <span class="br0">&#40;</span>n<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">sys</span>.<span class="me1">stdout</span>.<span class="me1">flush</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
<br />
&nbsp; &nbsp; <span class="kw1">print</span> <span class="st0">'Processed %d molecules'</span> <span class="sy0">%</span> <span class="br0">&#40;</span>n<span class="br0">&#41;</span><br />
<br />
&nbsp; &nbsp; coll.<span class="me1">create_index</span><span class="br0">&#40;</span><span class="br0">&#91;</span> <span class="br0">&#40;</span><span class="st0">'PUBCHEM_HEAVY_ATOM_COUNT'</span>, DESCENDING<span class="br0">&#41;</span> &nbsp;<span class="br0">&#93;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; coll.<span class="me1">create_index</span><span class="br0">&#40;</span><span class="br0">&#91;</span> <span class="br0">&#40;</span><span class="st0">'PUBCHEM_MOLECULAR_WEIGHT'</span>, DESCENDING<span class="br0">&#41;</span> &nbsp;<span class="br0">&#93;</span><span class="br0">&#41;</span></div></td></tr></tbody></table></div>
<p style="text-align: justify;">Note that this example loads each molecule on its own and takes a total of 2015.020 sec. It has been noted that bulk loading (i.e., inserting a list of documents rather than individual documents) can be more efficient. I tried this, loading 1000 molecules at a time, but this time round the load time was 2224.691 sec - certainly not an improvement!</p>
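<p style="text-align: justify;">For readers who want to try the bulk variant, here is a minimal sketch of the batching step. It is not the code used for the timings above; the helper name <code>batches</code> and the default batch size are assumptions made for illustration. It only groups an iterable of documents into lists of a fixed size, and each list would then be handed to a single insert call.</p>
<pre><code class="python">from itertools import islice

def batches(docs, size=1000):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with the collection from loadDB() (assumes the
# driver's insert call accepts a list of documents):
#     for chunk in batches(all_docs, 1000):
#         coll.insert(chunk)
</code></pre>
<p style="text-align: justify;">Because the helper is a generator, only one batch of documents needs to be held in memory at a time.</p>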
<p style="text-align: justify;">Note that the &#8220;_id&#8221; key is a &#8220;primary key&#8217; and thus queries on this field are extremely fast. MongoDB also supports indexes and the code above implements an index on the PUBCHEM_HEAVY_ATOM_COUNT field.</p>
<h3><strong>Queries</strong></h3>
<p style="text-align: justify;">The simplest query is to pull up records based on CID. I selected 8000 CIDs randomly and evaluated how long it&#8217;d take to pull up the records from the database:</p>
<div class="codecolorer-container python " style="overflow:auto;white-space:nowrap;width:685px"><table cellspacing="0" cellpadding="0"><tbody><tr><td class="line-numbers"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br /></div></td><td><div class="python codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="kw1">from</span> pymongo <span class="kw1">import</span> Connection<br />
<br />
<span class="kw1">def</span> timeQueryByCID<span class="br0">&#40;</span>cids<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; conn = Connection<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; db = conn.<span class="me1">chem</span><br />
&nbsp; &nbsp; coll = db.<span class="me1">mol2d</span><br />
&nbsp; &nbsp; <span class="kw1">for</span> cid <span class="kw1">in</span> cids:<br />
&nbsp; &nbsp; &nbsp; &nbsp; result = coll.<span class="me1">find</span><span class="br0">&#40;</span> <span class="br0">&#123;</span><span class="st0">'_id'</span> : cid<span class="br0">&#125;</span> <span class="br0">&#41;</span>.<span class="me1">explain</span><span class="br0">&#40;</span><span class="br0">&#41;</span></div></td></tr></tbody></table></div>
<p style="text-align: justify;">The above code takes 2351.95 ms, averaged over 5 runs. This comes out to about 0.3 ms per query. Not bad!</p>
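<p style="text-align: justify;">The timing code itself is not shown above. A minimal sketch of how such a measurement could be made follows; the helper name <code>time_lookups</code> and its arguments are assumptions for illustration, and any callable that takes a CID can be passed in as <code>lookup</code>.</p>
<pre><code class="python">import time

def time_lookups(lookup, keys, runs=5):
    """Return (mean total ms per run, mean ms per lookup)."""
    totals = []
    for _ in range(runs):
        start = time.time()
        for k in keys:
            lookup(k)
        # elapsed wall-clock time for this run, in milliseconds
        totals.append((time.time() - start) * 1000.0)
    mean_total = sum(totals) / len(totals)
    return mean_total, mean_total / len(keys)
</code></pre>
<p style="text-align: justify;">With the collection above, one might pass something like <code>lambda cid: coll.find_one({'_id': cid})</code> as the lookup. For reference, 2351.95 ms over 8000 CIDs works out to roughly 0.29 ms per query.</p>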
<p style="text-align: justify;">Next, let&#8217;s look at queries that use the heavy atom count field that we had indexed. For this test I selected 30 heavy atom count values randomly and for each value performed the query. I retrieved the query time as well as the number of hits via <a href="http://api.mongodb.org/python/1.4%2B/tutorial.html">explain()</a>.</p>
<div class="codecolorer-container python " style="overflow:auto;white-space:nowrap;width:685px"><table cellspacing="0" cellpadding="0"><tbody><tr><td class="line-numbers"><div>1<br />2<br />3<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br />11<br />12<br />13<br /></div></td><td><div class="python codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="kw1">from</span> pymongo <span class="kw1">import</span> Connection<br />
<br />
<span class="kw1">def</span> timeQueryByHeavyAtom<span class="br0">&#40;</span>natom<span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; conn = Connection<span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; db = conn.<span class="me1">chem</span><br />
&nbsp; &nbsp; coll = db.<span class="me1">mol2d</span><br />
&nbsp; &nbsp; o = <span class="kw2">open</span><span class="br0">&#40;</span><span class="st0">'time-natom.txt'</span>, <span class="st0">'w'</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; <span class="kw1">for</span> i <span class="kw1">in</span> natom:<br />
&nbsp; &nbsp; &nbsp; &nbsp; c = coll.<span class="me1">find</span><span class="br0">&#40;</span> <span class="br0">&#123;</span><span class="st0">'PUBCHEM_HEAVY_ATOM_COUNT'</span> : i<span class="br0">&#125;</span> <span class="br0">&#41;</span>.<span class="me1">explain</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; nresult = c<span class="br0">&#91;</span><span class="st0">'n'</span><span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; elapse = c<span class="br0">&#91;</span><span class="st0">'millis'</span><span class="br0">&#93;</span><br />
&nbsp; &nbsp; &nbsp; &nbsp; o.<span class="me1">write</span><span class="br0">&#40;</span><span class="st0">'%d<span class="es0">\t</span>%d<span class="es0">\t</span>%f<span class="es0">\n</span>'</span> <span class="sy0">%</span> <span class="br0">&#40;</span>i, nresult, elapse<span class="br0">&#41;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; o.<span class="me1">close</span><span class="br0">&#40;</span><span class="br0">&#41;</span></div></td></tr></tbody></table></div>
<p style="text-align: justify;">A summary of these queries is shown in the graphs below.</p>
<p style="text-align: justify;"><a href="http://blog.rguha.net/wp-content/uploads/2010/02/perf.png"><img class="aligncenter size-medium wp-image-477" title="Query performance for heavy atom counts" src="http://blog.rguha.net/wp-content/uploads/2010/02/perf-300x150.png" alt="" width="300" height="150" /></a></p>
<p style="text-align: justify;">One of the queries is anomalous - there are 93K molecules with 24 heavy atoms, but the query is performed in 139 ms. This might be due to priming while I was testing code.</p>
<h3><strong>Some thoughts</strong></h3>
<p style="text-align: justify;">One thing that was apparent from the little I&#8217;ve played with MongoDB is that it&#8217;s extremely easy to use. I&#8217;m sure that larger installs (say on a cluster) could be more complex, but for single user apps, setup is really trivial. Furthermore, basic operations like insertion and querying are extremely easy. The idea of being able to dump any type of data (as a document) without worrying whether it will fit into a pre-defined schema is a lot of fun.</p>
<p style="text-align: justify;">However, it&#8217;s advantages also seem to be its limitations (though this is not specific to MongoDB). This was also noted in a <a href="http://blog.rguha.net/?p=470#comment-5247">comment</a> on my <a hef="http://blog.rguha.net/?p=470">previous post</a>. It seems that MongoDB is very efficient for <i>simplistic queries</i>. One of the things that I haven&#8217;t properly worked out is whether this type of system makes sense for a molecule-centric database. The primary reason is that molecules can be referred by a variety of identifiers. For example, when searching PubChem, a query by CID is just one of the ways one might pull up data. As a result, any database holding this type of data will likely require multiple indices. So, why not stay with an RDBMS? Furthermore, in my previous post, I had mentioned that a cool feature would be able to dump molecules from arbitrary sources into the DB, without worrying about fields. While very handy when loading data, it does present some complexities at query time. How does one perform a query over <i>all</i> molecules? This can be addressed in multiple ways (registration etc.) but is essentially what must be done in an RDBMS scenario.</p>
<p style="text-align: justify;">Another thing that became apparent is the fact that MongoDB and its ilk <a href="http://stackoverflow.com/questions/1995216/join-operation-with-nosql">don&#8217;t support JOINs</a>. While the current example doesn&#8217;t really highlight this, it is trivial to consider adding, say, bioassay data and then querying both tables using a JOIN. In contrast, the NoSQL approach is to perform multiple queries and then do the join in your own code. This seems inelegant and a bit painful (at least for the types of applications that I work with).</p>
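<p style="text-align: justify;">To make the &#8220;join in your own code&#8221; idea concrete, here is a minimal sketch that pairs the results of two separate queries in plain Python. The function name <code>join_on_cid</code> and the assay field <code>cid</code> are hypothetical; no bioassay collection exists in the example above.</p>
<pre><code class="python">def join_on_cid(compounds, assays):
    """Pair each assay record with its compound document by CID.

    `compounds` and `assays` are lists of dicts, e.g. the results of
    two separate queries. Assays without a matching compound are
    dropped, which mimics an inner join.
    """
    by_cid = dict((c['_id'], c) for c in compounds)
    joined = []
    for a in assays:
        c = by_cid.get(a['cid'])
        if c is not None:
            joined.append((c, a))
    return joined
</code></pre>
<p style="text-align: justify;">This is roughly a hash join done by hand: one result set is loaded into a dict and the other is streamed against it.</p>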
<p style="text-align: justify;">Finally, one of my interests was to make use of the map/reduce functionality in MongoDB. However, it appears that such queries <strong>must</strong> be implemented in JavaScript. As a result, performing cheminformatics operations (using some other language or external libraries) within map or reduce functions is <a href="http://groups.google.com/group/mongodb-user/browse_thread/thread/17883e649cf6cafe">not currently possible</a>.</p>
<p style="text-align: justify;">But of course, NoSQL DB&#8217;s were not designed to replace RDBMS. Both technologies have their place, and I don&#8217;t believe that one is better than the other. Just that one might be better suited to a given application than the other.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=472</wfw:commentRss>
		</item>
		<item>
		<title>Cheminformatics and Non-Relational Datastores</title>
		<link>http://blog.rguha.net/?p=470</link>
		<comments>http://blog.rguha.net/?p=470#comments</comments>
		<pubDate>Thu, 04 Feb 2010 05:51:43 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[software]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[nosql]]></category>

		<category><![CDATA[performance]]></category>

		<category><![CDATA[rdbms]]></category>

		<category><![CDATA[registration]]></category>

		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=470</guid>
		<description><![CDATA[Over the past year or so I&#8217;ve been seeing a variety of non-relational data stores coming up.  They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS&#8217;s in that they do not require explicit schema defined a priori. While they do [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Over the past year or so I&#8217;ve been seeing a variety of <a href="http://en.wikipedia.org/wiki/NoSQL">non-relational data stores</a> coming up.  They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMS&#8217;s in that they do not require explicit schema defined <em>a priori</em>. While they do not offer transactional guarantees (<a href="http://en.wikipedia.org/wiki/ACID">ACID</a>) compared to RDBMS&#8217;s, they claim flexibility, speed and scalability. Examples include <a href="http://couchdb.apache.org/">CouchDB</a>, <a href="http://www.mongodb.org">MongoDB</a> and <a href="http://1978th.net/">Tokyo Cabinet</a>. <a href="http://plindenbaum.blogspot.com/2009/04/couchdb-for-bioinformatics-storing-snps.html">Pierre</a> and <a href="http://bcbio.wordpress.com/2009/05/10/evaluating-key-value-and-document-stores-for-short-read-data/">Brad</a> have described some examples of using CouchDB with bioinformatics data and <a href="http://depth-first.com/">Rich</a> has started a series on the <a href="http://depth-first.com/articles/2010/01/28/pubcouch-install-couchdb-on-ubuntu-karmic-from-source">use of CouchDB</a> to <a href="http://depth-first.com/articles/2010/01/20/pubcouch-a-couchdb-interface-to-pubchem">store PubChem data</a>.</p>
<p style="text-align: justify;">Having used RDBMS&#8217;s such as <a href="http://www.postgresql.org/">PostgreSQL</a> and <a href="http://www.oracle.com/index.html">Oracle</a> for some time, I&#8217;ve wondered how or why one might use these systems for cheminformatics applications. Rich&#8217;s posts describe how one might go about using CouchDB to store SD files, but it wasn&#8217;t clear to me what advantage it provided over say, PostgreSQL.</p>
<p style="text-align: justify;">I now realize that if you wanted to store <strong>arbitrary chemical data from multiple sources</strong> a document oriented database makes life significantly easier compared to a traditional RDBMS. While Rich&#8217;s <a href="http://depth-first.com/articles/2010/01/20/pubcouch-a-couchdb-interface-to-pubchem">post</a> considers SD files from PubChem (which will have the same set of SD tags), CouchDB and its ilk become really useful when one considers, say, SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It&#8217;s certainly doable via SQL (put each field name into a &#8216;dictionary&#8217; table, etc.) but it seems a little hacky.</p>
<p style="text-align: justify;">On the other hand, one could trivially transform an SD formatted structure to a JSON-like document and then dump that into CouchDB. In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data - assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schema for assays was not necessarily identical. But I now see that such a scenario is perfect for a document oriented database.</p>
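<p style="text-align: justify;">As a rough illustration of that transformation, the sketch below turns an identifier plus a dict of SD tags into a JSON document. The function name <code>to_document</code> is an assumption, the SD tags are taken as already parsed, and the use of <code>_id</code> as the identifier key follows the CouchDB convention.</p>
<pre><code class="python">import json

def to_document(identifier, sd_tags):
    """Serialize one molecule's SD tags as a JSON document.

    No schema is consulted: whatever tags the source supplied become
    the keys of the document.
    """
    doc = dict(sd_tags)
    doc['_id'] = identifier
    return json.dumps(doc, sort_keys=True)
</code></pre>
<p style="text-align: justify;">Two molecules from different vendors, each with its own set of tags, would go through the same function unchanged, which is the point of not having to update a schema.</p>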
<p style="text-align: justify;">However some questions still remain. Most fundamentally, <strong>how does not having a schema affect query performance</strong>? Thus if I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index. So such queries would be fast. But if the molecular weight is just a document property, it&#8217;s not clear that such a query would (or could) be very fast in a document oriented DB (would it require linear scans?). I note that I haven&#8217;t RTFM so I&#8217;d be happy to be corrected!</p>
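<p style="text-align: justify;">The difference between the two cases can be sketched in a few lines of Python. This is only a toy model of an index, not how any particular database implements one; the names <code>build_index</code> and <code>ids_below</code> and the field name <code>mw</code> are assumptions. Without the index, answering the molecular weight question means inspecting every document; with it, a binary search finds the cut-off position directly.</p>
<pre><code class="python">from bisect import bisect_left

def build_index(docs, field):
    """Build a sorted (keys, ids) index once; documents lacking the field are skipped."""
    pairs = sorted((d[field], d['_id']) for d in docs if field in d)
    keys = [k for k, _ in pairs]
    ids = [i for _, i in pairs]
    return keys, ids

def ids_below(index, limit):
    """IDs whose indexed value is strictly below `limit`, via binary search."""
    keys, ids = index
    # bisect_left gives the position of the first key that is not below limit
    return ids[:bisect_left(keys, limit)]
</code></pre>
<p style="text-align: justify;">Building the index costs a sort up front, but each later query only needs a logarithmic number of comparisons to locate the boundary.</p>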
<p style="text-align: justify;">However I&#8217;d expect that substructure search performance wouldn&#8217;t differ much between the two types of database systems. In fact, with the <a href="http://en.wikipedia.org/wiki/MapReduce">map/reduce</a> features of CouchDB and MongoDB, such searches could in fact be significantly faster (though Oracle is capable of <a href="http://www.orafaq.com/wiki/Parallel_Query_FAQ">parallel queries</a>).This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).</p>
<p style="text-align: justify;">So it looks like I&#8217;m going to have to play around and see how all this works.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=470</wfw:commentRss>
		</item>
		<item>
		<title>WPTouch - Painlessly Optimize Blogs for Mobile Devices</title>
		<link>http://blog.rguha.net/?p=468</link>
		<comments>http://blog.rguha.net/?p=468#comments</comments>
		<pubDate>Sun, 31 Jan 2010 18:30:19 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[software]]></category>

		<category><![CDATA[blog]]></category>

		<category><![CDATA[css]]></category>

		<category><![CDATA[mobile]]></category>

		<category><![CDATA[web]]></category>

		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=468</guid>
		<description><![CDATA[I came across WPTouch,  a WordPress plugin/theme that optimizes a blog&#8217;s appearance for fast loading on mobile devices such as the iPhone, Android and so on. It&#8217;s really trivial to install (just like any other plugin) and once done, browsing this site on my iPhone is really much nicer than viewing the full-fledged desktop [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I came across <a href="http://www.bravenewcode.com/products/wptouch/">WPTouch</a>,  a WordPress plugin/theme that optimizes a blog&#8217;s appearance for fast loading on mobile devices such as the iPhone, Android and so on. It&#8217;s really trivial to install (just like any other plugin) and once done, browsing this site on my iPhone is really much nicer than viewing the full-fledged desktop theme.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=468</wfw:commentRss>
		</item>
		<item>
		<title>When is a Bad Plate Bad?</title>
		<link>http://blog.rguha.net/?p=464</link>
		<comments>http://blog.rguha.net/?p=464#comments</comments>
		<pubDate>Fri, 29 Jan 2010 17:47:24 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[bioinformatics]]></category>

		<category><![CDATA[research]]></category>

		<category><![CDATA[visualization]]></category>

		<category><![CDATA[HTS]]></category>

		<category><![CDATA[quality]]></category>

		<category><![CDATA[rnai]]></category>

		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=464</guid>
		<description><![CDATA[When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be very good. That is, the data will be of poor quality due to a variety of reasons. Usually we can evaluate various statistical quality metrics to assess which plates [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be very good. That is, the data will be of poor quality due to a variety of reasons. Usually we can evaluate various <a href="http://dx.doi.org/10.1038/nbt1186">statistical quality metrics</a> to assess which plates are good and which ones need to be redone. A common metric is the <a href="http://en.wikipedia.org/wiki/Z-factor">Z-factor</a> which uses the positive and negative control wells. The problem is that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet, the plate could be used if we just mask those bad wells.</p>
<p style="text-align: justify;">Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to identify truly bad plates from plates that could be used. One approach is to move to other metrics - SSMD (<a href="http://jbx.sagepub.com/cgi/content/abstract/12/4/497">defined here</a> and applications to quality control <a href="http://jbx.sagepub.com/cgi/content/abstract/13/5/363">discussed here</a>) is regarded as more effective than Z-factor - and in fact it&#8217;s advisable to look at multiple metrics rather than depend on any single one.</p>
<p style="text-align: justify;">An alternative trick is to compare the Z-factor for a given plate to the <em>trimmed</em> Z-factor, which is evaluated using the <a href="http://en.wikipedia.org/wiki/Truncated_mean">trimmed mean</a> and trimmed standard deviation. In our set up we trim 10% of the positive and negative control wells. For a plate that appears to be poor due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say, the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be rapidly performed using a plot of the two values, as shown below. Given such a plot, we&#8217;d probably flag as truly bad those plates whose trimmed Z-factor is less than 0.5 and which lie close to the diagonal. (Though for RNAi screens, Z&#8217; = 0.5 might be too stringent).</p>
<p style="text-align: justify;">From the figure below, just looking at Z-factor would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we&#8217;d look at other statistics as well, but this is a quick way to identify plates with truly poor Z-factors.</p>
<div id="attachment_465" class="wp-caption aligncenter" style="width: 310px"><a href="http://blog.rguha.net/wp-content/uploads/2010/01/ztz.png"><img class="size-medium wp-image-465" title="ztz" src="http://blog.rguha.net/wp-content/uploads/2010/01/ztz-300x300.png" alt="A plot of Z-factor versus trimmed Z-factor for a set of 100 plates" width="300" height="300" /></a><p class="wp-caption-text">A plot of Z-factor versus trimmed Z-factor for a set of 100 plates</p></div>
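<p style="text-align: justify;">The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the code we use: it assumes that the 10% trim is applied to each tail of the sorted control wells, the function names are made up for this example, and the well values are invented.</p>

```python
from statistics import mean, stdev

def trim(values, frac=0.1):
    """Drop the lowest and highest `frac` of the values (assumption: per-tail trim)."""
    k = int(len(values) * frac)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k] if k > 0 else ordered

def z_factor(pos, neg):
    """Z-factor = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    return 1.0 - 3.0 * (stdev(pos) + stdev(neg)) / abs(mean(pos) - mean(neg))

def trimmed_z_factor(pos, neg, frac=0.1):
    """Z-factor computed after trimming both sets of control wells."""
    return z_factor(trim(pos, frac), trim(neg, frac))

# Invented control wells for one plate; the last negative control is a bad well
pos = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
neg = [10, 11, 9, 10, 12, 8, 10, 11, 9, 60]

print("Z-factor:         %.2f" % z_factor(pos, neg))
print("trimmed Z-factor: %.2f" % trimmed_z_factor(pos, neg))
```

<p style="text-align: justify;">For this invented plate the single bad negative control pulls the Z-factor below 0.5, while the trimmed value stays well above it, which is the pattern that places a plate far from the diagonal in the plot.</p>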
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=464</wfw:commentRss>
		</item>
		<item>
		<title>A GPL3 Oracle Cheminformatics Cartridge</title>
		<link>http://blog.rguha.net/?p=462</link>
		<comments>http://blog.rguha.net/?p=462#comments</comments>
		<pubDate>Sun, 24 Jan 2010 14:35:25 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[software]]></category>

		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[database]]></category>

		<category><![CDATA[oracle]]></category>

		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=462</guid>
		<description><![CDATA[Some time back I had mentioned a new cheminformatics toolkit, Indigo. Recently, Dmitry from SciTouch let me know that they had also developed Bingo, an Oracle cartridge based on Indigo, to perform cheminformatics operations in the database. This expands the current ecosystem of Open Source database cartridges (PGChem, MyChem, OrChem) which pretty much covers all the [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Some time back I had <a href="http://blog.rguha.net/?p=436">mentioned</a> a new cheminformatics toolkit, <a href="http://opensource.scitouch.net/indigo/">Indigo</a>. Recently, Dmitry from SciTouch let me know that they had also developed <a href="http://opensource.scitouch.net/indigo/bingo#download_and_install">Bingo</a>, an Oracle cartridge based on Indigo, to perform cheminformatics operations in the database. This expands the current ecosystem of Open Source database cartridges (<a href="http://pgfoundry.org/projects/pgchem/">PGChem</a>, <a href="http://mychem.sourceforge.net/">MyChem</a>, <a href="http://orchem.sourceforge.net/">OrChem</a>) which pretty much covers all the main RDBMSs (Postgres, MySQL and Oracle). SciTouch have also provided a <a href="http://groups.google.com/group/indigo-general/browse_thread/thread/6ebbfe6c5a7665bc">live instance</a> of their database and associated cartridge, so you can play with it without requiring a local Oracle install. (It&#8217;d be useful to provide some details of the hardware that the DB is running on, so that timing numbers get some context.)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=462</wfw:commentRss>
		</item>
		<item>
		<title>Slides from a Guest Lecture at Drexel University</title>
		<link>http://blog.rguha.net/?p=460</link>
		<comments>http://blog.rguha.net/?p=460#comments</comments>
		<pubDate>Sat, 05 Dec 2009 23:08:17 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[chemspider]]></category>

		<category><![CDATA[drexel]]></category>

		<category><![CDATA[fingerprint]]></category>

		<category><![CDATA[lecture]]></category>

		<category><![CDATA[search]]></category>

		<category><![CDATA[similarity]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=460</guid>
		<description><![CDATA[On Thursday I joined Antony Williams as a guest lecturer in Jean-Claude Bradley&#8217;s class on chemical information retrieval at Drexel University. Using a combination of WebEx and Skype, we were able to give our presentations - seamlessly joining three different locations. Technology is great! Tony gave an excellent talk on citizen science and ChemSpider and [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">On Thursday I joined <a href="http://www.chemspider.com/blog/">Antony Williams</a> as a guest lecturer in <a href="http://usefulchem.blogspot.com/">Jean-Claude Bradley</a>&#8217;s <a href="http://getcheminfo.wikispaces.com/">class</a> on chemical information retrieval at Drexel University. Using a combination of WebEx and Skype, we were able to give our presentations - seamlessly joining three different locations. Technology is great! Tony gave an excellent talk on <a href="http://www.chemspider.com/blog/a-presentation-to-students-at-drexel-university-via-webex-and-skype.html">citizen science and ChemSpider</a> and I spoke about <a href="http://www.slideshare.net/rguha/molecularrepresentation-similarityandsearch">similarity and searching</a>. Jean-Claude has also put up an <a href="http://www.scivee.tv/node/14791">audio version</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=460</wfw:commentRss>
		</item>
		<item>
		<title>Are Bioinformatics Results Too Good To Be True?</title>
		<link>http://blog.rguha.net/?p=457</link>
		<comments>http://blog.rguha.net/?p=457#comments</comments>
		<pubDate>Sun, 29 Nov 2009 15:27:11 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[Literature]]></category>

		<category><![CDATA[bioinformatics]]></category>

		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[bias]]></category>

		<category><![CDATA[significance]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=457</guid>
		<description><![CDATA[I came across an interesting paper by Anne-Laure Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue - &#8220;fishing for significance&#8221; and &#8220;publication bias&#8221;.
The former phenomenon is characterized by researchers identifying datasets on which their method works better than [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I came across an interesting <a href="http://dx.doi.org/10.1093/bioinformatics/btp648">paper</a> by Anne-Laure Boulesteix where she discusses the problem of false positive results being reported in the bioinformatics literature. She highlights two underlying phenomena that lead to this issue - &#8220;fishing for significance&#8221; and &#8220;publication bias&#8221;.</p>
<p style="text-align: justify;">The former phenomenon is characterized by researchers identifying datasets on which their method works better than others or where a new method is (unconsciously) optimized for a given set of datasets. Then there is also the issue of validation of new methodologies, where she notes</p>
<blockquote style="text-align: justify;"><p>&#8230; fitting a prediction model and estimating its error rate using the same training data set yields a downwardly biased error estimate commonly termed as &#8220;apparent error&#8221;. Validation on independent fresh data is an important component of all prediction studies&#8230;</p></blockquote>
<p>Boulesteix also points out that true, prospective validation is not always possible since the data may not be easily accessible or even available. She also notes that some of these problems could be mitigated by authors being very clear about the limitations and dataset assumptions they make. As I have been reading the microarray literature recently to help me with RNAi screening data, I have seen the problem firsthand. There are hundreds of papers on normalization techniques and gene selection methods. And each one claims to be better than the others. But in most cases, the improvements seem incremental. Is the difference really significant? It&#8217;s not always clear.</p>
<p>I&#8217;ll also note that this same problem is likely present in the cheminformatics literature. There are many papers which claim that their SVM (or some other algorithm) implementation does better than previous reports on modeling something or the other. Is a 5% improvement really that good? Is it significant? Luckily there are recent efforts, such as <a href="http://sampl.eyesopen.com/">SAMPL</a> and the <a href="http://www-jmg.ch.cam.ac.uk/data/solubility/">solubility challenge</a>, to address these issues in various areas of cheminformatics. Also, there is a nice and very simple <a href="http://dx.doi.org/10.1186/1471-2105-10-225">metric</a> recently developed to compare different methods (focusing on rankings generated by virtual screening methods).</p>
<p>The issue of publication bias also plays a role in this problem - negative results are difficult to publish and hence a researcher will try and find a positive spin on results that may not even be significant. For example, a well designed methodology paper will be difficult to publish if it cannot be shown to be better than other methods. One could get around such a rejection by cherry picking datasets (even when noting that such a dataset is cherry picked, it limits the utility of the paper in my opinion), or by avoiding comparisons with certain other methods. So while a researcher may end up with a paper, it&#8217;s more CV padding than an actual improvement in the state of the art.</p>
<p>But as Boulesteix notes, &#8220;<em>a negative aspect &#8230; may be counterbalanced by positive aspects</em>&#8221;. Thus even though a method might not provide better accuracy than other methods, it might be better suited for specific situations or may provide a new insight into the underlying problem or even highlight open questions.</p>
<p>While the observations in this paper are not new, they are well articulated and highlight the dangers that can arise from a publish-or-perish and positive-results-only system.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=457</wfw:commentRss>
		</item>
		<item>
		<title>New Version of rpubchem</title>
		<link>http://blog.rguha.net/?p=454</link>
		<comments>http://blog.rguha.net/?p=454#comments</comments>
		<pubDate>Sat, 21 Nov 2009 02:11:14 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[software]]></category>

		<category><![CDATA[CRAN]]></category>

		<category><![CDATA[pubchem]]></category>

		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=454</guid>
		<description><![CDATA[Version 1.4.3 of rpubchem is out on CRAN. There are some minor code cleanups and also a new function called get.aid.by.cid which allows you to get assay IDs based on whether they contain a compound (either as an active, inactive, discrepant or just tested). This uses PUG to perform the query, so can be a bit [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">Version 1.4.3 of <a href="http://cran.r-project.org/web/packages/rpubchem/index.html">rpubchem</a> is out on CRAN. There are some minor code cleanups and also a new function called <em>get.aid.by.cid</em> which allows you to get assay IDs based on whether they contain a compound (either as an active, inactive, discrepant or just tested). This uses <a href="http://pubchem.ncbi.nlm.nih.gov/pug/pughelp.html">PUG</a> to perform the query, so it can be a bit slow (and occasionally just fail).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=454</wfw:commentRss>
		</item>
		<item>
		<title>Frequency of a Term via PubMed</title>
		<link>http://blog.rguha.net/?p=443</link>
		<comments>http://blog.rguha.net/?p=443#comments</comments>
		<pubDate>Tue, 10 Nov 2009 23:50:16 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[software]]></category>

		<category><![CDATA[eutils]]></category>

		<category><![CDATA[pubmed]]></category>

		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=443</guid>
		<description><![CDATA[A little while back, Egon posted a question on FriendFeed, asking whether there was an easy way, preferably a service, to determine and plot the usage count of a term in PubMed by year. This is simple enough using the Entrez Utilities CGI. A quick Python script to do this (with minimal error checking) is [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">A little while back, <a href="http://chem-bla-ics.blogspot.com/">Egon</a> posted a <a href="http://friendfeed.com/egonw/7cc53733/is-there-easy-way-to-make-plot-of-usage-count-term">question</a> on FriendFeed, asking whether there was an easy way, preferably a service, to determine and plot the usage count of a term in <a href="http://www.ncbi.nlm.nih.gov/pubmed">PubMed</a> by year. This is simple enough using the <a href="http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html">Entrez Utilities</a> CGI. A quick Python script to do this (with minimal error checking) is given below. It&#8217;d be relatively trivial to wrap this as a <a href="http://www.modpython.org/">mod_python</a> application and generate a bar plot directly (either using Python or using one of the online charting APIs).</p>
<div class="codecolorer-container python " style="overflow:auto;white-space:nowrap;width:685px"><table cellspacing="0" cellpadding="0"><tbody><tr><td><div class="python codecolorer" style="font-family:Monaco,Lucida Console,monospace"><span class="kw1">import</span> <span class="kw3">urllib</span><br />
<span class="kw1">import</span> <span class="kw3">xml</span>.<span class="me1">etree</span>.<span class="me1">ElementTree</span> <span class="kw1">as</span> ET<br />
<br />
u = <span class="st0">&quot;http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=%s&amp;mindate=%d/01/01&amp;maxdate=%d/12/31&quot;</span><br />
term = <span class="st0">&quot;artemisinin resistance&quot;</span><br />
startYear = <span class="nu0">1998</span><br />
endYear = <span class="nu0">2009</span><br />
<span class="kw1">for</span> year <span class="kw1">in</span> <span class="kw2">range</span><span class="br0">&#40;</span>startYear, endYear+<span class="nu0">1</span><span class="br0">&#41;</span>:<br />
&nbsp; &nbsp; url = u <span class="sy0">%</span> <span class="br0">&#40;</span>term.<span class="me1">replace</span><span class="br0">&#40;</span><span class="st0">&quot; &quot;</span>, <span class="st0">&quot;+&quot;</span><span class="br0">&#41;</span>, year, year<span class="br0">&#41;</span><br />
&nbsp; &nbsp; page = <span class="kw3">urllib</span>.<span class="me1">urlopen</span><span class="br0">&#40;</span>url<span class="br0">&#41;</span>.<span class="me1">read</span><span class="br0">&#40;</span><span class="br0">&#41;</span><br />
&nbsp; &nbsp; doc = ET.<span class="me1">XML</span><span class="br0">&#40;</span>page<span class="br0">&#41;</span><br />
&nbsp; &nbsp; count = doc.<span class="me1">find</span><span class="br0">&#40;</span><span class="st0">&quot;Count&quot;</span><span class="br0">&#41;</span>.<span class="me1">text</span><br />
&nbsp; &nbsp; <span class="kw1">print</span> year, count</div></td></tr></tbody></table></div>
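<p style="text-align: justify;">The script above is written for Python 2 (<i>urllib.urlopen</i> and the <i>print</i> statement). A rough Python 3 sketch of the same idea follows. The function names are made up for this example, the endpoint and parameters are simply copied from the script above, and URL construction and XML parsing are split out so they can be checked without hitting NCBI.</p>

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint copied from the original script; NCBI may now expect https
BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_url(term, year):
    """Build an esearch URL restricted to a single calendar year."""
    params = {
        "term": term,
        "mindate": "%d/01/01" % year,
        "maxdate": "%d/12/31" % year,
    }
    # urlencode takes care of spaces and other special characters
    return BASE + "?" + urllib.parse.urlencode(params)

def parse_count(xml_text):
    """Pull the hit count out of an esearch XML response."""
    return int(ET.fromstring(xml_text).find("Count").text)

def usage_by_year(term, start_year, end_year):
    """Return {year: count} for the term; makes one HTTP request per year."""
    counts = {}
    for year in range(start_year, end_year + 1):
        with urllib.request.urlopen(build_url(term, year)) as resp:
            counts[year] = parse_count(resp.read())
    return counts
```

<p style="text-align: justify;">Calling <i>usage_by_year("artemisinin resistance", 1998, 2009)</i> should give the same per-year counts that the original loop prints, as a dictionary, assuming the service still answers in the same format.</p>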
<h3><b>Update 1</b></h3>
<p style="text-align: justify;">A little more hacking and the above code was converted to a mod_python application, which can be accessed using a URL of the form <i>http://rest.rguha.net/usage/usage.py?term=TERM&#038;syear=1997&#038;eyear=2009</i>. With the help of the handy <a href="http://pygooglechart.slowchop.com/">pygooglechart</a> module, the above URL returns an <i>&lt;img&gt;</i> tag containing the appropriate <a href="http://code.google.com/apis/chart/">Google Charts</a> URL. As an example, the term &#8220;artemisinin resistance&#8221; results in this <a href="http://rest.rguha.net/usage/usage.py?term=artemisinin+resistance&#038;syear=1997&#038;eyear=2009">image</a>.</p>
<h3><b>Update 2</b></h3>
<p style="text-align: justify;"><a href="http://www.lumc.nl/rep/cod/redirect/1060/persoonlijke%20pagina/jan.html">Jan Schoones</a> pointed out in a <a href="http://blog.rguha.net/?p=443#comment-3325">comment</a> that my artemisinin resistance example was slightly incorrect, as the resultant PubMed search does not search for the exact phrase, but rather, looks for documents that contain the words &#8220;artemisinin&#8221; and &#8220;resistance&#8221;. This is because the example URL does not include the quotes around the phrase. A more correct example would be <a href='http://rest.rguha.net/usage/usage.py?term="artemisinin+resistance"&#038;syear=1997&#038;eyear=2009'>here</a>, where we search for the <i>phrase</i>, rather than individual words.</p>
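<p style="text-align: justify;">To make the difference concrete, here is a small Python 3 sketch of how the search term could be encoded with and without the surrounding quotes. The helper name <i>encode_term</i> is made up for this example and is not part of the service.</p>

```python
from urllib.parse import quote_plus

def encode_term(term, exact_phrase=False):
    """URL-encode a search term, optionally wrapping it in double quotes
    so that it is sent as a phrase rather than as separate words."""
    if exact_phrase:
        term = '"%s"' % term
    return quote_plus(term)

# Sent as two separate words
print(encode_term("artemisinin resistance"))
# Sent as a quoted phrase; the quotes are encoded as %22
print(encode_term("artemisinin resistance", exact_phrase=True))
```

<p style="text-align: justify;">The only difference between the two encoded terms is the pair of encoded double quotes, which is what asks for the phrase rather than the individual words.</p>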
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=443</wfw:commentRss>
		</item>
		<item>
		<title>Updated Versions of R Packages</title>
		<link>http://blog.rguha.net/?p=441</link>
		<comments>http://blog.rguha.net/?p=441#comments</comments>
		<pubDate>Fri, 06 Nov 2009 01:04:11 +0000</pubDate>
		<dc:creator>Rajarshi Guha</dc:creator>
		
		<category><![CDATA[cheminformatics]]></category>

		<category><![CDATA[software]]></category>

		<category><![CDATA[CRAN]]></category>

		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://blog.rguha.net/?p=441</guid>
		<description><![CDATA[New versions of several of my R packages are now available on CRAN. rcdk 2.9.6 goes along with rcdklibs 1.2.3. The latter now uses the most recent cdk-1.2.x branch from Github. The former fixes a number of bugs relating to descriptor calculations, saving molecules in SD format and setting/getting properties on molecules. Unfortunately, because the [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">New versions of several of my R packages are now available on CRAN. <a href="http://cran.r-project.org/web/packages/rcdk/index.html">rcdk 2.9.6</a> goes along with <a href="http://cran.r-project.org/web/packages/rcdklibs/index.html">rcdklibs 1.2.3</a>. The latter now uses the most recent <a href="http://github.com/egonw/cdk/tree/cdk-1.2.x">cdk-1.2.x branch</a> from Github. The former fixes a number of bugs relating to descriptor calculations, saving molecules in SD format and setting/getting properties on molecules. Unfortunately, because the 1.2.x branch does not have robust depiction code, the visualization methods in rcdk are currently disabled. The <a href="http://cran.r-project.org/web/packages/fingerprint/index.html">fingerprint</a> package has also been updated and now includes a number of unit tests.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rguha.net/?feed=rss2&amp;p=441</wfw:commentRss>
		</item>
	</channel>
</rss>
