<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments for So much to do, so little time</title>
	<atom:link href="http://blog.rguha.net/?feed=comments-rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.rguha.net</link>
	<description>Trying to squeeze sense out of chemical data</description>
	<pubDate>Tue, 09 Feb 2010 09:10:33 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
		<item>
		<title>Comment on Molecules &#038; MongoDB - Numbers and Thoughts by Rajarshi Guha</title>
		<link>http://blog.rguha.net/?p=472#comment-5297</link>
		<dc:creator>Rajarshi Guha</dc:creator>
		<pubDate>Mon, 08 Feb 2010 04:17:26 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=472#comment-5297</guid>
		<description>Neil, thanks for the detailed comment.

Regarding chaining, I've seen that MongoDB supports queries of a certain complexity. But as far as I can see, the complexity is primarily restricted to AND clauses. I haven't seen an example of OR in the MongoDB docs (though maybe just throwing multiple queries is the best way).

On your comment about joins - I see your point. In the end, given that NoSQL does away with RDBMS semantics such as normalization, it seems that duplication is acceptable (even preferred?). However, even with duplication, it seems that you still have to define duplicate entities a priori, rather than on the fly.

Good point about non-relational thinking when working with these systems.</description>
		<content:encoded><![CDATA[<p>Neil, thanks for the detailed comment.</p>
<p>Regarding chaining, I&#8217;ve seen that MongoDB supports queries of a certain complexity. But as far as I can see, the complexity is primarily restricted to AND clauses. I haven&#8217;t seen an example of OR in the MongoDB docs (though maybe just throwing multiple queries is the best way).</p>
<p>On your comment about joins - I see your point. In the end, given that NoSQL does away with RDBMS semantics such as normalization, it seems that duplication is acceptable (even preferred?). However, even with duplication, it seems that you still have to define duplicate entities a priori, rather than on the fly.</p>
<p>Good point about non-relational thinking when working with these systems.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Neil</title>
		<link>http://blog.rguha.net/?p=470#comment-5296</link>
		<dc:creator>Neil</dc:creator>
		<pubDate>Mon, 08 Feb 2010 04:02:28 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5296</guid>
		<description>Joerg, documents in MongoDB are stored in a format named BSON, or "binary JSON". It's a binary representation of a JSON data structure, so it is optimized and fast (and limited to 4MB per document). No indexed/zipped text files!</description>
		<content:encoded><![CDATA[<p>Joerg, documents in MongoDB are stored in a format named BSON, or &#8220;binary JSON&#8221;. It&#8217;s a binary representation of a JSON data structure, so it is optimized and fast (and limited to 4MB per document). No indexed/zipped text files!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Molecules &#038; MongoDB - Numbers and Thoughts by Neil</title>
		<link>http://blog.rguha.net/?p=472#comment-5295</link>
		<dc:creator>Neil</dc:creator>
		<pubDate>Mon, 08 Feb 2010 03:54:39 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=472#comment-5295</guid>
		<description>Nice summary of MongoDB strengths and weaknesses. I had a few thoughts.

First, with regard to complex queries, the mongoid Ruby library is very good (mongoid.org). It uses a DSL called Criteria, which allows method chaining. To use one of their examples, you can write queries like "Person.only(:first_name).where("phones.country_code" =&#62; 1).in(:last_name =&#62; ["Vicious"])". Perhaps Python has something similar.

With regard to joins, the question is always whether to embed related data inside a document (as an array/hash-type structure), or link to it in a separate collection. The MongoDB website has quite a lot of discussion on this topic. You might, for example, store the IDs from one collection in an array in another collection as a quick way to join the two.

I agree with your point about just "dumping" data, as opposed to requiring some schema for queries. I find the nice thing about MongoDB is that you can dump first, then think about it later. So you can inspect the attributes of an object, decide which ones look useful, index them and add them to the object model as you see fit. Again, I find that the Ruby mappers (mongomapper and mongoid) work very well for this.

I find the map-reduce barrier quite high too. Some of the libraries let you write queries (e.g. the basic Ruby mongo driver has a map_reduce method), but you still have to write the functions in JavaScript. Perhaps someone will come up with a DSL, a bit like RJS in Rails, which allows queries in the native syntax of your language of choice.

I think the most important thing when starting out is to stop thinking in an RDBMS way. MongoDB is maybe more confusing than others because it has relational-like aspects, but I find it's best to ignore them and concentrate on good document design.</description>
		<content:encoded><![CDATA[<p>Nice summary of MongoDB strengths and weaknesses. I had a few thoughts.</p>
<p>First, with regard to complex queries, the mongoid Ruby library is very good (mongoid.org). It uses a DSL called Criteria, which allows method chaining. To use one of their examples, you can write queries like &#8220;Person.only(:first_name).where(&quot;phones.country_code&quot; =&gt; 1).in(:last_name =&gt; [&quot;Vicious&quot;])&#8221;. Perhaps Python has something similar.</p>
<p>With regard to joins, the question is always whether to embed related data inside a document (as an array/hash-type structure), or link to it in a separate collection. The MongoDB website has quite a lot of discussion on this topic. You might, for example, store the IDs from one collection in an array in another collection as a quick way to join the two.</p>
<p>I agree with your point about just &#8220;dumping&#8221; data, as opposed to requiring some schema for queries. I find the nice thing about MongoDB is that you can dump first, then think about it later. So you can inspect the attributes of an object, decide which ones look useful, index them and add them to the object model as you see fit. Again, I find that the Ruby mappers (mongomapper and mongoid) work very well for this.</p>
<p>I find the map-reduce barrier quite high too. Some of the libraries let you write queries (e.g. the basic Ruby mongo driver has a map_reduce method), but you still have to write the functions in JavaScript. Perhaps someone will come up with a DSL, a bit like RJS in Rails, which allows queries in the native syntax of your language of choice.</p>
<p>I think the most important thing when starting out is to stop thinking in an RDBMS way. MongoDB is maybe more confusing than others because it has relational-like aspects, but I find it&#8217;s best to ignore them and concentrate on good document design.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Molecules &#38; MongoDB - Numbers and Thoughts at So much to do, so little time</title>
		<link>http://blog.rguha.net/?p=470#comment-5294</link>
		<dc:creator>Molecules &#38; MongoDB - Numbers and Thoughts at So much to do, so little time</dc:creator>
		<pubDate>Mon, 08 Feb 2010 02:18:34 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5294</guid>
		<description>[...] my previous post I had mentioned that key/value or non-relational data stores could be useful in certain [...]</description>
		<content:encoded><![CDATA[<p>[...] my previous post I had mentioned that key/value or non-relational data stores could be useful in certain [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Ernst-Georg Schmid</title>
		<link>http://blog.rguha.net/?p=470#comment-5247</link>
		<dc:creator>Ernst-Georg Schmid</dc:creator>
		<pubDate>Fri, 05 Feb 2010 12:14:30 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5247</guid>
		<description>One thing to consider when deciding between RDBMS and a no-SQL database is reporting. Especially in corporate environments.

1.) Most off-the-shelf reporting tools expect an SQL interface. Period.

2.) While key-value stores and the like can compete with or outrun an RDBMS in straight searches or transactions, I wonder what happens when you need complex joins or projections?

It's all raw power and no finesse. Often raw power is just enough. But often you need some finesse also.</description>
		<content:encoded><![CDATA[<p>One thing to consider when deciding between RDBMS and a no-SQL database is reporting. Especially in corporate environments.</p>
<p>1.) Most off-the-shelf reporting tools expect an SQL interface. Period.</p>
<p>2.) While key-value stores and the like can compete with or outrun an RDBMS in straight searches or transactions, I wonder what happens when you need complex joins or projections?</p>
<p>It&#8217;s all raw power and no finesse. Often raw power is just enough. But often you need some finesse also.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Joerg Kurt Wegner</title>
		<link>http://blog.rguha.net/?p=470#comment-5236</link>
		<dc:creator>Joerg Kurt Wegner</dc:creator>
		<pubDate>Thu, 04 Feb 2010 18:45:28 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5236</guid>
		<description>Excellent thoughts! I am curious to see some performance numbers for a couple of million documents. Besides, in which format are the documents typically stored? My last experience with hundreds of thousands of text files caused a server crash due to file system indexing saturation. On the other hand, if files are bundled into a zip file, then indexing (direct jumps to data entries) breaks down. Any experience with scalability and the number of possible documents?</description>
		<content:encoded><![CDATA[<p>Excellent thoughts! I am curious to see some performance numbers for a couple of million documents. Besides, in which format are the documents typically stored? My last experience with hundreds of thousands of text files caused a server crash due to file system indexing saturation. On the other hand, if files are bundled into a zip file, then indexing (direct jumps to data entries) breaks down. Any experience with scalability and the number of possible documents?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Ernst-Georg Schmid</title>
		<link>http://blog.rguha.net/?p=470#comment-5229</link>
		<dc:creator>Ernst-Georg Schmid</dc:creator>
		<pubDate>Thu, 04 Feb 2010 09:11:51 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5229</guid>
		<description>I'm currently playing around with Prevayler http://www.prevayler.org/ combined with MX or the CDK for handling chemical data.

At the moment I can say that it works as expected at least with smaller datasets (~20000 structures), is easier to program (only Java, no SQL) and a lot (10x-100x) faster when searching.

The main drawback is that you need enough memory to keep all your business objects in RAM and must implement efficient query strategies yourself. No optimizer will help you in Prevaylerland.</description>
		<content:encoded><![CDATA[<p>I&#8217;m currently playing around with Prevayler <a href="http://www.prevayler.org/" rel="nofollow">http://www.prevayler.org/</a> combined with MX or the CDK for handling chemical data.</p>
<p>At the moment I can say that it works as expected at least with smaller datasets (~20000 structures), is easier to program (only Java, no SQL) and a lot (10x-100x) faster when searching.</p>
<p>The main drawback is that you need enough memory to keep all your business objects in RAM and must implement efficient query strategies yourself. No optimizer will help you in Prevaylerland.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on A GPL3 Oracle Cheminformatics Cartridge by Ernst-Georg Schmid</title>
		<link>http://blog.rguha.net/?p=462#comment-5228</link>
		<dc:creator>Ernst-Georg Schmid</dc:creator>
		<pubDate>Thu, 04 Feb 2010 09:01:03 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=462#comment-5228</guid>
		<description>&#62;Oracle has a special process called extproc, which loads the cartridge DLL and communicates with Oracle via TCP or IPC.

Point taken, you're absolutely right.

But you can still generate ORA-600s with careless code in the extproc listener or pollute the query cache with broken prepared statements - which technically does not crash the server but still renders it unusable. :-)</description>
		<content:encoded><![CDATA[<p>&gt;Oracle has a special process called extproc, which loads the cartridge DLL and communicates with Oracle via TCP or IPC.</p>
<p>Point taken, you&#8217;re absolutely right.</p>
<p>But you can still generate ORA-600s with careless code in the extproc listener or pollute the query cache with broken prepared statements - which technically does not crash the server but still renders it unusable. <img src='http://blog.rguha.net/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Neil</title>
		<link>http://blog.rguha.net/?p=470#comment-5223</link>
		<dc:creator>Neil</dc:creator>
		<pubDate>Thu, 04 Feb 2010 06:30:38 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5223</guid>
		<description>Andrew is right - for MongoDB, the "_id" key is indexed by default, so query by document ID is fast (and _id is always returned by all queries). For any other key, you add the index yourself. Composite keys, no problem.</description>
		<content:encoded><![CDATA[<p>Andrew is right - for MongoDB, the &#8220;_id&#8221; key is indexed by default, so query by document ID is fast (and _id is always returned by all queries). For any other key, you add the index yourself. Composite keys, no problem.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Cheminformatics and Non-Relational Datastores by Rajarshi Guha</title>
		<link>http://blog.rguha.net/?p=470#comment-5222</link>
		<dc:creator>Rajarshi Guha</dc:creator>
		<pubDate>Thu, 04 Feb 2010 06:27:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rguha.net/?p=470#comment-5222</guid>
		<description>Thanks for the pointers. Happy to see that indexing is an integral feature</description>
		<content:encoded><![CDATA[<p>Thanks for the pointers. Happy to see that indexing is an integral feature</p>
]]></content:encoded>
	</item>
</channel>
</rss>
