Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections isn’t always feasible – in many cases, it’s […]
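To make the quadratic point concrete, here is a toy sketch (in Python rather than the R code the post discusses): every fingerprint gets compared against every other one, so N structures mean roughly N²/2 similarity evaluations. The integer "fingerprints" below are made-up stand-ins for real molecular fingerprints.

```python
# A sketch of why a pairwise similarity matrix is quadratic: every fingerprint
# is compared with every other one. The fingerprints are plain Python ints
# treated as bit vectors -- illustrative stand-ins, not real fingerprints.
def tanimoto(fp1, fp2):
    # Tanimoto = |A & B| / |A | B| over the set bits of the two fingerprints.
    both = bin(fp1 & fp2).count("1")
    either = bin(fp1 | fp2).count("1")
    return both / either if either else 0.0

def similarity_matrix(fps):
    n = len(fps)
    # N*(N-1)/2 unique pairs -> O(N^2) work, which is what makes large
    # collections painful no matter how fast each single comparison is.
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = tanimoto(fps[i], fps[j])
    return sim

if __name__ == "__main__":
    fps = [0b101100, 0b100101, 0b111000, 0b000111]   # toy 6-bit fingerprints
    for row in similarity_matrix(fps):
        print([round(x, 2) for x in row])
```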
Pig and Cheminformatics
Pig is a platform for analyzing large datasets. At its core is a high-level language (called Pig Latin) focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure in either local or map/reduce mode (the latter making use of Hadoop). Previously I had […]
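As a rough illustration of the dataflow style the post describes, the sketch below chains load, filter, group, and aggregate steps in plain Python. It is not Pig Latin, and the record fields are invented for illustration; it only mirrors the kind of transformation series a Pig Latin script would declare.

```python
# Plain-Python sketch of a Pig-style dataflow: LOAD -> FILTER -> GROUP -> aggregate.
# The records and field names are made up for illustration.
from itertools import groupby
from operator import itemgetter

# "LOAD": pretend these rows came off disk as (compound_id, element, count).
records = [
    ("CID1", "C", 6), ("CID1", "O", 1),
    ("CID2", "C", 9), ("CID2", "N", 2), ("CID2", "O", 3),
]

# "FILTER": keep only the carbon rows.
carbons = [r for r in records if r[1] == "C"]

# "GROUP BY" compound_id, then generate a sum per group.
carbons.sort(key=itemgetter(0))
totals = {cid: sum(count for _, _, count in rows)
          for cid, rows in groupby(carbons, key=itemgetter(0))}

print(totals)   # {'CID1': 6, 'CID2': 9}
```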
Hadoop and Atom Counting
Over the past few months I’ve been hacking together scripts to distribute data-parallel jobs. However, it’s always nice when somebody else has done the work. In this case, Hadoop is an implementation of the map/reduce framework from Google. As Yahoo and others have shown, it’s an extremely scalable framework, and when coupled with Amazon’s […]
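As a sketch of what atom counting looks like in the map/reduce mold, here is a Hadoop Streaming-style script in Python. The script name, the one-SMILES-per-line input layout, and the deliberately crude element tokenizer are all assumptions for illustration, not the post’s actual code.

```python
#!/usr/bin/env python
# atomcount.py -- a Hadoop Streaming-style sketch of atom counting.
# Run with "map" or "reduce" as the only argument; Hadoop Streaming would use
# the same file as both the -mapper and the -reducer command.
import re
import sys

# Very rough element pattern: two-letter halogens first, then a few common
# organic-subset symbols (lower case covers aromatic atoms). Bracket atoms,
# charges, etc. are ignored -- this is only meant to illustrate the map step.
ELEMENT = re.compile(r"Cl|Br|[BCNOSPFIbcnops]")

def map_step(stream):
    # One SMILES per line -> emit "element<TAB>1" for every symbol found.
    for line in stream:
        for sym in ELEMENT.findall(line.strip()):
            print(f"{sym.capitalize()}\t1")

def reduce_step(stream):
    # Input arrives sorted by key, so a running total per element suffices.
    current, total = None, 0
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        if not key:
            continue
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    (map_step if sys.argv[1] == "map" else reduce_step)(sys.stdin)
```

The appeal of the framework is that the shuffle and sort between the two steps, along with distribution across machines, is Hadoop’s problem rather than yours.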
Easy Parallelization With Clojure
As I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. Now, it provides some fancy features that allow one to write complex parallel programs. I’m certainly no expert on that topic. However, one thing that I do every day is perform operations on elements of a […]
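The everyday pattern being described, applying a function to each element of a collection in parallel, looks roughly like the following in Python, using multiprocessing.Pool as a stand-in for Clojure’s pmap. The heavy_calculation function is a hypothetical placeholder for whatever per-element work the real job does.

```python
# Parallel "map a function over a collection" in Python, analogous to pmap.
# heavy_calculation is a placeholder for an expensive, independent computation.
from multiprocessing import Pool

def heavy_calculation(x):
    # Stand-in for real per-element work.
    return x * x

if __name__ == "__main__":
    items = range(1000)
    with Pool() as pool:                  # defaults to one worker per CPU core
        results = pool.map(heavy_calculation, items)
    print(results[:5])
```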
Multi-threaded Database Access with Python
Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching, and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single […]
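One common way to structure this kind of multi-threaded access in Python is a fixed pool of worker threads, each holding its own connection and pulling query parameters off a shared queue, as sketched below with psycopg2. The DSN, table, and column names are hypothetical placeholders rather than the actual Pub3D schema. Because each thread spends most of its time waiting on the database server, the threads overlap usefully despite Python’s GIL.

```python
# Sketch: a fixed pool of threads, one Postgres connection each, fed by a queue.
# DSN, table, and column names are placeholders, not the real Pub3D schema.
import queue
import threading

import psycopg2

DSN = "dbname=pub3d user=reader host=localhost"   # placeholder connection string
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    conn = psycopg2.connect(DSN)          # one connection per thread
    cur = conn.cursor()
    while True:
        cid = tasks.get()
        if cid is None:                   # sentinel: no more work
            break
        cur.execute("SELECT cid, shape_moments FROM pub3d WHERE cid = %s", (cid,))
        rows = cur.fetchall()
        with results_lock:                # results list is shared, so lock it
            results.extend(rows)
    conn.close()

if __name__ == "__main__":
    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for cid in [1, 2, 3, 4, 5]:           # whatever compound ids need fetching
        tasks.put(cid)
    for _ in threads:
        tasks.put(None)                   # one sentinel per worker
    for t in threads:
        t.join()
    print(len(results), "rows fetched")
```

Giving each worker its own connection keeps the threads independent; only the shared results list needs a lock.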