So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘parallel’ tag

Similarity Matrices in Parallel


Today I got an email asking whether it’d be possible to speed up a fingerprint similarity matrix calculation in R. Now, pairwise similarity matrix calculations (whether they’re for molecules or sequences or anything else) are by definition quadratic in nature. So performing these calculations for large collections isn’t always feasible – in many cases, it’s worthwhile to rethink the problem.

But for those situations where you do need to evaluate it, a simple way to parallelize the calculation is to evaluate the similarity of each molecule with all the rest in parallel. This means each process/thread must have access to the entire set of fingerprints, so again, for very large collections, this is not always practical. However, for small collections parallel evaluation can lead to speedups.

The fingerprint package provides a method to directly compute the similarity matrix for a set of fingerprints, but this is implemented in interpreted R and so is not very fast. Given a list of fingerprints, a manual evaluation of the similarity matrix can be done using nested lapply calls:

library(fingerprint)
sims <- lapply(fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})

For 1012 fingerprints, this takes 286s on my Macbook Pro (4GB, 2.4 GHz). Using snow, we can convert this to a parallel version, which takes 172s on two cores:

library(fingerprint)
library(snow)
cl <- makeCluster(4, type = "SOCK")
clusterEvalQ(cl, library(fingerprint))
clusterExport(cl, "fps")
sim <- parLapply(cl, fps, function(x) {
  unlist(lapply(fps, function(y) distance(x,y)))
})
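
In both cases the result is a list of numeric vectors, so do.call(rbind, sim) will collapse it into an actual matrix, and with snow it’s worth calling stopCluster(cl) once the calculation is done.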

Written by Rajarshi Guha

December 2nd, 2010 at 1:03 am

Pig and Cheminformatics


Pig is a platform for analyzing large datasets. At its core is a high-level language (called Pig Latin) that is focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure in either local or map/reduce modes (the latter making use of Hadoop).

Previously I had investigated Hadoop for running cheminformatics tasks such as SMARTS matching and pharmacophore searching. While the implementation of such code is pretty straightforward, it’s still pretty heavyweight compared to, say, performing SMARTS matching in a database via SQL. On the other hand, being able to perform these tasks in Pig Latin lets us write much simpler code that can be integrated with other non-cheminformatics code in a flexible manner. An example of a Pig Latin script that we might want to execute is:

A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

The script loads a file containing SMILES strings, then filters the entries that match the specified SMARTS pattern and writes out the matching SMILES to an output file. Clearly, very similar to SQL. However, the above won’t work on a default Pig installation since SMATCH is not a builtin function. Instead we need to look at a user-defined function (UDF).

UDFs are implemented in Java and can be classified into one of three types: eval, aggregate or filter functions. For this example I’ll consider a filter function that takes two strings, representing a SMILES string and a SMARTS pattern, and returns true if the SMILES contains the specified pattern.

package net.rguha.dc.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

public class SMATCH extends FilterFunc {
    // Set up the SMARTS matcher once, with a dummy pattern; the real pattern
    // is supplied in exec()
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C");
        } catch (CDKException e) {
            System.out.println(e);
        }
    }
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            sqt.setSmarts(query);
            IAtomContainer mol = sp.parseSmiles(target);
            return sqt.matches(mol);
        } catch (CDKException e) {
            throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
        }
    }
}

A UDF for filtering must extend FilterFunc, which specifies a single method, exec. Within this method, we check whether we have the requisite number of input arguments and, if so, simply return the value of the SMARTS match. For more details on filter functions see the UDF manual.

One of the key features of the code is the static initialization of the SMILES parser and SMARTS matcher. I’m not entirely sure how many times the UDF is instantiated during a query (once for each “row”? Once for the entire query?) – but if it’s more than once, we don’t want to instantiate the parser and matcher in the exec function. Note that since Hadoop is not a multithreaded model, we don’t need to worry about the lack of thread safety in the CDK.
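
If a single UDF instance does end up handling many rows, one cheap refinement would be to skip recompiling the SMARTS pattern when consecutive rows carry the same query string. The variant below is a minimal, illustrative sketch (not the code from this post): it uses the same package and imports as SMATCH above, and the class name and the lastQuery field are mine.

public class SMATCHCached extends FilterFunc {
    // Same static setup as SMATCH: dummy pattern at construction time
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C");
        } catch (CDKException e) {
            System.out.println(e);
        }
    }
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());
    static String lastQuery = null;   // the last SMARTS pattern we compiled

    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            if (!query.equals(lastQuery)) {   // recompile only when the pattern changes
                sqt.setSmarts(query);
                lastQuery = query;
            }
            IAtomContainer mol = sp.parseSmiles(target);
            return sqt.matches(mol);
        } catch (CDKException e) {
            throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
        }
    }
}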

Compiling the above class and packaging it into a jar file allows us to run the above Pig Latin script (you’ll have to register the jar file at the beginning of the script by writing register /path/to/myudf.jar) from the command line:

$ pig-0.4.0/bin/pig -x local match.pig # runs in local mode
2010-09-09 20:37:00,107 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/rguha/src/java/hadoop-0.18.3/pig_1284079020107.log
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/Users/rguha/src/java/hadoop-0.18.3/output.txt"
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 9
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

As with my previous Hadoop code, the above UDF can be deployed anywhere that Hadoop and HDFS are installed – such as Amazon EC2. The code in this post (and for other Pig UDFs) is available from my repository (in the v18 branch) and is based on Pig 0.4.0 – which is pretty old, but is required to work with Hadoop 0.18.3.

Written by Rajarshi Guha

September 10th, 2010 at 2:21 am

Hadoop and Atom Counting


Over the past few months I’ve been hacking together scripts to distribute data-parallel jobs. However, it’s always nice when somebody else has done the work. In this case, Hadoop is an implementation of the map/reduce framework from Google. As Yahoo and others have shown, it’s an extremely scalable framework, and when coupled with Amazon’s EC2 it’s an extremely powerful system for processing large datasets.

I’ve been hearing a lot about Hadoop from my brother, who is working on linking R with Hadoop, and I thought that this would be a good time to try it out for myself. So the first task was to convert the canonical word-counting example to something closer to my interests – counting the occurrences of elements in a collection of SMILES. This is a relatively easy example, since SMILES files are line oriented, so it’s simply a matter of reworking the WordCount example that comes with the Hadoop distribution.

For now, I run Hadoop 0.20.0 on my Macbook Pro, following these instructions on setting up a single-node Hadoop system. I also put the bin/ directory of the Hadoop distribution in my PATH. The code employs the CDK to parse a SMILES string and identify each element.

package net.rguha.dc;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.InvalidSmilesException;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;

import java.io.IOException;

public class HeavyAtomCount {
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            try {
                IAtomContainer molecule = sp.parseSmiles(value.toString());
                for (IAtom atom : molecule.atoms()) {
                    word.set(atom.getSymbol());
                    context.write(word, one);
                }
            } catch (InvalidSmilesException e) {
                // do nothing for now
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(HeavyAtomCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This is compiled in the usual manner and converted to a jar file. For some reason, Hadoop wouldn’t run the class unless the CDK classes were also included in the jar file (i.e., the -libjars argument didn’t seem to let me specify the CDK libraries separately from my code). So the end result was to include the whole CDK in my Hadoop program jar.

OK, the next thing was to create an input file. I extracted 10,000 SMILES from PubChem and copied them into my local HDFS by

hadoop dfs -copyFromLocal ~/Downloads/pubchem.smi input.smi

Then running my program is simply

hadoop jar rghadoop.jar input.smi output.smi

There’s quite a bit of output, though the interesting portion is

09/05/04 14:45:58 INFO mapred.JobClient: Counters: 17
09/05/04 14:45:58 INFO mapred.JobClient:   Job Counters
09/05/04 14:45:58 INFO mapred.JobClient:     Launched reduce tasks=1
09/05/04 14:45:58 INFO mapred.JobClient:     Launched map tasks=1
09/05/04 14:45:58 INFO mapred.JobClient:     Data-local map tasks=1
09/05/04 14:45:58 INFO mapred.JobClient:   FileSystemCounters
09/05/04 14:45:58 INFO mapred.JobClient:     FILE_BYTES_READ=533
09/05/04 14:45:58 INFO mapred.JobClient:     HDFS_BYTES_READ=482408
09/05/04 14:45:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1098
09/05/04 14:45:58 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=336
09/05/04 14:45:58 INFO mapred.JobClient:   Map-Reduce Framework
09/05/04 14:45:58 INFO mapred.JobClient:     Reduce input groups=0
09/05/04 14:45:58 INFO mapred.JobClient:     Combine output records=60
09/05/04 14:45:58 INFO mapred.JobClient:     Map input records=10000
09/05/04 14:45:58 INFO mapred.JobClient:     Reduce shuffle bytes=0
09/05/04 14:45:58 INFO mapred.JobClient:     Reduce output records=0
09/05/04 14:45:58 INFO mapred.JobClient:     Spilled Records=120
09/05/04 14:45:58 INFO mapred.JobClient:     Map output bytes=1469996
09/05/04 14:45:58 INFO mapred.JobClient:     Combine input records=244383
09/05/04 14:45:58 INFO mapred.JobClient:     Map output records=244383
09/05/04 14:45:58 INFO mapred.JobClient:     Reduce input records=60

So we’ve processed 10,000 records, which is good. To see what was generated, we do

hadoop dfs -cat output.smi/part-r-00000

and we get

Ag    13
Al    11
Ar    1
As    6
Au    4
B    49
Ba    7
Bi    1
Br    463
C    181452
Ca    5
Cd    2
Cl    2427
....

Thus across the entire 10,000 molecules, there were 13 occurrences of Ag, 181,452 occurrences of carbon and so on.

Something useful?

OK, so this is a rather trivial example. But it was quite simple to create and, more importantly, I should be able to take this jar file and run it on a proper multi-node Hadoop cluster and work with the entire PubChem collection.

A more realistic use case is SMARTS searching. In this case, the mapper would simply emit the molecule title along with an indication of whether it matched the supplied pattern (say 1 for a match, 0 otherwise), and the reducer would collect the key/value pairs for which the value was 1. Since this can be done with SMILES input, it is quite simple, as sketched below.
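
As a rough sketch (illustrative, not code I’ve run), the mapper and reducer for such a search might look like the following. It assumes each input line has the form “SMILES title”, that the SMARTS pattern is passed in through the job configuration under a key I’ve made up (smarts.pattern), and that the classes sit inside a driver like HeavyAtomCount above; the extra CDK imports needed are SMARTSQueryTool and CDKException, and the class names are likewise mine.

    public static class SmartsMatchMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private static final IntWritable ZERO = new IntWritable(0);
        private final SmilesParser parser = new SmilesParser(DefaultChemObjectBuilder.getInstance());
        private SMARTSQueryTool queryTool;

        protected void setup(Context context) throws IOException, InterruptedException {
            try {
                // pattern comes in via the job configuration; the default is the amide
                // pattern from the Pig example in the previous post
                queryTool = new SMARTSQueryTool(context.getConfiguration().get("smarts.pattern", "NC(=O)C(=O)N"));
            } catch (CDKException e) {
                throw new IOException(e);
            }
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+", 2);
            String smiles = fields[0];
            String title = fields.length > 1 ? fields[1] : smiles; // fall back to the SMILES as the title
            try {
                IAtomContainer mol = parser.parseSmiles(smiles);
                context.write(new Text(title), queryTool.matches(mol) ? ONE : ZERO);
            } catch (CDKException e) {
                // skip molecules the CDK cannot parse
            }
        }
    }

    public static class MatchOnlyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable v : values) {
                if (v.get() == 1) {   // keep only the titles that matched
                    context.write(key, v);
                    break;
                }
            }
        }
    }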

A slightly non-trivial task is to apply this framework to SD files. My motivation is that I’d like to run pharmacophore searches across large collections, without having to split up large SD files by hand. Hadoop is an excellent framework for this. The problem is that most Hadoop examples work with line-oriented data files. SD files are composed of multi-line records and so this type of input requires some extra work, which I’ll describe in my next post.

Debugging note

When debugging, it’s useful to edit the Hadoop config file to include

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
  <description>foo</description>
</property>

so that it runs as a single process.

Written by Rajarshi Guha

May 4th, 2009 at 7:24 pm

Easy Parallelization With Clojure


As I noted in my previous post, one of the nice features of Clojure is its support for concurrent programming. It provides some fancy features that allow one to write complex parallel programs, though I’m certainly no expert on that topic. However, one thing that I do every day is perform operations on the elements of a list. Traditionally, this is a serial operation. But what’d be nice is to have my compiler (or environment) perform this operation in parallel over the elements of the list. Clojure provides a very simple way to do this – pmap.

The map form simply applies a function to the elements of a list (or to corresponding elements of multiple lists) in order, returning a list. By prepending a “p”, the operation is done in parallel, making use of as many cores as are present on your system (but see below). Given the ease of this operation, let’s see what we can do with pmap and the CDK.

Based on the previous post, let’s calculate fingerprints in a serial fashion followed by the parallel version and see how long each takes. For completeness, I’ll repeat the code from the previous post. First import our packages and set up some basic objects and functions.

(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
(import '(org.openscience.cdk.fingerprint MACCSFingerprinter))
(import '(org.openscience.cdk.fingerprint Fingerprinter))
(import '(org.openscience.cdk.fingerprint ExtendedFingerprinter))
(import '(org.openscience.cdk.smiles.smarts SMARTSQueryTool))
;; so we can read lines from a file
(use 'clojure.contrib.duck-streams)

(def sp (new SmilesParser (. DefaultChemObjectBuilder (getInstance))))
(def fprinter (new ExtendedFingerprinter))

(defn getmol [smiles] (. sp (parseSmiles smiles)))
(defn getfp [mol] (. fprinter (getFingerprint mol)))

Next, we load the 4,688 molecules from the data file I described previously. In contrast to before, this code is slightly shorter (thanks to Nik) but also uses the doall form. This forces evaluation of the list, so that the molecules are all loaded into memory. In the timing code we again use it, since if we don’t, we just get the time for the list creation step (which is “instantaneous” due to lazy evaluation), rather than the actual list evaluation.

(def mols (doall (map #(getmol (. % trim))
              (read-lines "junk.smi"))))

Now, we can evaluate the fingerprints and time the operation. Initially I performed these calculations on my Macbook Pro with 2GB RAM and a dual-core CPU.

(time (def fpserial (doall (map getfp mols))))

This run took 38.8 s (averaged over three runs). Next we consider the parallel version

(time (def fpparallel (doall (pmap getfp mols))))

This version has an average run time of 23.4 s – a 1.6x speedup. Now, it’s not exactly a two-fold speedup. Part of the reason is that there is some overhead for the threads. Also, even in the serial version, the garbage collector takes up some of the second core, and in the parallel version this will contend with the actual calculation.

Just to be sure that the calculation works OK, let’s compare (via BitSet.equals) the fingerprints obtained using the two versions. We expect the result of the code below to be 0

(count (filter #(not %)
               (map (fn [x y]
                      (. x (equals y)))
                    fpserial fpparallel)))

and that’s exactly what we get.

What about using more cores? I have access to some dual-CPU machines with 8GB of RAM, each CPU having four cores. Repeating the above calculations, the serial version takes 28.1 s and the parallel version takes 8.6 s, a 3.2x speedup. One thing I noted was that this really only uses the cores on one CPU, rather than all eight cores.

One thing that will require more investigation is to what extent we can make use of the CDK in parallel environments, since the library was not designed with thread-safety in mind. For example, parsing the SMILES strings using pmap (after reading in all the lines from the file) gives me an ExecutionException error.
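
One workaround worth trying – sketched in Java, since the CDK itself is a Java library, and assuming the failures stem from shared parser state – is to give each thread its own parser instance so that no CDK object is shared across threads. The wrapper class below is purely illustrative, not a CDK API:

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

// Illustrative helper: each thread lazily gets its own SmilesParser instance.
public class PerThreadSmilesParser {
    private static final ThreadLocal<SmilesParser> PARSER = new ThreadLocal<SmilesParser>() {
        protected SmilesParser initialValue() {
            return new SmilesParser(DefaultChemObjectBuilder.getInstance());
        }
    };

    public static SmilesParser get() {
        return PARSER.get();   // never shared between threads
    }
}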

In any case, it’s very cool that I can use multiple cores just by converting map to pmap.

Written by Rajarshi Guha

March 6th, 2009 at 5:15 pm

Posted in cheminformatics, software


Multi-threaded Database Access with Python


Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching, and to achieve that we’ve been employing the Ballester and Graham-Richards method. In this post I’m going to talk about performance – how we went from a single monolithic database with long query times to multiple databases and significantly faster multi-threaded queries.


Written by Rajarshi Guha

November 14th, 2008 at 4:46 pm