So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘software’ Category

Python, permission and forgiveness

with one comment

This morning I was writing some Python code that needed to perform lookups on a very large map.

# build a large dictionary with 65,000 entries
mapSize = 65000
amap = {}
for i in range(0, mapSize):
    amap['k%d' % (i)] = i

If a key did not exist in the map I needed to take some action. My initial effort performed a pre-emptive check to see whether the key existed in the map:

query = ['k%d' % (x) for x in xrange(100)]                            # 100 keys present in the map
query.extend(['k%d' % (x) for x in xrange(mapSize, mapSize + 100)])   # 100 keys that are absent

def permission():
    # ask for permission: check whether the key exists before accessing it
    for q in query:
        if q in amap.keys():
            val = amap[q]
        else:
            pass

Looking up 200 keys (half of which are present and half absent from the map) took 496 ms (based on Python's timeit module with default settings). On the other hand, if we just go ahead and access the key and deal with its absence by handling the exception, we get a significant improvement.

def forgiveness():
    # ask for forgiveness: access the key and handle the missing-key case
    for q in query:
        try:
            val = amap[q]
        except KeyError:
            pass

Specifically, the same lookups took 150 µs, an improvement of 3306x.
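
For what it's worth, here is a minimal sketch of how timings along these lines could be reproduced with the timeit module (the iteration count below is arbitrary, and absolute numbers will vary by machine):

import timeit

# Time each approach; the reported values are totals over all iterations,
# so divide by 'number' to get a per-call figure
print(timeit.timeit(permission, number=100))
print(timeit.timeit(forgiveness, number=100))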

So, as with many other things in life, it's better to ask Python for forgiveness than permission.

Written by Rajarshi Guha

February 7th, 2013 at 4:19 pm

Posted in software


New version of fingerprint (3.4.9) – faster Dice similarity matrices

without comments

I’ve just pushed a new version of the fingerprint package that contains an update provided by Abhik Seal that significantly speeds up the calculation of pairwise similarity matrices when using the Dice similarity method. I ran a simple comparison using different numbers of random fingerprints (1024 bits, with 512 bits randomly set to one) and measured the time to evaluate the pairwise similarity matrix. As you can see from the figure alongside, the new code is significantly faster (with speed-ups of 450x to 500x). The code to generate the timings is below; it should probably be wrapped in a loop so that each set size is timed multiple times.

library(fingerprint)

# generate lists of 10, 20, ..., 300 random fingerprints (1024 bits, 512 set to one)
fpls <- lapply(seq(10, 300, by = 10),
               function(i) sapply(1:i,
                                  function(x) random.fingerprint(1024, 512)))
# elapsed time to compute the pairwise Dice similarity matrix for each list
times <- sapply(fpls,
                function(fpl) system.time(fp.sim.matrix(fpl, method = 'dice'))[3])

Written by Rajarshi Guha

October 30th, 2012 at 11:10 pm

Chunking lists in R

without comments

A common task for me is to run database queries on gene symbols or compound identifiers. This involves constructing an SQL query as a string and sending that off to the database. In the case of the ROracle package, query strings are limited to 1000 (?) or so characters. This means that directly querying for a thousand identifiers won’t work, and going through the list of identifiers one at a time is inefficient. What we need in this situation is a way to “chunk” the list (or vector) of identifiers and work on individual chunks. With the help of the itertools package, this is very easy:

library(itertools)

# split the vector into chunks of 3 (the last chunk may be shorter)
n <- 1:11
chunk.size <- 3
it <- ihasNext(ichunk(n, chunk.size))
while (itertools::hasNext(it)) {
  achunk <- unlist(nextElem(it))   # each chunk comes back as a list
  print(achunk)
}
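
Applied to the database problem, each chunk can then be pasted into an IN clause and queried separately. The sketch below assumes an open ROracle/DBI connection named con, the table and column names are made up, and the chunk size is chosen so that each query string stays comfortably under the length limit:

library(DBI)
library(itertools)

ids <- sprintf("CMPD%05d", 1:2500)   # hypothetical compound identifiers
it <- ihasNext(ichunk(ids, 50))      # ~50 quoted ids keeps each query well under 1000 characters
results <- list()
while (itertools::hasNext(it)) {
  achunk <- unlist(nextElem(it))
  sql <- sprintf("select * from compound_data where cmpd_id in (%s)",
                 paste(sprintf("'%s'", achunk), collapse = ","))
  results[[length(results) + 1]] <- dbGetQuery(con, sql)   # con is an open ROracle connection
}
final <- do.call(rbind, results)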

Written by Rajarshi Guha

July 5th, 2012 at 2:22 pm

Posted in software


Software for the “Federation of Independent Scientists”

without comments

A few days back, Derek Lowe posted a comment from a reader who suggested that a way to approach the current employment challenges in the pharmaceutical industry would be the formation of a Federation of Independent Scientists. Such a federation would be open to consultants, small companies etc. and would use its size to obtain group rates on various things – journal access, health insurance and so on. Obviously, there’s a lot of detail left out here, and when you get into the nitty gritty a lot of issues arise that don’t have simple answers. Nevertheless, an interesting (and welcome, as evidenced by the comment thread) idea.

One aspect raised by a commenter was access to modeling and docking software by such a group. He mentioned that he’d

… like to see an open source initiative develop a free, open source drug discovery package. Why not, all the underlying force fields and QM models have been published … it would just take a team of dedicated programmers and computational chemists time and passion to create it.

This is the very essence of the Blue Obelisk movement, under whose umbrella there is now a wide variety of computational chemistry and cheminformatics software. There’s certainly no lack of passion in the Open Source chemistry software community. As most of it is based on volunteer effort, time is always an issue. This has a direct effect on the features provided by Open Source chemistry software – such software does not always match up to commercial tools. But as the commenter above pointed out, many of the algorithms underlying proprietary software have been published. It just needs somebody with the time and expertise to implement them. And the combination of these two (in the absence of funding) is not always easy to find.

Of course, having access to the software is just one step. A scientist requires (possibly significant) hardware resources to run the software. Another comment raised this issue and asked about the possibility of a cloud-based install of comp chem software.

With regards the sophisticated modelling tools – do they have to be locally installed?

How do the big pharma companies deploy the software now? I would be very surprised if it wasn’t easily packaged, although I guess the number of people using it is limited.

I’m thinking of some kind of virtual server, or remote desktop style operation. Your individual contractor can connect from wherever, and have full access to a range of tools, then transfer their data back to their own location for safekeeping.

Unlike CloudBioLinux, which provides a collection of bioinformatics and structural biology software as a prepackaged AMI for Amazon’s EC2 platform, I’m not aware of a similarly prepackaged set of Open Source tools for chemistry, and certainly not one based on the cloud. (There are some companies that host comp chem software on the cloud and provide access to these installations for a fee.) While some Linux distributions do package a number of scientific packages (UbuntuScience, for example), I don’t think that these would support a computational drug discovery operation. (The above comment doesn’t necessarily focus just on Open Source software. One could consider commercial software hosted on remote servers, though I wonder what type of licensing would be involved.)

The last component would be the issue of data, primarily for cloud based solutions. While compute cycles on such platforms are usually cheap, bandwidth can be expensive. Granted, chemical data is not as big as biological data (cf. 1000Genomes on AWS), but sending a large collection of conformers over the network may not be very cost-effective. One way to bypass this would be to generate “standard” conformer collections and other such libraries and host them on the cloud. But what is “standard” and who would pay for hosting costs is an open question.

But I do think there is a sufficiently rich ecosystem of Open Source software that could serve much of the computational needs of a “Federation of Independent Scientists”. It’d be interesting to put together a list of Open Source tools based on the requirements from the commenters in that thread.

Written by Rajarshi Guha

April 14th, 2012 at 9:23 pm

“Type-ahead” substructure searches

with one comment

The other day I was exchanging emails with John Van Drie regarding open challenges in cheminformatics (which I’ll say more about later). One of his comments concerned the slow speed of chemical searches

Google searches are screamingly fast, so fast that the type-ahead feature is doing the search as you key characters in.  Why are all chemical searches so sloooow? … Ideally, as you sketch your mol in, the searches should be happening at the same pace, like the typeahead feature.

Now, he doesn’t specifically mention what type of chemical search – it could be exact matches, similarity searches, substructure or pharmacophore searches. The first two can be done very quickly and lend themselves easily to type-ahead search interfaces. In light of the work my colleague Trung has been doing, substructure searches are now also amenable to a type-ahead interface.

So I quickly put together a simple web page that lets you type in a SMILES (or SMARTS) string and, as you type, retrieves the results of a substructure search via the NCTT Search Server REST API. (In some cases the depiction is broken – that’s a bug on my side.) Of course, typing in SMILES is not the most intuitive of interfaces. Since Trung employs the ChemDoodle sketcher, an ideal interface would respond to drawing events (say, drawing a bond or adding atoms) and pull up matches on the fly. Another obvious extension is to rank (or filter) the results, all the while maintaining the near real-time speed of the application.

As I said before, seriously fast substructure searches. It also helps that I can build these examples via a public REST API. I’m sure there are reasons for SOAP, XML and so on. But it’s 2011. So let’s help make extensions and mashups easier.

UPDATE: Yes, it’s easy to create patterns (especially with SMARTS) that DoS the server. We have some filters for excessively generic patterns, so some queries may not behave in the expected manner.

Written by Rajarshi Guha

November 28th, 2011 at 4:07 pm