So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘python’ tag

Python, permission and forgiveness

with one comment

This morning I was writing some Python code that needed to perform lookups on a very large map.

mapSize = 65000
amap = {}
for i in range(0,mapSize):
    amap['k%d' % (i)] = i

If a key did not exist in the map I needed to take some action. My initial effort performed a pre-emptive check to see whether the key existed in the map

query = ['k%d' % (x) for x in xrange(100)]
query.extend(['k%d' % (x) for x in xrange(mapSize+1, mapSize+100)])

def permission():
    for q in query:
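        ## note that in Python 2, amap.keys() builds a list of all 65,000 keys,
        ## so the membership test below is a linear scan on every lookup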
        if q in amap.keys():
            val = amap[q]
        else:
            pass

Looking up 200 keys (half of which are present and half absent from the map) took 496 ms (based on Python’s timeit module with default settings). On the other hand, if we just go ahead and access the key and deal with its absence by handling the exception, we get a significant improvement.

def forgiveness():
    for q in query:
        try:
            val = amap[q]
        except:
            pass

Specifically, 150 µs – an improvement of 3306x.
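
For the record, the timings can be reproduced along these lines (a sketch; the setup strings assume the map, the query list and both functions have been defined in the current module as shown above):

import timeit

for fn in ('permission', 'forgiveness'):
    t = timeit.timeit('%s()' % fn, setup='from __main__ import %s' % fn, number=10)
    print '%s: %g s per call' % (fn, t / 10)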

So, as with many other things in life, it’s better to ask Python for forgiveness than permission.

Written by Rajarshi Guha

February 7th, 2013 at 4:19 pm

Posted in software


Automating ChemDraw

without comments

I’ve been working on a project in which I needed to generate logP values using ChemDraw 12 for thousands of molecules. Since I didn’t have access to the ChemScript module, I needed a way to automate this procedure. After fiddling around with Visual Studio and various debuggers, I came across the Windows Application Test System Using Python (WATSUP). This is a set of Python methods, built on top of the win32api package, that allows one to interact with a Windows GUI programmatically. Thus one can identify a top-level window, get a specific control (button, combo box, etc.) and click on buttons, menu items and so on.

After a little digging around (and liberal use of the dumpWindow() function in WATSUP), I was able to put together a simple bit of code that would load an SDF (containing a single structure) and save it as a CDXML file. For this to work, I make sure that the ChemDraw application is running and the “Chemical Properties” window is visible. On loading an SDF, the chemical properties (stuff like logP, MR, melting point, etc.) get computed automatically. We then “click” the paste button and save to CDXML format. In the resultant CDXML file the chemical property values are included – and these can then be easily extracted using a regex. Here’s the code:

import win32com.client
import win32api
import sys, os, time, datetime, glob, pprint
from watsup.winGuiAuto import *
from watsup.launcher import launchApp,terminateApp

def getButtonControl(toplevel, buttonText):
    elems = dumpWindow(toplevel)
    openButtonHwnd = None
    for elem in elems:
        if len(elem) != 3: continue
        hwnd, name, klass = elem
        if name == buttonText and klass == 'Button':
            openButtonHwnd = hwnd
            break
    openButton = findControls(toplevel, wantedClass="Button", wantedText=buttonText)
    openButton = openButton[openButton.index(openButtonHwnd)]
    return openButton

def setFileName(toplevel, filename):
    fnameCombo = findControl(toplevel, wantedClass='ComboBoxEx32')
    fnameComboEdit = findControl(fnameCombo, wantedClass = "Edit")
    setEditText(fnameComboEdit, filename)
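
## getOutputfileName() is called further down but was not defined in the
## original post; a minimal stand-in (an assumption, not the original helper)
## just swaps the input file's extension for .cdxml
def getOutputfileName(fname):
    return os.path.splitext(fname)[0] + '.cdxml'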
       
filename = "c:\\work\\test.sdf"

form=findTopWindow(wantedText='ChemBioDraw Ultra')
mainMenu = getTopMenu(form)

# click the File->Open menu option
activateMenuItem(form, ('file', 'open'))
time.sleep(0.5)

## OK, we get the file selection window
openWindow = findTopWindow(wantedText='Open')

## get the file type combo box and select the SDF format
ftypeCombo = findControls(openWindow, wantedClass = "ComboBox")    
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 14)

## get the filename combo box and set it
setFileName(openWindow, filename)

## get the open button, click it and get the file
openButton = getButtonControl(openWindow, "&Open")
clickButton(openButton)
time.sleep(1)

## get the properties window and then paste in the
## chemical properties that are autocalculated
propWindow = findTopWindow(wantedText = "Chemical Properties", retryInterval = 0.1, maxWait = 5)
pasteButton = findControl(propWindow, wantedClass="Button", wantedText="Paste")
clickButton(pasteButton)

## now save the file as a cdxml
newfilename = getOutputfileName(filename)
activateMenuItem(form, ("file", "save as"))
time.sleep(0.5)
saveWindow = findTopWindow(wantedText='Save As')
setFileName(saveWindow, newfilename)

## set file type
ftypeCombo = findControls(saveWindow, wantedClass = "ComboBox")
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 1)

## get the save button and click it to save the file
saveButton = getButtonControl(saveWindow, "&Save")
clickButton(saveButton)
time.sleep(1)

## looks like we have to do a save and then we close
activateMenuItem(form, ['file', 'save'])
activateMenuItem(form, ['file', 'close'])

Obviously this approach is an inelegant hack and is dog slow – approximately 2.5 sec per structure. But in the absence of anything else, it gets the job done.

While implementing this solution there were a few quirks. For example, the widgets contained within a window represent a hierarchy. The findControls() method traverses this hierarchy, but does not return the controls of a specific type in the same order on consecutive runs. So to find the appropriate combo box (say, for the file type) I need to do some extra work, rather than just going with the first of the three combo boxes located on an Open dialog. One contributing factor to the slowness is that I needed to insert a few sleep statements here and there, to ensure that the proper windows showed up before I started setting values in the various widgets. Finally, for some reason I had to do a “Save As” followed by a “Save” to get the final CDXML file with all the computed properties.
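
As an aside, the combo box hunt could be wrapped up in a small helper. This is just a sketch (not part of the original script) built from the same WATSUP calls used above: pick the ComboBox control that actually holds the long list of file formats rather than relying on its position.

from watsup.winGuiAuto import findControls, getComboboxItems

def findFileTypeCombo(window, minItems=10):
    ## findControls() does not return controls in a stable order, so look for
    ## the combo box that holds a long list of entries (the file format list)
    combos = findControls(window, wantedClass="ComboBox")
    for combo in combos:
        if len(getComboboxItems(combo)) > minItems:
            return combo
    ## fall back to the last control if none exceeds minItems entries
    return combos[-1]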

Written by Rajarshi Guha

April 13th, 2010 at 6:21 pm

Posted in software


Molecules & MongoDB – Numbers and Thoughts

with 6 comments

In my previous post I had mentioned that key/value or non-relational data stores could be useful in certain cheminformatics applications. I had started playing around with MongoDB and, following Rich’s example, I thought I’d put it through its paces using data from PubChem.

Installing MongoDB was pretty trivial. I downloaded the 64 bit version for OS X, unpacked it and then simply started the server process:

$MONGO_HOME/bin/mongod --dbpath=$HOME/src/mdb/db

where $HOME/src/mdb/db is the directory in which the database will store the actual data. The simplicity is certainly nice. Next, I needed the Python bindings. With easy_install, this was quite painless. At this point I had everything in hand to start playing with MongoDB.
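
For reference, getting the bindings amounts to a one-liner (the package is called pymongo on PyPI):

easy_install pymongo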

Getting data

The first step was to get some data from PubChem. This is pretty easy via their FTP site. I was a bit lazy, so I just made calls to wget rather than use ftplib. The code below will retrieve the first 80 PubChem SD files and uncompress them into the current directory.

import glob, sys, os, time, random, urllib

def getfiles():
    n = 0
    nmax = 80
    for o in urllib.urlopen('ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/').readlines():
        ## the file name is the last field of each line in the FTP directory listing
        o = o.strip().split()[-1]
        os.system('wget %s/%s' % ('ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/', o))
        os.system('gzip -d %s' % (o))
        n += 1
        sys.stdout.write('Got n = %d, %s\r' % (n,o))
        sys.stdout.flush()
        if n == nmax: return

This gives us a total of 1,641,250 molecules.

Loading data

With the MongoDB instance running, we’re ready to connect and insert records into it. For this test, I simply loop over each molecule in each SD file and create a record consisting of the PubChem CID and all the SD tags for that molecule. In this context a record is simply a Python dict, with the SD tags being the keys and the tag values being the values. Since I know the PubChem CID is unique in this collection, I set the special document key “_id” (essentially, the primary key) to the CID. The code to perform this uses the Python bindings to OpenBabel:

from openbabel import *
import glob, sys, os
from pymongo import Connection
from pymongo import DESCENDING

def loadDB(recreate = True):
    conn = Connection()
    db = conn.chem
    if 'mol2d' in db.collection_names():
        if recreate:
            print 'Deleting mol2d collection'
            db.drop_collection('mol2d')
        else:
            print 'mol2d exists. Will not reload data'
            return
    coll = db.mol2d

    obconversion = OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = OBMol()

    n = 0
    files = glob.glob("*.sdf")
    for f in files:
        notatend = obconversion.ReadFile(obmol,f)
        while notatend:
            doc = {}
            sdd = [toPairData(x) for x in obmol.GetData() if x.GetDataType()==PairData]
            for entry in sdd:
                key = entry.GetAttribute()
                value = entry.GetValue()
                doc[key] = value
            doc['_id'] = obmol.GetTitle()
            coll.insert(doc)

            obmol = OBMol()
            notatend = obconversion.Read(obmol)

            n += 1
            if n % 100 == 0:
                sys.stdout.write('Processed %d\r' % (n))
                sys.stdout.flush()

    print 'Processed %d molecules' % (n)

    coll.create_index([ ('PUBCHEM_HEAVY_ATOM_COUNT', DESCENDING)  ])
    coll.create_index([ ('PUBCHEM_MOLECULAR_WEIGHT', DESCENDING)  ])

Note that this example loads each molecule on its own and takes a total of 2015.020 sec. It has been noted that bulk loading (i.e., inserting a list of documents rather than individual documents) can be more efficient. I tried this, loading 1000 molecules at a time. But this time round the load time was 2224.691 sec – certainly not an improvement!
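
For reference, the batched version is essentially the following (a sketch, not the exact code I timed; insert() in this version of pymongo accepts a list of documents):

from pymongo import Connection

def loadDocsBatched(docs, batchSize=1000):
    ## docs is any iterable of dicts, e.g. built from the SD tags as in loadDB()
    coll = Connection().chem.mol2d
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batchSize:
            coll.insert(batch)
            batch = []
    ## flush whatever is left over
    if batch:
        coll.insert(batch)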

Note that the “_id” key is the primary key, and thus queries on this field are extremely fast. MongoDB also supports indexes, and the code above creates indexes on the PUBCHEM_HEAVY_ATOM_COUNT and PUBCHEM_MOLECULAR_WEIGHT fields.

Queries

The simplest query is to pull up records based on CID. I selected 8000 CIDs randomly and evaluated how long it’d take to pull up the records from the database:

from pymongo import Connection

def timeQueryByCID(cids):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    for cid in cids:
        result = coll.find( {'_id' : cid} ).explain()

The above code takes 2351.95 ms, averaged over 5 runs. This comes out to about 0.3 ms per query. Not bad!
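
For completeness, the 8000 random CIDs can be drawn directly from the collection (a sketch; this is an assumption about how the test set was built, since the selection code wasn’t shown):

import random
from pymongo import Connection

coll = Connection().chem.mol2d
allCids = [doc['_id'] for doc in coll.find()]
cids = random.sample(allCids, 8000)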

Next, let’s look at queries that use the heavy atom count field that we had indexed. For this test I selected 30 heavy atom count values at random and performed a query for each value. I retrieved the query time as well as the number of hits via explain().

from pymongo import Connection

def timeQueryByHeavyAtom(natom):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    o = open('time-natom.txt', 'w')
    for i in natom:
        c = coll.find( {'PUBCHEM_HEAVY_ATOM_COUNT' : i} ).explain()
        nresult = c['n']
        elapse = c['millis']
        o.write('%d\t%d\t%f\n' % (i, nresult, elapse))
    o.close()

A summary of these queries is shown in the graphs below.

One of the queries is anomalous – there are 93K molecules with 24 heavy atoms, but the query is performed in 139 ms. This might be due to priming while I was testing code.

Some thoughts

One thing that was apparent from the little I’ve played with MongoDB is that it’s extremely easy to use. I’m sure that larger installs (say on a cluster) could be more complex, but for single user apps, setup is really trivial. Furthermore, basic operations like insertion and querying are extremely easy. The idea of being able to dump any type of data (as a document) without worrying whether it will fit into a pre-defined schema is a lot of fun.

However, its advantages also seem to be its limitations (though this is not specific to MongoDB). This was also noted in a comment on my previous post. It seems that MongoDB is very efficient for simple queries. One of the things that I haven’t properly worked out is whether this type of system makes sense for a molecule-centric database. The primary reason is that molecules can be referred to by a variety of identifiers. For example, when searching PubChem, a query by CID is just one of the ways one might pull up data. As a result, any database holding this type of data will likely require multiple indices. So, why not stay with an RDBMS? Furthermore, in my previous post I had mentioned that a cool feature would be the ability to dump molecules from arbitrary sources into the DB, without worrying about fields. While very handy when loading data, it does present some complexities at query time. How does one perform a query over all molecules? This can be addressed in multiple ways (registration etc.), but that is essentially what must be done in an RDBMS scenario.

Another thing that became apparent is the fact that MongoDB and its ilk don’t support JOINs. While the current example doesn’t really highlight this, it is easy to imagine adding, say, bioassay data and then querying both tables using a JOIN. In contrast, the NoSQL approach is to perform multiple queries and then do the join in your own code. This seems inelegant and a bit painful (at least for the types of applications that I work with).
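
To make that concrete, an application-side join with pymongo looks something like the sketch below. The assayresults collection and its assay_id, cid and potency fields are hypothetical, purely for illustration:

from pymongo import Connection

db = Connection().chem
for assay in db.assayresults.find({'assay_id': 1234}):
    ## a second query per row, in place of what a SQL JOIN would do server-side
    mol = db.mol2d.find_one({'_id': assay['cid']})
    if mol is not None:
        print assay['cid'], assay['potency'], mol.get('PUBCHEM_MOLECULAR_WEIGHT')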

Finally, one of my interests was to make use of the map/reduce functionality in MongoDB. However, it appears that such queries must be implemented in Javascript. As a result, performing cheminformatics operations (using some other language or external libraries) within map or reduce functions is not currently possible.
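
For example, counting molecules per heavy atom count via map/reduce looks something like the sketch below. The map and reduce bodies are JavaScript strings, and the pymongo import and map_reduce() call are assumptions for the version I was using:

from pymongo import Connection
from pymongo.code import Code   ## assumption: Code lives here in this pymongo version

mapFn = Code("function() { emit(this.PUBCHEM_HEAVY_ATOM_COUNT, 1); }")
reduceFn = Code("function(key, values) {"
                "  var n = 0;"
                "  for (var i = 0; i < values.length; i++) n += values[i];"
                "  return n; }")

coll = Connection().chem.mol2d
result = coll.map_reduce(mapFn, reduceFn)   ## returns a collection of {_id, value} docs
for doc in result.find():
    print doc['_id'], doc['value']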

But of course, NoSQL DBs were not designed to replace RDBMSs. Both technologies have their place, and I don’t believe that one is better than the other – just that one might be better suited to a given application than the other.

Written by Rajarshi Guha

February 8th, 2010 at 2:18 am

Frequency of a Term via PubMed

with 6 comments

A little while back, Egon posted a question on FriendFeed, asking whether there was an easy way, preferably a service, to determine and plot the usage count of a term in PubMed by year. This is simple enough using the Entrez Utilities CGI. A quick Python script to do this (with minimal error checking) is given below. It’d be relatively trivial to wrap this as a mod_python application and generate a bar plot directly (either using Python or one of the online charting APIs).

import urllib
import xml.etree.ElementTree as ET

u = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=%s&mindate=%d/01/01&maxdate=%d/12/31"
term = "artemisinin resistance"
startYear = 1998
endYear = 2009
for year in range(startYear, endYear+1):
    url = u % (term.replace(" ", "+"), year, year)
    page = urllib.urlopen(url).read()
    doc = ET.XML(page)
    count = doc.find("Count").text
    print year, count

Update 1

A little more hacking and the above code was converted to a mod_python application, which can be accessed using a URL of the form http://rest.rguha.net/usage/usage.py?term=TERM&syear=1997&eyear=2009. With the help of the handy pygooglechart module, the above URL returns an <img> tag containing the appropriate Google Charts URL. As an example, the term “artemisinin resistance” results in this image.
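
The charting part is essentially a few lines of pygooglechart. The sketch below is not the deployed code; the class and method names follow the pygooglechart examples, and the counts are placeholder values:

from pygooglechart import SimpleLineChart, Axis

years = range(1998, 2010)
counts = [1, 1, 2, 2, 3, 4, 5, 7, 9, 12, 15, 20]   ## placeholder values, one per year

chart = SimpleLineChart(400, 250, y_range=(0, max(counts)))
chart.add_data(counts)
chart.set_axis_labels(Axis.BOTTOM, [str(y) for y in years])
print '<img src="%s"/>' % (chart.get_url())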

Update 2

Jan Schoones pointed out in a comment that my artemisinin resistance example was slightly incorrect, as the resultant PubMed search does not search for the exact phrase, but rather looks for documents that contain the words “artemisinin” and “resistance”. This is because the example URL does not include the quotes around the phrase. A more correct example would be here, where we search for the phrase rather than the individual words.
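
In code terms, the fix is just to keep the quotes in the phrase and URL-encode it, rather than replacing spaces with ‘+’, before substituting into the esearch URL template used above:

import urllib

u = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?term=%s&mindate=%d/01/01&maxdate=%d/12/31"
term = '"artemisinin resistance"'
year = 2009
url = u % (urllib.quote(term), year, year)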

Written by Rajarshi Guha

November 10th, 2009 at 11:50 pm

Posted in software


Preprocessing ONS Solubility Data

without comments

With the semester winding up and preparing to move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project, and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So I finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as flagged by the DONOTUSE string) and optionally filter rows based on a specified string. This lets me get the whole dataset in one go, or just the data for methanol, etc. Note that it doesn’t dump all the columns from the original spreadsheet – just the columns I need for modeling. It’s a very simplistic script that dumps the final data in tab-delimited format to STDOUT.

## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
    print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

-h, --help            This message
-a, --agg FUNC        Aggregate function. Valid values can be
                      'mean', 'median', 'min', 'max'.
                      Currently this is not supported
-f, --filter STRING   Only include rows that have a column that exactly
                      matches the specified STRING
-d, --dry             Don't output data, just report some stats
"""


url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':

    fstring = None
    agg = None
    dry = False

    solubilities = []

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
    except getopt.GetoptError:
        usage()
        sys.exit(-1)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(1)
        elif opt in ('-f', '--filter'):
            fstring = arg
        elif opt in ('-d', '--dry'):
            dry = True

    data = csv.reader(urllib.urlopen(url))
    data.next()
    c = 2
    for row in data:
        line = [x.strip() for x in row]
        if len(line) != 25:
            print 'ERROR (Line %d) %s' % (c, ','.join(line))
            c += 1
            continue
        solubilities.append( (line[idx_solute_name],
                              line[idx_solute_smiles],
                              line[idx_solvent_name],
                              line[idx_conc],
                              line[idx_state]) )
        c += 1

    if dry:
        print 'Got %d entries' % (len(solubilities))
    solubilities = [ x for x in solubilities if x[0].find("DONOTUSE") == -1]
    if dry:
        print 'Got %d entries after filtering invalid entries' % (len(solubilities))

    if not dry:
        for row in solubilities:
            if fstring:
                if any(map(lambda x: x == fstring, row)):
                    print '\t'.join([str(x) for x in row])
                    continue
            else:
                print '\t'.join([str(x) for x in row])
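
As an aside, the --agg option advertised in the usage message is not actually implemented above. A sketch of what it might do (grouping by solute and solvent and collapsing the concentrations with the chosen function) is:

def aggregate(rows, func=min):
    ## rows are the (name, smiles, solvent, conc, state) tuples built above;
    ## func can be min, max or any callable that reduces a list of floats
    groups = {}
    for name, smiles, solvent, conc, state in rows:
        try:
            conc = float(conc)
        except ValueError:
            continue
        groups.setdefault((name, smiles, solvent), []).append(conc)
    return [(k[0], k[1], k[2], func(v)) for k, v in groups.items()]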

Written by Rajarshi Guha

April 14th, 2009 at 8:40 pm

Posted in software
