Cheminformatics and R at the EBI

Finally, all the paperwork is in place for me to visit the EBI in May to run a hands-on workshop on doing cheminformatics in R. This is part of a two-day program. On May 17th I will be discussing the use of the rcdk and rpubchem packages, considering a variety of use cases such as QSAR modeling and fingerprint similarity. I’ll also be touching on the CDK, as that’s the underlying toolkit. The second day, May 18th, run by Stefan Neumann, Paul Benton and David Broadhurst, will focus on handling and analyzing mass spectrometry data in R.

I’m really looking forward to this trip and the chance to discuss these packages as well as get feedback on what’s useful and what’s not. If any readers are attending and would like me to cover specific topics, feel free to drop me a line.

I’ll be reaching London on May 14th, visiting some friends and then heading to Hinxton on May 16th. Another reason I’m looking forward to the trip (in addition to some good fish and chips) is that I get to visit Harrogate, where I grew up.

Simple XML Parsing with Clojure

A while back I started playing with Clojure. It’s always been a spare-time hobby, and not having had much spare time, I haven’t gotten as far with it as I’d have liked. I’m still not sure why I like Clojure, but it is fun to code in. My interest was revitalized when I came across a Clojure group located in the D.C. area. So, following on from my previous post on geo-referencing PubMed articles, I decided to take a stab at doing the whole thing in Clojure.

One of the tasks in this project is to query PubMed using the EUtils CGIs and parse out the information from the XML document that is returned. It turns out that parsing XML documents or strings is very easy in Clojure. The parse function in the clojure.xml namespace parses an XML document and returns a tree of tags. Passing that tree to xml-zip from the clojure.zip namespace creates a zipper data structure, and specific elements are extracted by filtering the zipper by the path to the desired element. It’s a lot like the ElementTree module in Python (but doesn’t require that I insert namespaces before each and every element in the path!). We start off by defining our own namespace and requiring the relevant libraries:

(ns entrez
  (:require [clojure.xml :as xml])
  (:require [clojure.zip :as zip])
  (:require [clojure.contrib.zip-filter.xml :as zf]))

Next, we define some helper functions:

(defn get-ids
  "Extract specific elements from an XML document"
  [zipper]
  (zf/xml-> zipper :IdList :Id zf/text))

(defn get-affiliations
  "Extract affiliations from PubMed abstracts"
  [zipper]
  (map (fn [x y] (list x y))
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :PMID zf/text)
       (zf/xml-> zipper :PubmedArticle :MedlineCitation :Article :Affiliation zf/text)))

Finally, we can get the IDs from an esearch query by saving the results to a file and then running:

(println (get-ids
      (zip/xml-zip
       (xml/parse "esearch.xml"))))

or extract affiliations from a set of PubMed abstracts obtained via an efetch query:

(println (get-affiliations
      (zip/xml-zip
       (xml/parse "efetch.xml"))))

In the next post I’ll show some code to actually perform the queries via EUtils so that we don’t need to save results to files.

ChEMBL in RDF and Other Musings

Earlier today, Egon announced the release of an RDF version of ChEMBL, hosted at Uppsala. A nice feature of this setup is that one can play around with the data via SPARQL queries as well as explore the classes and properties that the Uppsala folks have implemented. Having fiddled with SPARQL on and off, I found it nice to play with ChEMBL since it contains such a wide array of data types. For example, the query below finds articles referring to an assay (or experiment) run in mice targeting isomerases:

PREFIX  chembl:  <http://rdf.farmbio.uu.se/chembl/onto/#>
SELECT DISTINCT ?x ?pmid ?pdesc ?DESC WHERE {
?protein chembl:hasKeyword "Isomerase" .
?x chembl:hasTarget ?protein .
?protein chembl:hasDescription ?pdesc .
?x chembl:organism  "Mus musculus" .
?x chembl:hasDescription ?DESC .
?x chembl:extractedFrom ?resource .
?resource <http://purl.org/ontology/bibo/pmid> ?pmid
}

I’ve been following the discussion on RDF and the Semantic Web for some time. While I can see a number of benefits from this approach, I’ve never been fully convinced of its utility. In too many cases, the use cases I’ve seen (such as the one above) could have been handled relatively trivially via traditional SQL queries. There hasn’t been a really novel use case that leads to an ‘Aha! So that’s what it’s good for’ moment.

Egon’s announcement today led to a discussion on FriendFeed. I think I finally got the point that SPARQL queries are not magic and could indeed be replaced by traditional SQL. The primary value in RDF is the presence of linked data – which is slowly accumulating in the life sciences (cf. LODD and Bio2RDF).

Of the various features of RDF that I’ve heard about, the ability to define and use equivalence relationships seems very useful. I can see this being used to jump from domain to domain by recognizing properties that are equivalent across domains. Yet, as far as I can tell, this requires that somebody define these equivalences manually. If we have to do that, one could argue that it’s not really different from defining a mapping table to link two RDBMSs.

But I suppose in the end what I’d like to see is all this RDF data being used to perform automated or semi-automated inferencing. In other words, what non-obvious relationships can be drawn from a collection of facts and relationships? In the absence of that, I am not necessarily pulling out a novel relationship (though I may be pulling out facts that I did not previously know) by constructing a SPARQL query. Is such inferencing even possible?

Along those lines, I considered an interesting set of linked data – could we generate a geographically annotated version of PubMed? Essentially, identify a city and country for each PubMed ID. This could be converted to RDF and linked to other sources. One could start asking questions such as “are people around me working on a certain topic?” or “what proteins are the focus of research in region X?” Clearly, such a dataset does not require RDF per se. But given that geolocation data is qualitatively different from, say, UniProt IDs and PubMed IDs, it’d be interesting to see whether anything came of this. As a first step, here’s BioPython code to retrieve the Affiliation field from PubMed entries published in 2009 and 2010:

from Bio import Entrez

startYear = 2009
endYear = 2010

Entrez.email = "some@email.id"
h = Entrez.esearch(db='pubmed', term='%d:%d[dp]' % (startYear,endYear), retmax=1000000)
records = Entrez.read(h)['IdList']
print 'Got %d records' % (len(records))
o = open('geo.txt', 'w')
for pmid in records:
    print 'Processing PMID %s' % (pmid)
    hf = Entrez.efetch(db='pubmed', id=pmid, retmode='xml', rettype='full')
    details = Entrez.read(hf)[0]
    try:
        aff = details['MedlineCitation']['Article']['Affiliation']
    except KeyError:
        print '%s had no affiliation' % (pmid)
        continue
    try:
        o.write('%s\t%s\n' % (pmid, aff.encode('latin-1')))
    except UnicodeEncodeError:
        print "Can't encode for %s" % (pmid)
        continue
o.close()

Using data from the National Geospatial Agency, it shouldn’t be too difficult to link PubMed IDs to geography.
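
As a rough sketch of the linking step, one could scan each affiliation string against a gazetteer of place names. The code below assumes a hypothetical tab-delimited gazetteer file (places.txt) with city and country columns; the real gazetteer files have many more fields and would need smarter matching than a naive substring test.

def load_gazetteer(fname='places.txt'):
    """Hypothetical gazetteer: one 'city<tab>country' pair per line."""
    places = []
    for line in open(fname):
        city, country = line.rstrip('\n').split('\t')
        places.append((city, country))
    return places

def geotag(affiliation, places):
    """Naive matching: return the first (city, country) mentioned in the affiliation."""
    for city, country in places:
        if city in affiliation:
            return city, country
    return None

# link the PMID/affiliation pairs written out by the code above
places = load_gazetteer()
for line in open('geo.txt'):
    pmid, aff = line.rstrip('\n').split('\t', 1)
    hit = geotag(aff, places)
    if hit:
        print '%s\t%s\t%s' % (pmid, hit[0], hit[1])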

Molecules & MongoDB – Numbers and Thoughts

In my previous post I had mentioned that key/value or non-relational data stores could be useful in certain cheminformatics applications. I had started playing around with MongoDB and, following Rich’s example, I thought I’d put it through its paces using data from PubChem.

Installing MongoDB was pretty trivial. I downloaded the 64-bit version for OS X, unpacked it and then simply started the server process:

$MONGO_HOME/bin/mongod --dbpath=$HOME/src/mdb/db

where $HOME/src/mdb/db is the directory in which the database will store the actual data. The simplicity is certainly nice. Next, I needed the Python bindings. With easy_install, this was quite painless. At this point I had everything in hand to start playing with MongoDB.

Getting data

The first step was to get some data from PubChem. This is pretty easy via their FTP site. I was a bit lazy, so I just made calls to wget rather than use ftplib. The code below will retrieve the first 80 PubChem SD files and uncompress them into the current directory.

import glob, sys, os, time, random, urllib

def getfiles():
    n = 0
    nmax = 80
    baseurl = 'ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/'
    # each line of the FTP listing describes one file; the last field is the file name
    for line in urllib.urlopen(baseurl).read().split('\n'):
        if not line.strip(): continue
        o = line.strip().split()[-1]
        os.system('wget %s%s' % (baseurl, o))
        os.system('gzip -d %s' % (o))
        n += 1
        sys.stdout.write('Got n = %d, %s\r' % (n, o))
        sys.stdout.flush()
        if n == nmax: return

This gives us a total of 1,641,250 molecules.

Loading data

With the MongoDB instance running, we’re ready to connect and insert records into it. For this test, I simply loop over each molecule in each SD file and create a record consisting of the PubChem CID and all the SD tags for that molecule. In this context a record is simply a Python dict, with the SD tags as keys and the tag values as values. Since I know the PubChem CID is unique in this collection, I set the special document key “_id” (essentially, the primary key) to the CID. The code to perform this uses the Python bindings to OpenBabel:

from openbabel import *
import glob, sys, os
from pymongo import Connection
from pymongo import DESCENDING

def loadDB(recreate = True):
    conn = Connection()
    db = conn.chem
    if 'mol2d' in db.collection_names():
        if recreate:
            print 'Deleting mol2d collection'
            db.drop_collection('mol2d')
        else:
            print 'mol2d exists. Will not reload data'
            return
    coll = db.mol2d

    obconversion = OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = OBMol()

    n = 0
    files = glob.glob("*.sdf")
    for f in files:
        notatend = obconversion.ReadFile(obmol,f)
        while notatend:
            doc = {}
            sdd = [toPairData(x) for x in obmol.GetData() if x.GetDataType()==PairData]
            for entry in sdd:
                key = entry.GetAttribute()
                value = entry.GetValue()
                doc[key] = value
            doc['_id'] = obmol.GetTitle()
            coll.insert(doc)

            obmol = OBMol()
            notatend = obconversion.Read(obmol)

            n += 1
            if n % 100 == 0:
                sys.stdout.write('Processed %d\r' % (n))
                sys.stdout.flush()

    print 'Processed %d molecules' % (n)

    coll.create_index([ ('PUBCHEM_HEAVY_ATOM_COUNT', DESCENDING)  ])
    coll.create_index([ ('PUBCHEM_MOLECULAR_WEIGHT', DESCENDING)  ])

Note that this example loads each molecule on its own and takes a total of 2015.020 sec. It has been noted that bulk loading (i.e., inserting a list of documents rather than individual documents) can be more efficient. I tried this, loading 1000 molecules at a time. But this time round the load time was 2224.691 sec – certainly not an improvement!
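
For reference, the batched variant is a small change to the loading loop: buffer the documents and pass a list to insert(). The sketch below reuses the pymongo Connection API from above; loadBatch and the batch size are just illustrative, not the code used for the timing runs.

from pymongo import Connection

def loadBatch(docs, batchSize=1000):
    """Insert documents in batches of batchSize rather than one at a time."""
    coll = Connection().chem.mol2d
    buf = []
    for doc in docs:
        buf.append(doc)
        if len(buf) == batchSize:
            coll.insert(buf)   # insert() also accepts a list of documents
            buf = []
    if buf:
        coll.insert(buf)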

Note that the “_id” key acts as a primary key, and thus queries on this field are extremely fast. MongoDB also supports secondary indexes, and the code above creates indexes on the PUBCHEM_HEAVY_ATOM_COUNT and PUBCHEM_MOLECULAR_WEIGHT fields.

Queries

The simplest query is to pull up records based on CID. I selected 8000 CIDs randomly and evaluated how long it’d take to pull up the records from the database:

from pymongo import Connection

def timeQueryByCID(cids):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    for cid in cids:
        result = coll.find( {'_id' : cid} ).explain()

The above code takes 2351.95 ms, averaged over 5 runs. This comes out to about 0.3 ms per query. Not bad!

Next, let’s look at queries that use the heavy atom count field that we had indexed. For this test I selected 30 heavy atom count values randomly and performed a query for each value, retrieving the query time as well as the number of hits via explain().

from pymongo import Connection

def timeQueryByHeavyAtom(natom):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    o = open('time-natom.txt', 'w')
    for i in natom:
        c = coll.find( {'PUBCHEM_HEAVY_ATOM_COUNT' : i} ).explain()
        nresult = c['n']
        elapse = c['millis']
        o.write('%d\t%d\t%f\n' % (i, nresult, elapse))
    o.close()

A summary of these queries is shown in the graphs below.

One of the queries is anomalous – there are 93K molecules with 24 heavy atoms, yet the query completes in 139 ms. This might be due to caching (priming) from earlier runs while I was testing the code.

Some thoughts

One thing that was apparent from the little I’ve played with MongoDB is that it’s extremely easy to use. I’m sure that larger installs (say on a cluster) could be more complex, but for single user apps, setup is really trivial. Furthermore, basic operations like insertion and querying are extremely easy. The idea of being able to dump any type of data (as a document) without worrying whether it will fit into a pre-defined schema is a lot of fun.

However, its advantages also seem to be its limitations (though this is not specific to MongoDB). This was also noted in a comment on my previous post. It seems that MongoDB is very efficient for simple queries. One of the things that I haven’t properly worked out is whether this type of system makes sense for a molecule-centric database. The primary reason is that molecules can be referred to by a variety of identifiers. For example, when searching PubChem, a query by CID is just one of the ways one might pull up data. As a result, any database holding this type of data will likely require multiple indices. So, why not stay with an RDBMS? Furthermore, in my previous post I had mentioned that a cool feature would be the ability to dump molecules from arbitrary sources into the DB, without worrying about fields. While very handy when loading data, this does present some complexities at query time. How does one perform a query over all molecules? This can be addressed in multiple ways (registration etc.) but is essentially what must be done in an RDBMS scenario.

Another thing that became apparent is that MongoDB and its ilk don’t support JOINs. While the current example doesn’t really highlight this, it is easy to imagine adding, say, bioassay data and then querying both collections as one would two tables with a JOIN. In contrast, the NoSQL approach is to perform multiple queries and then do the join in your own code, as sketched below. This seems inelegant and a bit painful (at least for the types of applications that I work with).
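
To make the client-side join concrete, here is a minimal sketch. It assumes a hypothetical assay collection whose documents carry aid and cid fields; neither that collection nor the field names come from the data loaded above.

from pymongo import Connection

def compoundsForAssay(aid):
    """Client-side 'join': find the CIDs tested in an assay, then pull the compound documents."""
    db = Connection().chem
    # first query: hypothetical 'assay' collection mapping assay IDs (aid) to compound CIDs (cid)
    cids = [doc['cid'] for doc in db.assay.find({'aid': aid}, ['cid'])]
    # second query: pull the corresponding compound documents by primary key
    return list(db.mol2d.find({'_id': {'$in': cids}}))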

Finally, one of my interests was to make use of the map/reduce functionality in MongoDB. However, it appears that such queries must be implemented in JavaScript. As a result, performing cheminformatics operations (using some other language or external libraries) within the map or reduce functions is not currently possible.
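
To make that point concrete, here is a minimal sketch of a map/reduce job run from pymongo, counting molecules by heavy atom count: the map and reduce steps are JavaScript strings executed inside the server, so nothing from the Python side (or from a cheminformatics toolkit) can run inside them. The output collection name is arbitrary, and the call assumes a pymongo version that still provides Collection.map_reduce.

from pymongo import Connection
from bson.code import Code

def countByHeavyAtoms():
    coll = Connection().chem.mol2d
    # both functions are JavaScript and run inside the MongoDB server
    mapper = Code("function() { emit(this.PUBCHEM_HEAVY_ATOM_COUNT, 1); }")
    reducer = Code("function(key, values) { var t = 0; values.forEach(function(v) { t += v; }); return t; }")
    result = coll.map_reduce(mapper, reducer, "heavy_atom_counts")
    return list(result.find())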

But of course, NoSQL DBs were not designed to replace RDBMSs. Both technologies have their place, and I don’t believe that one is better than the other – just that one might be better suited to a given application than the other.

Cheminformatics and Non-Relational Datastores

Over the past year or so I’ve been seeing a variety of non-relational data stores coming up. They also go by terms such as document databases or key/value stores (or even NoSQL databases). These systems are alternatives to traditional RDBMSs in that they do not require an explicit schema defined a priori. While they do not offer the transactional (ACID) guarantees of RDBMSs, they promise flexibility, speed and scalability. Examples include CouchDB, MongoDB and Tokyo Cabinet. Pierre and Brad have described some examples of using CouchDB with bioinformatics data, and Rich has started a series on the use of CouchDB to store PubChem data.

Having used RDBMSs such as PostgreSQL and Oracle for some time, I’ve wondered how or why one might use these systems for cheminformatics applications. Rich’s posts describe how one might go about using CouchDB to store SD files, but it wasn’t clear to me what advantage it provided over, say, PostgreSQL.

I now realize that if you wanted to store arbitrary chemical data from multiple sources, a document-oriented database makes life significantly easier than a traditional RDBMS. While Rich’s post considers SD files from PubChem (which all have the same set of SD tags), CouchDB and its ilk become really useful when one considers SD files from arbitrary sources. Thus, if one were designing a chemical registration system, the core would involve storing structures and an associated identifier. However, if the compounds came with arbitrary fields attached to them, how can we easily and efficiently store them? It’s certainly doable via SQL (put each field name into a ‘dictionary’ table, and so on), but it seems a little hacky.

On the other hand, one could trivially transform an SD-formatted structure into a JSON-like document and then dump that into CouchDB (a sketch of this step is shown below). In other words, one need not worry about updating a schema. Things become more interesting when storing associated non-structural data – assays, spectra and so on. When I initially set up the IU PubChem mirror, it was tricky to store all the bioassay data since the schemas for individual assays were not necessarily identical. But I now see that such a scenario is perfect for a document-oriented database.
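
As a rough illustration of the SD-to-document step, here is a minimal sketch using Pybel (the higher-level OpenBabel Python interface) and the couchdb package. The database name and the use of the molecule title as the document ID are assumptions for the example, not a prescription.

import pybel
import couchdb

def loadSDF(path, dbname='chem'):
    """Turn each SD record's tag/value block into a document and store it in CouchDB."""
    server = couchdb.Server()                      # assumes CouchDB running on localhost:5984
    db = server[dbname] if dbname in server else server.create(dbname)
    for mol in pybel.readfile('sdf', path):
        doc = dict(mol.data)                       # SD tags become document fields
        doc['_id'] = mol.title                     # e.g. the PubChem CID
        db.save(doc)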

However, some questions still remain. Most fundamentally, how does not having a schema affect query performance? If I were to dump all compounds in PubChem into CouchDB, pulling out details for a given compound ID should be very fast. But what if I wanted to retrieve compounds with a molecular weight less than 250? In a traditional RDBMS, the molecular weight would be a column, preferably with an index, so such queries would be fast. But if the molecular weight is just a document property, it’s not clear that such a query would (or could) be very fast in a document-oriented DB (would it require linear scans?). I note that I haven’t RTFM’d, so I’d be happy to be corrected!
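
For comparison, in MongoDB such a range query is straightforward once the field is indexed, much as with a column index in an RDBMS. Here is a minimal sketch against the collection from the previous post; it assumes the molecular weight has been stored as a number rather than the raw string that comes out of the SD tag.

from pymongo import Connection

def lightCompounds(maxMW=250):
    """Range query on an indexed field: compounds with molecular weight below maxMW."""
    coll = Connection().chem.mol2d
    coll.create_index('PUBCHEM_MOLECULAR_WEIGHT')
    # $lt uses the index for a range scan rather than a full collection scan
    return coll.find({'PUBCHEM_MOLECULAR_WEIGHT': {'$lt': maxMW}})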

However, I’d expect that substructure search performance wouldn’t differ much between the two types of database systems. In fact, with the map/reduce features of CouchDB and MongoDB, such searches could be significantly faster (though Oracle is capable of parallel queries). This also leads to the interesting topic of how one would integrate cheminformatics capabilities into a document-oriented DB (akin to a cheminformatics cartridge for an RDBMS).

So it looks like I’m going to have to play around and see how all this works.