So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for September, 2010

The CDK is 10 Years Old

with 2 comments

As Egon has pointed out, the CDK project started 10 years ago tomorrow – congratulations to everybody involved in the project. Egon also deserves a huge vote of thanks for keeping the project going – not only in terms of code contributions but also the “grunt” work such as releases, bug fixes and documentation. That work might be boring, but it is vital to making the library usable.

I started hacking on the CDK around 2004 (as a way to learn Java!) and while it still has rough edges, it’s become quite capable. As Egon points out, we’ve come a long way, but it’s still a completely volunteer project, and so things like bug fixes and feature requests take time to get resolved (some have been languishing for months or years). Certainly, external funding would be great (I still think that the NIH was short-sighted in not funding our software development grant), but in the absence of that we’ll forge ahead in our free time. I liked Egon’s list of things to happen in 2011, and I’d love to see the implementation of some force fields – UFF is a start, and MMFF94 would be really useful – as that’d begin to address one of my wishlist items, viz., GRID descriptors.

But anyway, tomorrow we pop some champagne (virtually).

Written by Rajarshi Guha

September 26th, 2010 at 1:47 pm

Posted in cheminformatics, software


New Versions of rcdk and rcdklibs

with 2 comments

I’ve released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplify a number of cheminformatics tasks. These versions use CDK 1.3.6 and JCP 16, so we now get access to SMSD and a few new descriptors. In addition, some new methods have been included (a short usage sketch follows the list):

  • cdk.version to get the version of the CDK being used by the package
  • is.subgraph uses SMSD to identify substructures. It is similar to the pre-existing matches method, but much faster in general (though you cannot specify SMARTS)
  • get.mcs again wraps SMSD and allows one to retrieve the MCS (as a molecule object or as atom indices) for a pair of molecules. Once again, it should be pretty fast
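
As a quick illustration, here is a minimal sketch of how these might be used from R. It is just a sketch: the SMILES are arbitrary, and the argument order for is.subgraph and the as.molecule flag for get.mcs are my assumptions based on the descriptions above, so check the package documentation for the real signatures.

library(rcdk)

## which version of the CDK is the package built against?
cdk.version()

## parse a couple of SMILES into molecule objects
mols <- parse.smiles(c("c1ccccc1C(=O)NC", "CC(=O)Nc1ccc(O)cc1"))

## SMSD-based substructure test (argument order assumed: query SMILES, then target molecule)
is.subgraph("NC=O", mols[[1]])

## maximum common substructure of the two molecules, returned as a molecule object
## (as.molecule = FALSE would return atom indices -- flag name assumed)
mcs <- get.mcs(mols[[1]], mols[[2]], as.molecule = TRUE)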

Written by Rajarshi Guha

September 25th, 2010 at 10:05 pm

Posted in cheminformatics, software


Author Count Frequencies in PubMed

without comments

Earlier today, Emily Wixson posted a question on the CHMINF-L list asking

… if there is any way to count the number of authors of papers with specific keywords in the title by year over a decade …

Since I had some code compiling and databases loading, I took a quick stab using Python and the Entrez services. The query provided by Emily was:

(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])

The Python code to retrieve all the relevant PubMed IDs and then process the PubMed entries to extract the article ID, year and number of authors is below. For some reason this query also retrieves articles from before 2001 and articles with no year or zero authors, but we can easily filter those entries out.

import urllib, urllib2, sys
import xml.etree.ElementTree as ET

# split the full list of PubMed IDs into fixed-size chunks for batched EFetch calls
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in xrange(0, len(seq), size))

query = '(RNA[Title] OR "ribonucleic acid"[Title]) AND ("2009"[Publication Date] : "2009"[Publication Date])'

# ESearch: retrieve all PubMed IDs matching the query
esearch = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=2001&maxdate=2010&retmode=xml&retmax=10000000&term=%s' % (query)
handle = urllib.urlopen(esearch)
data = handle.read()

root = ET.fromstring(data)
ids = [x.text for x in root.findall("IdList/Id")]
print 'Got %d articles' % (len(ids))

# EFetch the records 100 at a time and extract the PMID, publication year and author count
for group in chunker(ids, 100):
    efetch = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?&db=pubmed&retmode=xml&id=%s" % (','.join(group))
    handle = urllib.urlopen(efetch)
    data = handle.read()

    root = ET.fromstring(data)
    for article in root.findall("PubmedArticle"):
        pmid = article.find("MedlineCitation/PMID").text
        year = article.find("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year")
        if year is None: year = 'NA'
        else: year = year.text
        aulist = article.findall("MedlineCitation/Article/AuthorList/Author")
        print pmid, year, len(aulist)

With IDs, year and author counts in hand, a bit of R lets us visualize the distribution of author counts, over the whole decade and also by individual years.
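
For reference, here is a minimal R sketch of the sort of processing involved – the file name (authcounts.txt) is just whatever the Python script’s output was redirected to, and the plotting details are my own choices rather than exactly what generated the figures below.

## read the (pmid, year, author count) triples written by the Python script
d <- read.table("authcounts.txt", header = FALSE,
                col.names = c("pmid", "year", "nauthor"))

## drop the spurious records: pre-2001 years, missing years, zero-author entries
d <- subset(d, year >= 2001 & year <= 2010 & nauthor > 0)

## distribution of author counts over the whole decade
hist(d$nauthor, breaks = 50, xlab = "Authors per paper",
     main = "PubMed RNA papers, 2001-2010")

## distribution by year, excluding papers with more than 20 authors
par(mfrow = c(2, 5))
for (yr in sort(unique(d$year)))
  hist(subset(d, year == yr & nauthor <= 20)$nauthor,
       xlab = "Authors per paper", main = yr)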

The median author count is 4, but there are a number of papers with more than 15 authors and some individual papers with more than 35 authors on them. If we exclude papers with, say, more than 20 authors and view the distribution by year, we get the following set of histograms.

We can see that over the years the median number of authors on a paper is more or less constant at 4, increasing to 5 in 2009 and 2010. At the same time, the distribution does grow broader over the years, indicating that there is an increasing number of papers with larger author counts.

Anyway, this was a quick hack, and there are probably more rigorous ways to do this (such as using Web of Science – but automating that would be painful).

Written by Rajarshi Guha

September 15th, 2010 at 2:19 am

Posted in software


Pig and Cheminformatics

without comments

Pig is a platform for analyzing large datasets. At its core is a high-level language (called Pig Latin) that is focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure either in local or map/reduce modes (the latter making use of Hadoop).

Previously I had investigated Hadoop for running cheminformatics tasks such as SMARTS matching and pharmacophore searching. While the implementation of such code is pretty straightforward, it’s still pretty heavyweight compared to, say, performing SMARTS matching in a database via SQL. On the other hand, being able to perform these tasks in Pig Latin lets us write much simpler code that can be integrated with other, non-cheminformatics code in a flexible manner. An example of a Pig Latin script that we might want to execute is:

A = load 'medium.smi' as (smiles:chararray);
B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

The script loads a file containing SMILES strings, filters the entries that match the specified SMARTS pattern and writes the matching SMILES to an output file. Clearly, very similar to SQL. However, the above won’t work on a default Pig installation since SMATCH is not a built-in function. Instead we need to write a user defined function (UDF).

UDFs are implemented in Java and can be classified into one of three types: eval, aggregate or filter functions. For this example I’ll consider a filter function that takes two strings, representing a SMILES string and a SMARTS pattern, and returns true if the molecule contains the specified pattern.

package net.rguha.dc.pig;

import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

public class SMATCH extends FilterFunc {
    // Instantiate the SMARTS matcher and SMILES parser once, statically,
    // rather than on every call to exec(). The "C" pattern is just a
    // placeholder; the real query is set via setSmarts() for each tuple.
    static SMARTSQueryTool sqt;
    static {
        try {
            sqt = new SMARTSQueryTool("C");
        } catch (CDKException e) {
            System.out.println(e);
        }
    }
    static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

    public Boolean exec(Tuple tuple) throws IOException {
        // expect two fields per tuple: the target SMILES and the SMARTS query
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            sqt.setSmarts(query);
            IAtomContainer mol = sp.parseSmiles(target);
            return sqt.matches(mol);
        } catch (CDKException e) {
            throw WrappedIOException.wrap("Error in SMARTS pattern or SMILES string " + query, e);
        }
    }
}

A UDF for filtering must extend the FilterFunc class, which specifies a single method, exec. Within this method we check whether we have the requisite number of input arguments and, if so, simply return the value of the SMARTS match. For more details on filter functions see the UDF manual.

One of the key features of the code is the static initialization of the SMILES parser and SMARTS matcher. I’m not entirely sure how many times the UDF is instantiated during a query (once for each “row”? Once for the entire query?) – but if it’s more than once, we don’t want to instantiate the parser and matcher in the exec function. Note that since Hadoop is not a multithreaded model, we don’t need to worry about the lack of thread safety in the CDK.

Compiling the above class and packaging it into a jar file allows us to run the above Pig Latin script from the command line (you’ll have to register the jar at the beginning of the script by adding a line of the form register /path/to/myudf.jar):

$ pig-0.4.0/bin/pig -x local match.pig # runs in local mode
2010-09-09 20:37:00,107 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/rguha/src/java/hadoop-0.18.3/pig_1284079020107.log
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/Users/rguha/src/java/hadoop-0.18.3/output.txt"
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 9
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 0
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-09-09 20:37:39,278 [main] INFO  org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!

As with my previous Hadoop code, the above UDF can be deployed anywhere that Hadoop and HDFS are installed – such as Amazon. The code in this post (and for other Pig UDFs) is available from my repository (in the v18 branch) and is based on Pig 0.4.0 – which is pretty old, but is required to work with Hadoop 0.18.3.

Written by Rajarshi Guha

September 10th, 2010 at 2:21 am