Spreading the Word About R & Cheminformatics

These last few days I’ve been in the UK for an EBI workshop on cheminformatics in R. It was a two day workshop, the first day focusing on general cheminormatics in R using the rcdk and rpubchem packages, and the second day focusing on doing mass spectrometry in R using XCMS and Rdisop, run by Steffen Neumann and Paul Benton. It was an excellent workshop with participation from industry and academia and skill levels ranging from new R users to experts and people with minimal cheminformatics backgrounds to full time cheminformaticians. While I think my exercises might have been a little too difficult, I think we were able to cover a variety of topics ranging from details on how to do specific cheminformatics operations in R to more application oriented tasks such as fingerprint based analysis and benchmarking virtual screening methods. The slides from the workshop are available here – it’s a pretty big slide deck and covers some introductory R (there are some mistakes in that section which I will update in the coming days), and overview of the CDK and then sections on usage and applications of the rcdk and rpubchem packages. It certainly helped that I had a very friendly audience! During the course of the workshop I also learned a few things about R (thanks to Tobias Verbeke and Steffen). Given that about 40 people or so were exposed to the rcdk package, my (known) user base should hopefully increase :) It was nice to get a patch from Tobias during the workshop, which will be incorporated once I’m back home. It was also great to meet a number of people with whom I’d only had email or FriendFeed exchanges with in the past – including Chris Swain, Mark Rijnbeek, Duncan Hull, Nico Adams (though I didn’t realize it was him when I was speaking to him – sorry Nico!), Duan Lian and Syed Asad Rahman. I also got to briefly meet some of the ChEMBL folks (John and Patricia). Monday night we had a lovely workshop dinner at The Cricketer (Clavering). Many thanks to Gabriella Rustici and Dominic Clark for organizing this and inviting me to run the first day. The only downside of this trip? It was too short :) It would’ve been great to be able to stay a day or two more to have longer discussions with various groups.

In addition to the workshop, I visited Asad and his family in Cambridge for a fantastic dinner and much useful discussion. He’s done some excellent work on SMSD and showed me some of his recent work on enzyme classification and reaction mappings. I won’t say much more as he’s writing this up, except to say that it was quite impressive and I’m eagerly looking forward to seeing the writeups. Hopefully we’ll be able to do some joint work in the near future. Given the speed up that SMSD provides for graph isomorphism, I’m in the process of updating the CDK SMARTS parser to make use of it rather than the older UIT, which should improve SMARTS matching considerably. Down the road, the pharmacophore matching code will get a similar upgrade.

I was also able to squeeze in a day trip up to Harrogate, where I grew up. It was fun to see familiar streets and places after 23 years or so. It certainly didn’t hurt to also have some pretty amazing traditional English fare (Yorkshire curd tart at Bettys and the fish ‘n chips at Graveleys was fantastic).

2D Depictions in R Plots

In preparation for the upcoming R workshop at the EBI, I’ve been cleaning up the rcdk package and updating some features. One of the new features is the ability to get a 2D depiction as a raster image. Uptil now, 2D depictions were drawn in a Swing window – this allowed you to resize the window but not much else. You really couldn’t use it for anything else but viewing.

However, R-2.11.0 provides a new function called rasterImage, which overlays a raster image onto a pre-existing plot. It turns out that the png package lets me easily create such a raster image, either from a PNG file or from a vector of bytes. Given a molecules, we can get the byte array of its PNG representation via the view.image.2d function in the latest rcdk. As a result, you can now make a plot and then overlay a 2D depiction within the plot area. For example to get the picture shown alongside, we could do:

1
2
3
4
5
library(rcdk)
m <- parse.smiles("C1CC2CC1C(=O)NC2")
img <- view.image.2d(m, 200,200)
plot(1:10, pch=19)
rasterImage(img, 2,6, 6,10)

The latest version of rcdk and rpubchem is not on CRAN yet, but you can get source packages for OS X & Linux and binary packages for Windows at http://rguha.net/rcdk. Note that the latest version of rcdk requires R-2.11.0 along with rJava, rcdklibs, fingerprint and png as dependencies. If you’re interested in contributing check out the git repository.

rcdk On GitHub

I’ve been working on the rcdk and rcdklibs R packages that integrate the CDK into the R environment. For the past year or two it’s been hosted on r-forge.r-project.org which has a nice feature of nightly builds of R packages. Unfortunately their version control system is Subversion. Having used Git for my other projects, this was getting painful. So rcdk and rcdklibs are now hosted at Github as the cdkr project. This makes development much smoother, but I do loose the automated build feature. Sometime in the future, I’ll set up the repository to provide direct links to the packages (though they will also be available on CRAN)

RNAi in PubChem

While considering ways to disseminate RNAi screening data, I found out that PubChem now contains two RNAi screening datasets – AIDs 1622 and 1904. These screens reuse the PubChem bioaassay formats – which is both good and bad. For example, while there are a few standardized columns (such as PUBCHEM_ACTIVITY_SCORE), the bulk of the user deposited columns are not formally defined. In other words, you’d have to read the assay description. While not a huge deal, it would be nice if we could use pre-existing formats such as MIARE, analogous to MIAME for microarray data. That way we could determine the number of replicates, normalization method employed and other details of the screen. As far as I can tell all aspects an RNAi screen are still not fully defined in the MIAME vocabulary, and there don’t seem to be a whole lot of examples. But it’s a start.

But of course, nothing is perfect. Why, oh why, would a tab delimited format be contained within multiple worksheets of an Excel workbook!

Automating ChemDraw

I’ve been working on a project in which I needed to generate logP values using ChemDraw 12, for thousands of molecules. Since I didn’t have access to the ChemScript module, I needed a way to automate this procedure. After fiddling around with Visual Studio and various debuggers, I came across the Windows Application Testing Using Python (WATSUP). This is a set of Python methods, built on top of the win32api package that allows one to interact with a Windows GUI programmatically. Thus one can identify a top level window, get a specific control (button, combo box etc) and click on buttons, menu items and so on.

After a little digging around (and liberal use of the dumpWindow() function in WATSUP), I was able to put together a simple bit of code that would load an SDF (containing a single structure) and save it as a CDXML file. For this to work, I make sure that the ChemDraw application is running and the “Chemical Properties” window is visible. On loading an SDF, the chemical properties (stuff like logP, MR, melting point etc) get computed automatically. We then “click” the paste button and then save to CDXML format. In the resultant CDXML file, the chemical property values are included – which can then be easily extracted using a regex. Here’s the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
import win32com.client
import win32api
import sys, os, time, datetime, glob, pprint
from watsup.winGuiAuto import *
from watsup.launcher import launchApp,terminateApp

def getButtonControl(toplevel, buttonText):
    elems = dumpWindow(toplevel)
    openButtonHwnd = None
    for elem in elems:
        if len(elem) != 3: continue
        hwnd, name, klass = elem
        if name == buttonText and klass == 'Button':
            openButtonHwnd = hwnd
            break
    openButton = findControls(toplevel, wantedClass="Button", wantedText=buttonText)
    openButton = openButton[openButton.index(openButtonHwnd)]
    return openButton

def setFileName(toplevel, filename):
    fnameCombo = findControl(toplevel, wantedClass='ComboBoxEx32')
    fnameComboEdit = findControl(fnameCombo, wantedClass = "Edit")
    setEditText(fnameComboEdit, filename)
       
filename = "c:\\work\\test.sdf"

form=findTopWindow(wantedText='ChemBioDraw Ultra')
mainMenu = getTopMenu(form)

# click the File->Open menu option
activateMenuItem(form, ('file', 'open'))
time.sleep(0.5)

## OK, we get the file selection window
openWindow = findTopWindow(wantedText='Open')

## get the file type combo box and select the SDF format
ftypeCombo = findControls(openWindow, wantedClass = "ComboBox")    
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 14)

## get the filename combo box and set it
setFileName(openWindow, filename)

## get the open button, click it and get the file
openButton = getButtonControl(openWindow, "&Open")
clickButton(openButton)
time.sleep(1)

## get the properties window and then paste in the
## chemical properties that are autocalculated
propWindow = findTopWindow(wantedText = "Chemical Properties", retryInterval = 0.1, maxWait = 5)
pasteButton = findControl(propWindow, wantedClass="Button", wantedText="Paste")
clickButton(pasteButton)

## now save the file as a cdxml
newfilename = getOutputfileName(filename)
activateMenuItem(form, ("file", "save as"))
time.sleep(0.5)
saveWindow = findTopWindow(wantedText='Save As')
setFileName(saveWindow, newfilename)

## set file type
ftypeCombo = findControls(saveWindow, wantedClass = "ComboBox")
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 1)

## get the save button and click it to save the file
saveButton = getButtonControl(saveWindow, "&Save")
clickButton(saveButton)
time.sleep(1)

## looks like we have to do a save and then we close
activateMenuItem(form, ['file', 'save'])
activateMenuItem(form, ['file', 'close'])

Obviously this approach is an inelegant hack and is dog slow – approximately 2.5 sec per structure. But in the absence of anything else, it gets the job done.

While implementing this solution there were a few quirks. For example, the widgets contained with in window, represent a hierarchy. The findControls() method traverses this hierarchy, but does not return the controls of the specific type in the same order on consecutive runs. So to find the appropriate combo box (say for the file type) I need to do some extra work (rather than just going with the first of three combo boxes that are located on a Open dialog). One contributing factor to the slowness is that I needed to insert a few sleep statements here and there, to ensure that the proper windows showed up before I started setting values in the various widgets. Finally, for some reason I had to do a “Save As” followed by a “Save” to get the final CDXML file with all the computed properties.