So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for April, 2010

RNAi in PubChem

without comments

While considering ways to disseminate RNAi screening data, I found out that PubChem now contains two RNAi screening datasets – AIDs 1622 and 1904. These screens reuse the PubChem bioassay formats – which is both good and bad. For example, while there are a few standardized columns (such as PUBCHEM_ACTIVITY_SCORE), the bulk of the user deposited columns are not formally defined. In other words, you’d have to read the assay description. While not a huge deal, it would be nice if we could use pre-existing formats such as MIARE, analogous to MIAME for microarray data. That way we could determine the number of replicates, normalization method employed and other details of the screen. As far as I can tell, all aspects of an RNAi screen are still not fully defined in the MIARE vocabulary, and there don’t seem to be a whole lot of examples. But it’s a start.

But of course, nothing is perfect. Why, oh why, would a tab delimited format be contained within multiple worksheets of an Excel workbook!
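To get a feel for how much of a PubChem assay table is covered by the standardized columns, a rough sketch along the following lines could help. It assumes a tab-delimited table whose standardized columns carry the PUBCHEM_ prefix; the sample rows and the two replicate columns are made up for illustration and are not taken from AIDs 1622 or 1904.

```python
import csv
import io

# Hypothetical excerpt of a tab-delimited assay table. The PUBCHEM_* names are
# standard PubChem columns; the last two columns are invented for illustration.
SAMPLE = (
    "PUBCHEM_SID\tPUBCHEM_ACTIVITY_OUTCOME\tPUBCHEM_ACTIVITY_SCORE\t"
    "replicate_1\treplicate_2\n"
    "1001\t2\t85\t0.12\t0.15\n"
    "1002\t1\t10\t0.97\t1.02\n"
)

def split_columns(handle):
    """Return (standard, deposited) column names from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    standard = [c for c in header if c.startswith("PUBCHEM_")]
    deposited = [c for c in header if not c.startswith("PUBCHEM_")]
    return standard, deposited

standard, deposited = split_columns(io.StringIO(SAMPLE))
print("standardized:", standard)
print("needs the assay description:", deposited)
```

Everything in the second list is something you would have to look up in the free-text assay description.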

Written by Rajarshi Guha

April 19th, 2010 at 12:19 am

Posted in bioinformatics


Automating ChemDraw

without comments

I’ve been working on a project in which I needed to generate logP values using ChemDraw 12, for thousands of molecules. Since I didn’t have access to the ChemScript module, I needed a way to automate this procedure. After fiddling around with Visual Studio and various debuggers, I came across the Windows Application Testing Using Python (WATSUP). This is a set of Python methods, built on top of the win32api package that allows one to interact with a Windows GUI programmatically. Thus one can identify a top level window, get a specific control (button, combo box etc) and click on buttons, menu items and so on.

After a little digging around (and liberal use of the dumpWindow() function in WATSUP), I was able to put together a simple bit of code that would load an SDF (containing a single structure) and save it as a CDXML file. For this to work, I make sure that the ChemDraw application is running and the “Chemical Properties” window is visible. On loading an SDF, the chemical properties (stuff like logP, MR, melting point etc) get computed automatically. We then “click” the paste button and then save to CDXML format. In the resultant CDXML file, the chemical property values are included – which can then be easily extracted using a regex. Here’s the code:

import win32com.client
import win32api
import sys, os, time, datetime, glob, pprint
from watsup.winGuiAuto import *
from watsup.launcher import launchApp,terminateApp

def getButtonControl(toplevel, buttonText):
    elems = dumpWindow(toplevel)
    openButtonHwnd = None
    for elem in elems:
        if len(elem) != 3: continue
        hwnd, name, klass = elem
        if name == buttonText and klass == 'Button':
            openButtonHwnd = hwnd
            break
    openButton = findControls(toplevel, wantedClass="Button", wantedText=buttonText)
    openButton = openButton[openButton.index(openButtonHwnd)]
    return openButton

def setFileName(toplevel, filename):
    fnameCombo = findControl(toplevel, wantedClass='ComboBoxEx32')
    fnameComboEdit = findControl(fnameCombo, wantedClass = "Edit")
    setEditText(fnameComboEdit, filename)
       
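def getOutputfileName(filename):
    ## assumed helper (not shown in the original listing):
    ## swap the input file's extension for .cdxml
    return os.path.splitext(filename)[0] + ".cdxml"

filename = "c:\\work\\test.sdf"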

form=findTopWindow(wantedText='ChemBioDraw Ultra')
mainMenu = getTopMenu(form)

# click the File->Open menu option
activateMenuItem(form, ('file', 'open'))
time.sleep(0.5)

## OK, we get the file selection window
openWindow = findTopWindow(wantedText='Open')

## get the file type combo box and select the SDF format
ftypeCombo = findControls(openWindow, wantedClass = "ComboBox")    
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 14)

## get the filename combo box and set it
setFileName(openWindow, filename)

## get the open button, click it and get the file
openButton = getButtonControl(openWindow, "&Open")
clickButton(openButton)
time.sleep(1)

## get the properties window and then paste in the
## chemical properties that are autocalculated
propWindow = findTopWindow(wantedText = "Chemical Properties", retryInterval = 0.1, maxWait = 5)
pasteButton = findControl(propWindow, wantedClass="Button", wantedText="Paste")
clickButton(pasteButton)

## now save the file as a cdxml
newfilename = getOutputfileName(filename)
activateMenuItem(form, ("file", "save as"))
time.sleep(0.5)
saveWindow = findTopWindow(wantedText='Save As')
setFileName(saveWindow, newfilename)

## set file type
ftypeCombo = findControls(saveWindow, wantedClass = "ComboBox")
if len(getComboboxItems(ftypeCombo[0])) > 10: ftypeCombo = ftypeCombo[0]
elif len(getComboboxItems(ftypeCombo[1])) > 10: ftypeCombo = ftypeCombo[1]
else: ftypeCombo = ftypeCombo[2]
selectComboboxItem(ftypeCombo, 1)

## get the save button and click it to save the file
saveButton = getButtonControl(saveWindow, "&Save")
clickButton(saveButton)
time.sleep(1)

## looks like we have to do a save and then we close
activateMenuItem(form, ['file', 'save'])
activateMenuItem(form, ['file', 'close'])

Obviously this approach is an inelegant hack and is dog slow – approximately 2.5 sec per structure. But in the absence of anything else, it gets the job done.
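The regex extraction step is not shown in the listing. A minimal sketch of it follows, assuming the pasted properties end up in the CDXML as plain text such as "Log P: 2.13"; I have not checked the exact layout ChemDraw 12 writes, so the pattern, the function name extract_logp and the sample fragment are all illustrative and will likely need adjusting.

```python
import re

# Assumption: the pasted property block appears in the CDXML as plain text
# such as "Log P: 2.13". The \b keeps the pattern from matching "CLogP:".
LOGP_PATTERN = re.compile(r"\bLog\s*P:\s*(-?\d+(?:\.\d+)?)")

def extract_logp(cdxml_text):
    """Return the first logP value found in the text as a float, or None."""
    match = LOGP_PATTERN.search(cdxml_text)
    return float(match.group(1)) if match else None

# Made-up fragment standing in for the contents of a saved CDXML file
sample = (
    '<t><s font="3" size="10">Boiling Point: 523.9 [K]\n'
    "Log P: 2.13\n"
    "MR: 45.6 [cm3/mol]</s></t>"
)
print(extract_logp(sample))
```

For a batch of files, the same function can be applied to the contents of each CDXML file in turn.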

While implementing this solution there were a few quirks. For example, the widgets contained within a window represent a hierarchy. The findControls() method traverses this hierarchy, but does not return the controls of a specific type in the same order on consecutive runs. So to find the appropriate combo box (say for the file type) I need to do some extra work (rather than just going with the first of the three combo boxes that are located on an Open dialog). One contributing factor to the slowness is that I needed to insert a few sleep statements here and there, to ensure that the proper windows showed up before I started setting values in the various widgets. Finally, for some reason I had to do a “Save As” followed by a “Save” to get the final CDXML file with all the computed properties.
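One way to trim the fixed sleeps would be to poll until the window or control actually shows up. The sketch below is a generic helper, not part of WATSUP; the name wait_for and the stand-in probe are hypothetical, and I have not measured whether it speeds up the ChemDraw script.

```python
import time

def wait_for(probe, timeout=5.0, interval=0.1):
    """Call probe() until it returns a truthy value or timeout seconds elapse.

    probe should return a falsy value, or raise an exception, while the thing
    being waited for is not yet available.
    """
    deadline = time.time() + timeout
    while True:
        try:
            result = probe()
            if result:
                return result
        except Exception:
            pass  # treat lookup errors as "not ready yet"
        if time.time() >= deadline:
            raise RuntimeError("timed out after %.1f s" % timeout)
        time.sleep(interval)

# Stand-in probe for demonstration: succeeds on the third call
calls = {"n": 0}

def fake_probe():
    calls["n"] += 1
    return "window-handle" if calls["n"] >= 3 else None

result = wait_for(fake_probe, timeout=2.0, interval=0.01)
print(result)
```

In the script the probe would be something like a lambda wrapping findTopWindow(wantedText='Open'). The listing already passes retryInterval and maxWait to findTopWindow for the properties window, so using those arguments on the other lookups may be the simpler route.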

Written by Rajarshi Guha

April 13th, 2010 at 6:21 pm

Posted in software


What Has Cheminformatics Done for You Lately?

with 3 comments

Recently there have been two papers asking whether cheminformatics, or virtual screening in general, has really helped drug discovery in terms of lead discovery.

The first paper, from Muchmore et al, focuses on the utility of various cheminformatics tools in drug discovery. Their report is retrospective in nature: they note that while much research has gone into developing descriptors and predictors of various molecular properties (solubility, bioavailability etc), it does not seem that this has contributed to increased productivity. They suggest three possible reasons for this:

  • not enough time to judge the contributions of cheminformatics methods
  • methods not being used properly
  • methods themselves not being sufficiently accurate.

They then go on to consider how these reasons may apply to various cheminformatics methods and tools that are accessible to medicinal chemists. Examples range from molecular weight and ligand efficiency to solubility, similarity and bioisosteres. They use a 3-class scheme – known knowns, known unknowns and unknown unknowns – corresponding to methods whose underlying principles are understood and whose results can be robustly interpreted, methods for properties that we don’t know how to realistically evaluate (but may still attempt to, as with solubility) and methods for which we can get a numerical answer but whose meaning or validity is doubtful. Thus, for example, ligand binding energy calculations are placed in the “unknown unknown” category and similarity searches are placed in the “known unknown” category.

It’s definitely an interesting read, summarizing the utility of various cheminformatics techniques. It raises a number of interesting questions and issues. For example, a recurring issue is that many cheminformatics methods are ultimately subjective, even though the underlying implementation may be quantitative – “what is a good Tanimoto cutoff?” in similarity calculations would be a classic example.  The downside of the article is that it does appear at times to be specific to practices at Abbott.
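The Tanimoto cutoff question is easy to see with a toy calculation. The sketch below uses the usual definition (shared on-bits divided by the bits set in either fingerprint) on made-up fingerprints; the bit sets and molecule names are invented, and the two cutoffs are arbitrary choices, which is rather the point.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / float(len(a) + len(b) - shared)

# Made-up fingerprints (sets of bit positions), not derived from real molecules
query = {1, 2, 3, 4, 5, 6, 7, 8}
library = {
    "mol_A": {1, 2, 3, 4, 5, 6, 7, 9},
    "mol_B": {1, 2, 3, 4, 5, 10, 11},
    "mol_C": {1, 2, 12, 13, 14, 15},
}

scores = {name: tanimoto(query, fp) for name, fp in library.items()}

# The same scores give different hit lists depending on the cutoff chosen
for cutoff in (0.4, 0.7):
    hits = sorted(name for name, s in scores.items() if s >= cutoff)
    print(cutoff, hits)
```

The calculation itself is fully quantitative; whether mol_B counts as “similar” depends entirely on which cutoff you decided on beforehand.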

The second paper, by Schneider, is more prospective and general in nature and discusses some reasons why virtual screening has not played a more direct role in drug discovery projects. One of the key points that Schneider makes is that

appropriate “description of objects to suit the problem” might be the key to future success

In other words, it may be that molecular descriptors, while useful surrogates of physical reality, are probably not sufficient to get us to the next level. Schneider even states that “… the development of advanced virtual screening methods … is currently stagnated”. This statement is true in many ways, especially if one considers the statistical modeling side of virtual screening (i.e., QSAR). Many recent papers discuss slight modifications to well known algorithms that invariably lead to an incremental improvement in accuracy. Schneider suggests that improvements in our understanding of the physics of the drug discovery problem – protein folding, allosteric effects, dynamics of complex formation, etc – rather than continuing to focus on static properties (logP etc) will lead to advances. Another very valid point is that future developments will need to move away from the prediction or modeling of “… one to one interactions between a ligand and a single target …” and instead will need to consider “… many to many relationships …”. In other words, advances in virtual screening will address (or need to address) ligand non-specificity or promiscuity. Thus activity profiles, network models and polypharmacology will all be vital aspects of successful virtual screening.

I really like Schneider’s views on the future of virtual screening, even though they are rather general. I agree with his views on the stagnation of machine learning (QSAR) methods but at the same time I’m reminded of a paper by Halevy et al, which highlights the fact that

simple models and a lot of data trump more elaborate models based on less data

Now, they are talking about natural language processing using trillion-word corpora. Not exactly the situation we face in drug discovery! But it does look like we’re slowly going in the direction of generating biological datasets of large size and of multiple types. A recent NIH RFP proposes this type of development. Coupled with well established machine learning methods, this could lead to some very interesting developments. (Of course even ‘simple’ properties such as solubility could benefit from a ‘large data’ scenario, as noted by Muchmore et al.)

Overall, two interesting papers looking at the state of the field from different views.

Written by Rajarshi Guha

April 5th, 2010 at 4:33 am

CDKDescUI Updates – DnD & Batch Mode

with 2 comments

I’ve put out an updated version (1.0.1) of the CDK descriptor calculator that now supports drag ‘n drop of the input file – just drag an appropriate file onto the UI and the input file text field should be automatically populated. In addition, all file dialogs let OS X users specify a file name manually.

The current version also supports a frequently requested command line batch mode. It’s a little limited compared to the GUI since you can’t specify individual descriptors, only descriptor categories (such as ‘all’, ‘topological’ etc) and the only output format is tab delimited.

$ java -jar CDKDescUI.jar -h

usage: cdkdescui [OPTIONS] inputfile
                 
 -b    Batch mode
 -h    Help
 -o    Output file
 -t    Descriptor type: all, topological, geometric, constitutional,
       electronic, hybrid
 -v    Verbose output

CDKDescUI v1.0.1 Rajarshi Guha <rajarshi.guha@gmail.com>

By default, output is dumped to output.txt and all descriptors are evaluated. If errors occur for a given molecule and descriptor, they are reported at the end (i.e., the program continues rather than stopping).
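If you want to drive the batch mode from a script, a small wrapper along these lines might do. It is a sketch that assumes -t and -o take their value as the next argument, which the help text suggests but does not spell out; the function names and the file names molecules.sdf and topo.txt are placeholders.

```python
import subprocess

def batch_command(input_file, output_file="output.txt", descriptor_type="all",
                  jar="CDKDescUI.jar"):
    """Build the argument list for a CDKDescUI batch run.

    Assumes -t and -o are each followed by their value.
    """
    return ["java", "-jar", jar, "-b",
            "-t", descriptor_type,
            "-o", output_file,
            input_file]

def run_batch(input_file, **kwargs):
    """Run the batch job; needs Java and the jar in the working directory."""
    subprocess.check_call(batch_command(input_file, **kwargs))

cmd = batch_command("molecules.sdf", "topo.txt", "topological")
print(" ".join(cmd))
```

Calling run_batch("molecules.sdf", descriptor_type="topological") would then execute the same command and raise an error if the program exits with a non-zero status.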

Written by Rajarshi Guha

April 4th, 2010 at 5:01 pm

Posted in software


CDKDescUI Update

without comments

I’ve put out a new version (0.98) of the CDK descriptor calculator interface which uses the latest CDK master and also updates the save dialog for the descriptor selections to let the user specify a file name.

Written by Rajarshi Guha

April 3rd, 2010 at 3:19 pm

Posted in software
