So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘ons’ tag

Preprocessing ONS Solubility Data

With the semester winding up and preparations under way for the move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project, and one irritating thing is that each time I want to build a model (because new data has been submitted), I need to extract and clean the data from the Google spreadsheet. So I finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as noted by the DONOTUSE string) and optionally filter rows based on a specified string. This allows me to get the whole dataset in one go, or just the data for methanol, etc. Note that it doesn’t dump all the columns from the original spreadsheet, just the columns I need for modeling. It’s a very simplistic script that dumps the final data in tab-delimited format to STDOUT.

## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
    print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

-h, --help            This message
-a, --agg FUNC        Aggregate function. Valid values can be
                      'mean', 'median', 'min', 'max'.
                      Currently this is not supported
-f, --filter STRING   Only include rows that have a column that exactly
                      matches the specified STRING
-d, --dry             Don't output data, just report some stats
"""


url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':

    fstring = None
    agg = None
    dry = False

    solubilities = []

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
    except getopt.GetoptError:
        usage()
        sys.exit(-1)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(1)
        elif opt in ('-f', '--filter'):
            fstring = arg
        elif opt in ('-d', '--dry'):
            dry = True

    data = csv.reader(urllib.urlopen(url))
    data.next()  # skip the header row
    # Number rows from 2 (row 1 is the header) so that error messages
    # point at the right spreadsheet row even after a malformed row
    for c, row in enumerate(data, 2):
        line = [x.strip() for x in row]
        if len(line) != 25:
            print 'ERROR (Line %d) %s' % (c, ','.join(line))
            continue
        solubilities.append( (line[idx_solute_name],
                              line[idx_solute_smiles],
                              line[idx_solvent_name],
                              line[idx_conc],
                              line[idx_state]) )

    if dry:
        print 'Got %d entries' % (len(solubilities))
    solubilities = [ x for x in solubilities if x[0].find("DONOTUSE") == -1]
    if dry:
        print 'Got %d entries after filtering invalid entries' % (len(solubilities))

    if not dry:
        for row in solubilities:
            # With a filter, keep only rows where some column equals it exactly
            if fstring and fstring not in row:
                continue
            print '\t'.join([str(x) for x in row])
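
The aggregation option is listed in the usage text but is not implemented in the script. A minimal sketch of how it might work, written for Python 3 (the script itself is Python 2) and assuming that rows are the five-element tuples built above and that replicate measurements are grouped by solute SMILES and solvent. The names `aggregate` and `AGG_FUNCS` are hypothetical and not part of solsum.py.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical mapping from the --agg option values to functions
AGG_FUNCS = {"mean": mean, "median": median, "min": min, "max": max}

def aggregate(rows, func_name):
    """Collapse replicate measurements into one value per (SMILES, solvent).

    rows: iterable of (solute_name, smiles, solvent, conc, state) tuples,
          with conc still a string as read from the CSV.
    Returns a list of (solute_name, smiles, solvent, aggregated_conc).
    """
    func = AGG_FUNCS[func_name]
    groups = defaultdict(list)
    names = {}
    for name, smiles, solvent, conc, state in rows:
        try:
            value = float(conc)
        except ValueError:
            # Assumption: non-numeric concentrations are simply skipped
            continue
        key = (smiles, solvent)
        groups[key].append(value)
        names.setdefault(key, name)  # keep the first name seen for the group
    return [(names[k], k[0], k[1], func(v)) for k, v in groups.items()]

if __name__ == "__main__":
    demo = [
        ("benzoic acid", "OC(=O)c1ccccc1", "methanol", "2.0", "solid"),
        ("benzoic acid", "OC(=O)c1ccccc1", "methanol", "4.0", "solid"),
        ("benzoic acid", "OC(=O)c1ccccc1", "ethanol", "1.5", "solid"),
    ]
    for row in aggregate(demo, "mean"):
        print("\t".join(str(x) for x in row))
```

The solute state column is dropped here, since replicates could in principle disagree on it; whether that is the right choice depends on how the model uses that column.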

Written by Rajarshi Guha

April 14th, 2009 at 8:40 pm

Posted in software

Substructure Matching, REST style

I’ve been putting up a number of REST services for a variety of cheminformatics tasks. One that was missing was substructure searching. In many scenarios it’s useful to be able to check whether a target molecule contains a query substructure or not. This can now be done by visiting URLs of the form

http://rguha.ath.cx/~rguha/cicc/rest/substruct/TARGET/QUERY

where TARGET and QUERY are SMILES and SMARTS (or SMILES) respectively (appropriately escaped). If the query pattern is found in the target molecule then the resultant page contains the string “true” otherwise it contains the string “false”. The service uses OpenBabel to perform the SMARTS matching.
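
As a rough illustration of what "appropriately escaped" means, here is a Python 3 sketch of a client for the GET form of the service. The base URL is the one given above; the function names are made up for this example, and it assumes that the server decodes percent-encoded path segments and that the response body is plain text containing "true" or "false". The service may no longer be reachable.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://rguha.ath.cx/~rguha/cicc/rest/substruct"

def build_substruct_url(target, query):
    """Build the GET URL for a target SMILES and a query SMARTS/SMILES.

    safe="" forces characters such as '/', '#', '=', '(' and '[' to be
    percent-encoded so they cannot be mistaken for URL syntax.
    """
    return "%s/%s/%s" % (BASE_URL, quote(target, safe=""), quote(query, safe=""))

def contains_substructure(target, query, timeout=10):
    """Return True if the service reports a match (performs an HTTP request)."""
    with urlopen(build_substruct_url(target, query), timeout=timeout) as response:
        body = response.read().decode("utf-8", "replace")
    # Assumption: the page contains the string "true" only on a match
    return "true" in body

if __name__ == "__main__":
    # Acetic acid as the target, "any carbon atom" as the query pattern
    print(build_substruct_url("CC(=O)O", "[#6]"))
```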

Using this service, I updated the ONS data query page to allow one to filter results by SMARTS patterns. This generally only makes sense when no specific solute is selected. However, filtering all the entries in the spreadsheet (i.e., any solvent, any solute) can be slow, since each molecule is matched against the SMARTS pattern using a separate HTTP request. This could be easily fixed using POST, but it’s a hack anyway, since this type of thing should probably be done in the database (i.e., Google Spreadsheet).

Update

The substructure search service is now updated to accept POST requests. As a result, it is possible to send in multiple SMILES strings and match them against a pattern in one go. See the repository for a description of how to use the POST method. (The GET method is still supported, but you can only match a pattern against one target SMILES.) As a result, querying the ONS data using SMARTS patterns is significantly faster.

Written by Rajarshi Guha

February 3rd, 2009 at 6:01 pm

ONS Solubility Predictions

Using the model deployment and prediction service, I put up the two linear regression models I had built so far (described in more detail here). While REST is nice, a simple web page that allows you to paste a set of SMILES and get back predictions is handy. So I whipped together a simple interface to the prediction service, allowing one to select a model, view the author-generated description and get a nice (sortable!) table of predicted values. View it here. As noted in my previous post, it’s not going to be very fast, but hopefully that will change in the future.

Written by Rajarshi Guha

January 14th, 2009 at 9:31 pm

Posted in software

Live ONS Solubility Queries

In a previous post, I described a simple web form to query and visualize the solubility data being generated as part of the ONS Challenge. The previous approach required me to manually download the data and load it into a Postgres database. While trivial from a coding point of view, it’s a pain since I have to keep my local DB in sync with the Google Docs spreadsheet.

However, Google comes to the rescue with their Query API, which allows us to view the spreadsheet as a table that can be queried using an SQL-like language. As a result, I can ditch the whole local database and simply have an HTML page constructed using Javascript, which performs queries directly on the solubility spreadsheet.
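
The page itself issues its queries from Javascript, but the same kind of query can be expressed as a plain URL. A rough Python 3 sketch, assuming the `gviz/tq` endpoint that Google Sheets exposes today (the endpoint in use in 2008 may have looked different), a placeholder spreadsheet key, and column letters D, F and H for solute, solvent and concentration. The column letters and the function name are assumptions for illustration, not taken from the actual query page.

```python
from urllib.parse import urlencode

# Assumed endpoint; SPREADSHEET_KEY below is a placeholder, not a real key
ENDPOINT = "https://docs.google.com/spreadsheets/d/%s/gviz/tq"

def build_query_url(key, solvent, gid=0):
    """Build a URL asking for solute, solvent and concentration for one solvent.

    The 'tq' parameter carries the SQL-like query; 'tqx=out:csv' asks for
    CSV output instead of the default JSON response.
    """
    if "'" in solvent or '"' in solvent:
        # Keep the sketch simple: refuse values that would need quoting rules
        raise ValueError("solvent name must not contain quotes")
    tq = "select D, F, H where F = '%s'" % solvent
    params = urlencode({"tq": tq, "tqx": "out:csv", "gid": gid})
    return (ENDPOINT % key) + "?" + params

if __name__ == "__main__":
    print(build_query_url("SPREADSHEET_KEY", "methanol"))
```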

This is very nice, since I no longer have to maintain a local DB and ensure that it’s in sync with Jean-Claude’s results. Of course, there are some drawbacks to this method. First, the query page assumes that the data in the spreadsheet is clean. So if there are two entries called “Ethanol” and “ethanol”, they will be considered separate solvents. Second, this approach cannot be used to include cheminformatics in the queries, since Google doesn’t support that functionality. Finally, it’s not going to be very good for large spreadsheets.
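
To make the first drawback concrete: the “Ethanol” versus “ethanol” problem goes away if names are normalized before grouping, which is the kind of cleanup a local copy of the data allows. A small Python 3 sketch with made-up function names, assuming that case and stray whitespace are the only differences that matter:

```python
from collections import defaultdict

def normalize_solvent(name):
    """Lower-case a solvent name and collapse runs of whitespace."""
    return " ".join(name.split()).lower()

def group_by_solvent(rows):
    """Group (solvent, concentration) pairs under the normalized solvent name."""
    groups = defaultdict(list)
    for solvent, conc in rows:
        groups[normalize_solvent(solvent)].append(conc)
    return dict(groups)

if __name__ == "__main__":
    rows = [("Ethanol", 1.0), ("ethanol ", 2.0), ("THF", 3.0)]
    print(group_by_solvent(rows))
```

This does nothing for genuine synonyms (say, “THF” versus “tetrahydrofuran”); those would need a lookup table or a structure-based identifier.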

However, this is a very nice API, that allows one to elegantly integrate web applications with live data. I heart Google!

Written by Rajarshi Guha

November 6th, 2008 at 8:01 pm

Solubility Queries and the Google Visualization API

There was a FriendFeed discussion on the use of RDF triples for representing the solubility data generated by Jean-Claude and others as part of the ONS Solubility Challenge. Part of the discussion revolved around letting RDF novices easily perform queries of the data being collected. Not knowing much about RDF, I took the raw data from the Google Docs and loaded it into a Postgres database and whipped up a simple query form.

The DB and form are nothing remarkable. But what is cool is that the Google Visualization API makes it really easy for me to include charts and other visualizations. For example, if you select “any” as the solvent and then select a solute, the form creates a table of solubilities of that solute in all the solvents it was measured in. A natural view of the data is to look at a bar chart of the solubilities across the various solvents.

Since my form is built using mod_python, it’s a simple matter to write out the Javascript to call the Google API. After some boilerplate code, all that needs to be done is to create a DataTable object, set the column types and names and then populate it. See here for example code, which I modified.

var data = new google.visualization.DataTable();
data.addColumn('string', 'Solvent');
data.addColumn('number', 'Conc (M)');
data.addRows(5);
data.setValue(0, 0, 'thf');
data.setValue(0, 1, 1.23);
data.setValue(1, 0, 'acetonitrile');
data.setValue(1, 1, 2.34);

Once you have the data all stored, some more boilerplate code allows us to easily insert the chart into the final web page. Very neat!
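
On the server side, writing out that Javascript from a list of query results is only a few lines of Python. A minimal sketch in Python 3 (not the actual mod_python handler; the function name `datatable_js` is invented for this example), assuming the results arrive as (solvent, concentration) pairs:

```python
import json

def datatable_js(rows, var_name="data"):
    """Return Javascript that fills a google.visualization.DataTable.

    rows: sequence of (solvent_name, concentration) pairs.
    json.dumps is used to quote the strings, so names containing quotes
    or backslashes still produce valid Javascript string literals.
    """
    lines = [
        "var %s = new google.visualization.DataTable();" % var_name,
        "%s.addColumn('string', 'Solvent');" % var_name,
        "%s.addColumn('number', 'Conc (M)');" % var_name,
        "%s.addRows(%d);" % (var_name, len(rows)),
    ]
    for i, (solvent, conc) in enumerate(rows):
        lines.append("%s.setValue(%d, 0, %s);" % (var_name, i, json.dumps(solvent)))
        lines.append("%s.setValue(%d, 1, %s);" % (var_name, i, json.dumps(float(conc))))
    return "\n".join(lines)

if __name__ == "__main__":
    print(datatable_js([("thf", 1.23), ("acetonitrile", 2.34)]))
```

Here the row count passed to `addRows` is taken from the data, so the table has no empty rows.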

(Of course, since these queries do not involve chemistry / cheminformatics, I could skip Python and Postgres and simply do the whole thing in Javascript, querying the Google Docs spreadsheet directly. This means that the results from the form would always be in sync with the Google Doc, but that’s for another evening)

Written by Rajarshi Guha

November 6th, 2008 at 3:39 am

Posted in software,visualization
