Archive for the ‘ons’ tag
With the semester winding up and preparing to move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So, finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as noted by the DONOTUSE string) and optionally filter rows based on a specified string. This allows me to get the whole dataset at one go, or just the data for methanol etc. Note that it doesn’t dump all the columns from the original spreadsheet – just the columns I need for modeling. A very simplistic script that dumps the final data in tab-delimited format to STDOUT.
## Rajarshi Guha
## Update 04/20/2009 - include solute state column
Usage: solsum.py [OPTIONS]
Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.
The final data table is output as tab separated onto STDOUT.
OPTIONS can be
-h, --help This message
-a, --agg FUNC Aggregate function. Valid values can be
'mean', 'median', 'min', 'max'.
Currently this is not supported
-f, --filter STRING Only include rows that have a column that exactly
matches the specified STRING
-d, --dry Don't output data, just report some stats
url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'
idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24
if __name__ == '__main__':
fstring = None
agg = None
dry = False
solubilities = 
opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
for opt, arg in opts:
if opt in ('-h', '--help'):
elif opt in ('-f', '--filter'):
fstring = arg
elif opt in ('-d', '--dry'):
dry = True
data = csv.reader(urllib.urlopen(url))
c = 2
for row in data:
line = [x.strip() for x in row]
if len(line) != 25:
print 'ERROR (Line %d) %s' % (c, ','.join(line))
c += 1
print 'Got %d entries' % (len(solubilities))
solubilities = [ x for x in solubilities if x.find("DONOTUSE") == -1]
print 'Got %d entries after filtering invalid entries' % (len(solubilities))
if not dry:
for row in solubilities:
if any(map(lambda x: x == fstring, row)):
print '\t'.join([str(x) for x in row])
print '\t'.join([str(x) for x in row])
I’ve been putting up a number of REST services for a variety of cheminformatics tasks. One that was missing was substructure searching. In many scenarios it’s useful to be able to check whether a target molecule contains a query substructure or not. This can now be done by visiting URL’s of the form
where TARGET and QUERY are SMILES and SMARTS (or SMILES) respectively (appropriately escaped). If the query pattern is found in the target molecule then the resultant page contains the string “true” otherwise it contains the string “false”. The service uses OpenBabel to perform the SMARTS matching.
Using this service, I updated the ONS data query page to allow one to filter results by SMARTS patterns. This generally only makes sense when no specific solute is selected. However, filtering all the entries in the spreadsheet (i.e., any solvent, any solute) can be slow, since each molecule is matched against the SMARTS pattern using a separate HTTP requests. This could be easily fixed using POST, but it’s a hack anyway since this type of thing should probably be done in the database (i.e., Google Spreadsheet).
The substructure search service is now updated to accept POST requests. As a result, it is possible to send in multiple SMILES strings and match them against a pattern all at one go. See the repository for a description on how to use the POST method. (The GET method is still supported but you can only match a pattern against one target SMILES). As a result, querying the ONS data using SMARTS pattens is significantly faster.
Using the model deployment and prediction service, I put up the two linear regression models I had built so far (described in more detail here) While REST is nice, a simple web page that allows you to paste a set of SMILES and get back predictions is handy. So I whipped together a simple interface to the prediction service, allowing one to select a model, view the author-generated description and a get a nice (sortable!) table of predicted values. View it here. As noted in my previous post it’s not going to be very fast, but hopefully that will change in the future.
In a previous post, I described a simple web form to query and visualize the solubility data being generated as part of the ONS Challenge. The previous approach required me to manually download the data and load it into a Postgres database. While trivial from a coding point of view, it’s a pain since I have to keep my local DB in sync with the Google Docs spreadsheet.
This is very nice since I now no longer have to maintain a local DB and ensure that it’s in sync with Jean-Claudes results. Of course, there are some drawbacks to this method. First, the query page will assume that the data in the spreadsheet is clean. So if there are two entries called “Ethanol” and “ethanol”, they will be considered seperate solvents. Secondly, this approach cannot be used to include cheminformatics in the queries, since Google doesn’t support that functionality. Finally, it’s not going to be very good for large spreadsheets.
However, this is a very nice API, that allows one to elegantly integrate web applications with live data. I heart Google!
There was a FriendFeed dicussion on the use of RDF triples for representing the solubilty data generated by Jean-Claude and others as part of the ONS Solubility Challenge. Part of the discussion revolved around letting RDF novices easily perform queries of the data being collected. Not knowing much about RDF, I took the raw data from the Google Docs and loaded it into a Postgres database and whipped up a simple query form.
The DB and form are nothing remarkable. But what is cool is that the Google Visualization API makes it really easy for me include charts and other visualizations very easily. For example, if you select “any” as the solvent and then select a solute, the form creates a table of solubilities of that solute in all the solvents it was measured in. A natural view of the data is to look at a bar chart of the solubilities across the various solvents.
var data = new google.visualization.DataTable();
data.addColumn(’number’, ‘Conc (M)’);
data.setValue(0, 0, ‘thf’);
data.setValue(0, 1, 1.23);
data.setValue(1, 0, ‘acetonitrile’);
data.setValue(1, 1, 2.34);
Once you have the data all stored, some more boilerplate code allows us to easily insert the chart into the final web page. Very neat!