Preprocessing ONS Solubility Data

With the semester winding up and preparing to move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So, finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as noted by the DONOTUSE string) and optionally filter rows based on a specified string. This allows me to get the whole dataset at one go, or just the data for methanol etc. Note that it doesn’t dump all the columns from the original spreadsheet – just the columns I need for modeling. A very simplistic script that dumps the final data in tab-delimited format to STDOUT.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93

## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

-h, --help This message
-a, --agg FUNC Aggregate function. Valid values can be
'mean', 'median', 'min', 'max'.
Currently this is not supported
-f, --filter STRING Only include rows that have a column that exactly
matches the specified STRING
-d, --dry Don't output data, just report some stats
"""

url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':

fstring = None
agg = None
dry = False

solubilities = []

try:
opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
except getopt.GetoptError:
usage()
sys.exit(-1)

for opt, arg in opts:
if opt in ('-h', '--help'):
usage()
sys.exit(1)
elif opt in ('-f', '--filter'):
fstring = arg
elif opt in ('-d', '--dry'):
dry = True

data = csv.reader(urllib.urlopen(url))
data.next()
c = 2
for row in data:
line = [x.strip() for x in row]
if len(line) != 25:
print 'ERROR (Line %d) %s' % (c, ','.join(line))
continue
solubilities.append( (line[idx_solute_name],
line[idx_solute_smiles],
line[idx_solvent_name],
line[idx_conc],
line[idx_state]) )
c += 1

if dry:
print 'Got %d entries' % (len(solubilities))
solubilities = [ x for x in solubilities if x[0].find("DONOTUSE") == -1]
if dry:
print 'Got %d entries after filtering invalid entries' % (len(solubilities))

if not dry:
for row in solubilities:
if fstring:
if any(map(lambda x: x == fstring, row)):
print '\t'.join([str(x) for x in row])
continue
else:
print '\t'.join([str(x) for x in row])

So much to do, so little time

Trying to squeeze sense out of chemical data

Leave a Reply Cancel reply