Preprocessing ONS Solubility Data

With the semester winding up and preparing to move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So, finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as noted by the DONOTUSE string) and optionally filter rows based on a specified string. This allows me to get the whole dataset at one go, or just the data for methanol etc. Note that it doesn’t dump all the columns from the original spreadsheet – just the columns I need for modeling. A very simplistic script that dumps the final data in tab-delimited format to STDOUT.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
    print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

-h, --help            This message
-a, --agg FUNC        Aggregate function. Valid values can be
                      'mean', 'median', 'min', 'max'.
                      Currently this is not supported
-f, --filter STRING   Only include rows that have a column that exactly
                      matches the specified STRING
-d, --dry             Don't output data, just report some stats
"""


url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':

    fstring = None
    agg = None
    dry = False

    solubilities = []

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
    except getopt.GetoptError:
        usage()
        sys.exit(-1)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(1)
        elif opt in ('-f', '--filter'):
            fstring = arg
        elif opt in ('-d', '--dry'):
            dry = True

    data = csv.reader(urllib.urlopen(url))
    data.next()
    c = 2
    for row in data:
        line = [x.strip() for x in row]
        if len(line) != 25:
            print 'ERROR (Line %d) %s' % (c, ','.join(line))
            continue
        solubilities.append( (line[idx_solute_name],
                              line[idx_solute_smiles],
                              line[idx_solvent_name],
                              line[idx_conc],
                              line[idx_state]) )
        c += 1

    if dry:
        print 'Got %d entries' % (len(solubilities))
    solubilities = [ x for x in solubilities if x[0].find("DONOTUSE") == -1]
    if dry:
        print 'Got %d entries after filtering invalid entries' % (len(solubilities))

    if not dry:
        for row in solubilities:
            if fstring:
                if any(map(lambda x: x == fstring, row)):
                    print '\t'.join([str(x) for x in row])
                    continue
            else:
                print '\t'.join([str(x) for x in row])

Leave a Reply

Your email address will not be published. Required fields are marked *