With the semester winding up and a move to Rockville to prepare for, things have been a bit hectic. However, I've been trying to keep track of the ONS solubility modeling project, and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So I finally put together some Python code to retrieve the solubility data, clean it up, filter out invalid rows (those flagged with the DONOTUSE string) and optionally keep only rows matching a specified string. This lets me get the whole dataset in one go, or just the data for methanol and so on. Note that it doesn't dump all the columns from the original spreadsheet – just the columns I need for modeling. It's a very simplistic script that writes the final data in tab-delimited format to STDOUT.
## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
    print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and processes
Sheet 1 to extract solubility data. Right now it just pulls out the
solute name and smiles, solvent name and solubility. It will filter
out entries that are marked as DONOTUSE. If desired you can aggregate
solubility values for multiple instances of the same compound using a
variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

  -h, --help           This message
  -a, --agg FUNC       Aggregate function. Valid values can be 'mean',
                       'median', 'min', 'max'. Currently this is not supported
  -f, --filter STRING  Only include rows that have a column that exactly
                       matches the specified STRING
  -d, --dry            Don't output data, just report some stats
"""

## URL of the published spreadsheet, exported as CSV
url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

## 0-based column indices of the fields kept for modeling
idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':
    fstring = None
    agg = None   # aggregation function - not implemented yet
    dry = False
    solubilities = []

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hdf:a:",
                                   ["help", "dry", "filter=", "agg="])
    except getopt.GetoptError:
        usage()
        sys.exit(-1)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(1)
        elif opt in ('-f', '--filter'):
            fstring = arg
        elif opt in ('-a', '--agg'):
            agg = arg
        elif opt in ('-d', '--dry'):
            dry = True

    ## pull down the CSV and skip the header row
    data = csv.reader(urllib.urlopen(url))
    data.next()

    c = 2  # spreadsheet row number (row 1 is the header)
    for row in data:
        line = [x.strip() for x in row]
        if len(line) != 25:
            print 'ERROR (Line %d) %s' % (c, ','.join(line))
        else:
            solubilities.append((line[idx_solute_name], line[idx_solute_smiles],
                                 line[idx_solvent_name], line[idx_conc],
                                 line[idx_state]))
        c += 1

    if dry:
        print 'Got %d entries' % (len(solubilities))

    ## drop rows whose solute name is flagged as DONOTUSE
    solubilities = [x for x in solubilities if x[0].find("DONOTUSE") == -1]

    if dry:
        print 'Got %d entries after filtering invalid entries' % (len(solubilities))

    if not dry:
        for row in solubilities:
            ## if a filter string was given, only keep rows where some
            ## column exactly matches it
            if fstring and fstring not in row:
                continue
            print '\t'.join([str(x) for x in row])
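For what it's worth, typical invocations look something like the ones below. The output file names are just placeholders, and since the -f option does an exact match against the columns, the filter value has to be spelled the way the spreadsheet spells it:

# dump the whole cleaned dataset, tab-delimited, to a file
python solsum.py > solubilities.tsv

# just report row counts before and after removing DONOTUSE entries
python solsum.py --dry

# keep only rows where some column exactly matches 'methanol'
python solsum.py --filter methanol > methanol.tsv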