Archive for the ‘Uncategorized’ Category
When running a high-throughput screen, one usually deals with hundreds or even thousands of plates. Due to the vagaries of experiments, some plates will not be very good; that is, the data will be of poor quality for a variety of reasons. Usually we can evaluate various statistical quality metrics to assess which plates are good and which ones need to be redone. A common metric is the Z-factor, which uses the positive and negative control wells. The problem is that if one or two wells have a problem (say, no signal in the negative control) then the Z-factor will be very poor. Yet the plate could be used if we just masked those bad wells.
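To make this concrete, here is a minimal Python sketch of the standard Z-factor calculation from the control wells (this is illustrative only, not our actual analysis code; the function name and sample values are made up):

```python
import statistics

def z_factor(pos_wells, neg_wells):
    """Standard Z-factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    sd = statistics.stdev(pos_wells) + statistics.stdev(neg_wells)
    sep = abs(statistics.mean(pos_wells) - statistics.mean(neg_wells))
    return 1 - 3 * sd / sep

# A single misbehaving negative-control well drags the Z-factor down sharply,
# even though masking that one well would leave a perfectly usable plate.
pos = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 9.8]
good_neg = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 1.1]
bad_neg = good_neg[:-1] + [10.0]   # one well with spurious signal
```

With these values, `z_factor(pos, bad_neg)` is far lower than `z_factor(pos, good_neg)`, even though only one well differs.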
Now, for our current screens (100 plates) manual inspection is boring but doable. As we move to genome-wide screens we need a better way to distinguish truly bad plates from plates that could still be used. One approach is to move to other metrics – SSMD (defined here, with applications to quality control discussed here) is regarded as more effective than the Z-factor – and in fact it’s advisable to look at multiple metrics rather than depend on any single one.
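For reference, SSMD is the difference of the control means scaled by the combined variability. A minimal Python sketch (again illustrative, with a made-up function name):

```python
import math
import statistics

def ssmd(pos_wells, neg_wells):
    """SSMD: (mean_pos - mean_neg) / sqrt(var_pos + var_neg)."""
    num = statistics.mean(pos_wells) - statistics.mean(neg_wells)
    return num / math.sqrt(statistics.variance(pos_wells) +
                           statistics.variance(neg_wells))
```

Note that the denominator combines the two variances in quadrature, rather than summing the standard deviations as the Z-factor does.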
An alternative trick is to compare the Z-factor for a given plate to the trimmed Z-factor, which is evaluated using the trimmed mean and trimmed standard deviation. In our setup we trim 10% of the positive and negative control wells. For a plate that appears to be poor due to one or two bad control wells, the trimmed Z-factor should be significantly higher than the original Z-factor. But for a plate in which, say, the negative control wells all show poor signal, there should not be much of a difference between the two values. The analysis can be performed rapidly using a plot of the two values, as shown below. Given such a plot, we’d probably consider redoing plates whose trimmed Z-factor is less than 0.5 and close to the diagonal (though for RNAi screens, Z’ = 0.5 might be too stringent).
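The trimmed variant simply drops the extreme control values before computing the same statistic. A minimal Python sketch of the idea (the 10% trim matches our setup; the function names are illustrative):

```python
import statistics

def trim(values, frac=0.10):
    """Drop the lowest and highest `frac` of the sorted values."""
    v = sorted(values)
    k = int(len(v) * frac)
    return v[k:len(v) - k] if k else v

def trimmed_z_factor(pos_wells, neg_wells, frac=0.10):
    """Z-factor computed on trimmed control values."""
    p, n = trim(pos_wells, frac), trim(neg_wells, frac)
    sd = statistics.stdev(p) + statistics.stdev(n)
    return 1 - 3 * sd / abs(statistics.mean(p) - statistics.mean(n))
```

A plate whose plain Z-factor is poor but whose trimmed Z-factor recovers is likely salvageable by masking a few wells; a plate where both values are poor probably needs to be redone.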
From the figure below, looking at the Z-factor alone would have suggested 4 or 5 plates to redo. But when compared to the trimmed Z-factor, this comes down to a single plate. Of course, we’d look at other statistics as well, but this is a quick way to identify plates with truly poor Z-factors.
Heatmaps are a common way to visualize matrices, and R provides a variety of methods to generate these diagrams. One of the key features of a heatmap is the color scheme employed. By default the image method uses heat.colors, which ranges from red (lowest values) to white (highest values). Other palettes include rainbow and topo.colors. However, I needed to replicate a color scheme used in a previous publication – dark red for low values and white for high values, passing through shades of pink. While RColorBrewer is an extremely useful package for generating sensible and informative color schemes, it didn’t help me out here – the red palette was limited to 9 colors, whereas I needed a larger range. There’s also the colorschemes method from the ClassDiscovery package, but I haven’t checked that out.
So I ended up making my own color scheme, realizing that shades of pink are a constant red coupled with increasing amounts of blue and green (in the RGB scheme). Thus, generating a matrix of R, G and B values – with a constant red value and increasing blue and green values – and then converting the rows to hex colors (via rgb) gives me the required colors.
rgbs <- cbind(255,
              seq(32, 255, length=256),
              seq(32, 255, length=256)) / 255
cols <- rgb(rgbs[,1], rgbs[,2], rgbs[,3])
And an example of the color scheme:
image(matrix(1:256, nrow=64, byrow=TRUE), col=cols)
Some handy settings when running a query from the command line via sqlplus
set echo off
set heading on
set linesize 1024
set pagesize 0
set tab on
set trims on
set wrap off
-- might want to set column formats here
-- e.g.: column foo format A10
spool stats -- dump results to stats.lst
-- SQL query here
The move to Rockville is complete and I’ve started work at the NCGC. This week has been a bit hectic what with setting up house and getting up to speed at work. But I’m really excited with the stuff that’s going on here – lots of interesting projects in various areas of chemical biology and genomics. Also, being surrounded by some really smart people is invigorating (and leads to some nice lunch discussions). One aspect of the center that I really like is the commitment to Open Source – while there isn’t a whole lot at the moment, there’s some cool stuff coming down the pipeline. And a 30″ Apple Cinema display is sweet.
For now, back to reading up on HTS, RNAi and browsing for sofas.
In a previous post I described how one requires a custom RecordReader class to deal with multi-line records (such as SD files) in a Hadoop program. While it worked fine on a small input file (less than 5MB) I had not addressed the issue of “chunking” and that caused it to fail when dealing with larger files (the code in that post is updated now).
When a Hadoop program is run on an input file, the framework will send chunks of the input file to individual RecordReader instances. Note that it doesn’t actually read the entire file and send around portions of it – that would not scale to petabyte files! Rather, it determines the size of the file and sends start and end offsets into the original file to the RecordReaders. They then seek to the appropriate position in the original file and do their work.
The problem with this is that when a RecordReader receives a chunk (defined in terms of start and end offsets), it can start in the middle of one record and end in the middle of another. This is shown schematically in the figure, where the input file with 5 multi-line, variable-length records is divided into 5 chunks. As you can see, in the general case, chunks don’t start or end on record boundaries.
My initial code failed badly when faced with multiple chunks: rather than recognizing chunk boundaries, it simply read every record in the whole file. Alternatively (and naively), if one simply reads up to a chunk boundary, the first and last records read from that chunk will generally be invalid.
The correct (and simple) strategy for an arbitrary chunk is this: if the start position is not 0, we read bytes from the start position until we reach the first end-of-record marker. In general, the record we just read will be incomplete, so we discard it. We then carry on reading complete records as usual. But if, after reading a record, we note that the current file position is beyond the end position of the current chunk, we flag the chunk as done and just return this last record. Thus, according to the figure, when processing the second chunk from the top, we read in bytes 101 to 120 and discard that data. We then read the remaining portion of Record 3 until the end of the record, at position 250 – even though we’ve gone beyond the chunk boundary at position 200. At that point we flag that we’re done with the chunk.
When another RecordReader instance gets the next chunk, starting at position 200, it will be dropped into the middle of Record 3. But, according to our strategy, we simply read to the end-of-record marker at position 250 and discard the data. This is OK, since the RecordReader instance that handled the previous chunk has already read the whole of Record 3.
The two edge cases here are when the chunk starts at position 0 (beginning of the input file) and the chunk ends at the end of file. In the former case, we don’t discard anything, but simply process the entire chunk plus a bit beyond it to get the entire last record for this chunk. For the latter case, we simply check whether we’re at the end of the file and flag it to the nextKeyValue() method.
The implementation of this strategy is shown in the SDFRecordReader class listing.
In hindsight this is pretty obvious, but I was bashing my head against it for a while – hopefully this explanation saves others some effort.