I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. So, our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could a write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried
1 2 3 4 | seqs <- get_sequences_from_db() seqs <- sapply(seqs, function(x) { as.character(complement(RNAString(DNAString(x)))) }) |
but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and it’s subclasses, DNAStringSet and RNAStringSet. Using this method got me the siRNA sequences in less than a second.
1 2 | seqs <- get_sequences_from_db() seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs)))) |
A slightly contrived example shows the performance improvement
1 2 3 4 5 | x <- sapply(1:1000, function(x) { paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='') }) system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x))))) system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) ))) |
Ideally, my descriptor code would also operate directly on a RNAString object, rather than requiring a character object
[…] This post was mentioned on Twitter by Neil Saunders, Neil Saunders. Neil Saunders said: Liked "Working with Sequences in R" http://ff.im/-so6oV […]