So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘sirna’ tag

Working with Sequences in R

with one comment

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. So, our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could a write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

1
2
3
4
seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
  as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and it’s subclasses, DNAStringSet and RNAStringSet. Using this method got me the siRNA sequences in less than a second.

1
2
seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

1
2
3
4
5
x <- sapply(1:1000, function(x) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) )))

Ideally, my descriptor code would also operate directly on a RNAString object, rather than requiring a character object

Written by Rajarshi Guha

October 20th, 2010 at 10:11 pm

Posted in bioinformatics,software

Tagged with , ,