Working with Sequences in R

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. So, our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could a write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

1
2
3
4

seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and it’s subclasses, DNAStringSet and RNAStringSet. Using this method got me the siRNA sequences in less than a second.

1 2	seqs <- get_sequences_from_db() seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

1
2
3
4
5

x <- sapply(1:1000, function(x) {
paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) )))

Ideally, my descriptor code would also operate directly on a RNAString object, rather than requiring a character object

One thought on “Working with Sequences in R”

So much to do, so little time

Trying to squeeze sense out of chemical data

One thought on “Working with Sequences in R”

Leave a Reply Cancel reply