So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for October, 2010

Updates to R Packages

without comments

I’ve uploaded a new version of fingerprint (v 3.4) which now supports feature fingerprints – fingerprints that are represented as variable length vectors of numbers or strings. An example would be circular fingerprints. Now, when reading fingerprints you have to indicate whether you’re loading binary fingerprints or not (via the binary argument). A new line parser function (ecfp.lf) is provided to load these types of files, though it’s trivial to write your own. Similarity can be evaluated between feature fingerprints in the usual manner, but the metrics are restricted to Tanimoto and Dice. A function is also available to convert a collection of feature fingerprints into a set of fixed length binary fingerprints, as described here.
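To make the two supported metrics concrete, here is a minimal sketch in Java (not R, and not the fingerprint package's API). It assumes each feature fingerprint is treated as a set of distinct features, and the class name, method names and the choice to return 0 for two empty fingerprints are all illustrative assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: this is not the API of the R fingerprint package.
class FeatureSimilarity {

    // Number of features present in both fingerprints.
    static <T> int shared(Set<T> a, Set<T> b) {
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Tanimoto: |A and B| / |A or B|. Returning 0 for two empty sets is an assumption.
    static <T> double tanimoto(Set<T> a, Set<T> b) {
        int c = shared(a, b);
        int union = a.size() + b.size() - c;
        return union == 0 ? 0.0 : (double) c / union;
    }

    // Dice: 2 * |A and B| / (|A| + |B|). Same empty-set convention as above.
    static <T> double dice(Set<T> a, Set<T> b) {
        int total = a.size() + b.size();
        return total == 0 ? 0.0 : 2.0 * shared(a, b) / total;
    }

    public static void main(String[] args) {
        // Hypothetical feature labels, standing in for circular fingerprint features.
        Set<String> fp1 = Set.of("f1", "f2", "f3");
        Set<String> fp2 = Set.of("f2", "f3", "f4");
        System.out.println(tanimoto(fp1, fp2)); // 2 shared out of 4 distinct features
        System.out.println(dice(fp1, fp2));     // 2 * 2 shared out of 6 features in total
    }
}
```

Because both metrics only need the sizes of the two sets and of their overlap, they carry over to variable length fingerprints without a fixed bit length.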

New versions of rcdk (v 3.0.4) and rcdklibs have also been uploaded to CRAN. These releases are based on today’s CDK 1.4.x branch; they resolve a number of bugs and add some new features:

  • Correct formula generation
  • Correct handling of SD tags whose values are just white space
  • Proper generation of Murcko frameworks when molecule objects are requested
  • 3 new descriptors – FMF, acidic group count, basic group count

Written by Rajarshi Guha

October 22nd, 2010 at 1:58 am

Posted in cheminformatics


Working with Sequences in R

with one comment

I’ve been working on some RNAi projects and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and high performance. Our database contains a catalog for an siRNA library with ~ 27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

seqs <- get_sequences_from_db()
seqs <- sapply(seqs, function(x) {
    as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and its subclasses, DNAStringSet and RNAStringSet. Using these classes got me the siRNA sequences in less than a second.

seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))

A slightly contrived example shows the performance improvement

library(Biostrings)
# 1000 random 21-mers
x <- sapply(1:1000, function(x) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
# vectorized: one call on the whole set
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
# one sequence at a time
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) )))

Ideally, my descriptor code would also operate directly on an RNAString object, rather than requiring a character object.
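For readers who do not use R, the two steps described above (DNA to RNA, then the base-wise complement) can be sketched in plain Java. This is only an illustration of the transformation, not a substitute for Biostrings; the class and method names are made up for the example, and it assumes sequences contain only A, C, G and T.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch; names are hypothetical and unrelated to Biostrings.
class SirnaSketch {

    // DNA -> RNA: replace thymine (T) with uracil (U).
    static String toRna(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    // Base-wise complement of an RNA sequence (no reversal): A<->U, C<->G.
    static String complementRna(String rna) {
        StringBuilder sb = new StringBuilder(rna.length());
        for (char base : rna.toCharArray()) {
            switch (base) {
                case 'A': sb.append('U'); break;
                case 'U': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default: throw new IllegalArgumentException("Unexpected base: " + base);
            }
        }
        return sb.toString();
    }

    // Apply both steps to every sequence in a collection.
    static List<String> sirna(List<String> dnaSequences) {
        return dnaSequences.stream()
                .map(SirnaSketch::toRna)
                .map(SirnaSketch::complementRna)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sirna(List.of("ATCG"))); // ATCG -> AUCG -> UAGC
    }
}
```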

Written by Rajarshi Guha

October 20th, 2010 at 10:11 pm

Posted in software,bioinformatics


Call for Papers – ICCS, 2011

without comments

This has already been posted on some mailing lists, but one more place can’t hurt. The International Conference on Chemical Structures (ICCS) is coming up in June, 2011 at Noordwijkerhout, The Netherlands. I’m on the scientific advisory board and am planning to attend this meeting, as the topics being covered look pretty interesting, especially those focusing on ‘systems’ aspects of cheminformatics and bioinformatics. The abstract submission deadline is January 31, 2011.

C A L L   F O R   P A P E R S
9th International Conference on Chemical Structures
NH Leeuwenhorst Conference Hotel,
Noordwijkerhout, The Netherlands

5-9 June 2011

Visit the conference website for more information.

The 9th International Conference on Chemical Structures (ICCS) is
seeking presentations of novel research and emerging technologies for
the following plenary sessions:

o Cheminformatics
> advances in structure representation
> reaction handling and electronic lab notebooks (ELNs)
> molecular similarity and diversity
> chemical information visualization

o Structure-Activity and Structure-Property Prediction
> graphical methods for SAR analysis
> industrialized and large-scale model building
> multi-property prediction and multi-objective optimization

o Structure-Based Drug Design and Virtual Screening
> new docking and scoring approaches
> improved understanding of protein-ligand interactions
> pharmacophore definition and search
> modeling of challenging targets

o Analysis of Large Chemistry Spaces
> mining of chemical literature and patents
> design, profiling and comparison of compound collections and screening sets
> machine learning and knowledge extraction from databases

o Integrated Chemical Information
> advances in chemogenomics
> integration of medical and biological information
> semantic technologies as a driver of integration
> translational informatics

o Dealing with Biological Complexity
> analysis and prediction of poly-pharmacology
> in-silico analysis of toxicology, drug safety, and adverse events
> pathways and biological networks
> druggability of targets

Before and after the official conference program free workshops will be
offered by several companies including BioSolveIT and the Chemical
Computing Group.

Joint Organizers:
o Division of Chemical Information of the American Chemical Society
o Chemical Structure Association Trust (CSA Trust)
o Division of Chemical Information and Computer Science of the
Chemical Society of Japan (CSJ)
o Chemistry-Information-Computer Division of the Society of German
Chemists (GDCh)
o Royal Netherlands Chemical Society (KNCV)
o Chemical Information Group of the Royal Society of Chemistry (RSC)
o Swiss Chemical Society (SCS)

We encourage the submission of papers on both applications and case
studies as well as on method development and algorithmic work. The final
program will be a balance of these two aspects.

From the submissions the program committee and the scientific advisory
board will select about 30 papers for the plenary sessions. All submissions
that cannot be included in the plenary sessions will automatically be
considered for the poster session.

Contributions can be submitted for any of the above and related areas,
but we also welcome contributions in any aspect of the computer handling
of chemical structure information, such as:

o automatic structure elucidation
o combinatorial chemistry, diversity analysis
o web technology and its effect on chemical information
o electronic publishing
o MM or QM/MM simulations
o practical free energy calculations
o modeling of ADME properties
o material sciences
o analysis and prediction of crystal structures
o grid and cloud computing in cheminformatics

Visit the conference website for more information, including details
on procedures for online abstract submission and conference
registration.

The deadline for the submission of abstracts is 31 January 2011.

We hope to see you in Noordwijkerhout.

Keith T Taylor, ICCS Chair
Markus Wagener, ICCS Co-Chair

Written by Rajarshi Guha

October 20th, 2010 at 2:56 am

Posted in research


A Comment on Fingerprint Performance

without comments

In a comment to my previous post on bit collisions in hashed fingerprints, Asad reported on some interesting points which would be useful to have up here:

Very interesting topic. I have faced these challenges while working with fingerprints and here are few observations from my end. By the way I agree that mathematically the best bet is ~ 13%.

1) The hashed FP (CDK) is good enough to separate patterns which are not common but on a large dataset (in my case 10000+ mols), the performance drops drastically. Top 1% hits were good but then rest of the started to loose specificity (esp when Tanimoto score was around 0.77).

2) First I thought it was an artifact of the Tanimoto score… but I wasn’t convinced spl. in cases where we had rings (close vs open). I ended up writing a new FP based on the pubchem patterns as coded in the CDK and added few more patterns to resize it to 1024 from 881. Well! It’s works like magic and I could find much more serialised hits than before. I think the extensions of the fingerprint which I made based on the patterns in my db also helped.

At the end of the day, I believe that all these searches are heuristic and hashed FP is faster to generate but prone to bit clashes where as SMARTS based FPs are slower to generate (as u spend time in MCS) in matching patterns but they are more sensitive and specific as you can trace the patterns (u get what u see) as the patterns and bitset relationship is know and static

Written by Rajarshi Guha

October 9th, 2010 at 3:23 am

Posted in software,cheminformatics


Hashed Fingerprints and RNG’s

with 10 comments

In my previous post I looked at how many collisions in bit positions were observed when generating hashed fingerprints (using the CDK 1024-bit hashed fingerprint and the Java hashCode method). I summarized the results in the form of “bit collision plots” where I plotted the number of times a bit was set to 1 versus the bit position (for a given molecule). As expected, for a series of molecules we observe a number of collisions in multiple bits. What was a little surprising was that even for a symmetric molecule like triphenylphosphine (i.e., a relatively small number of topologically unique paths), we observed collisions in two bits. So I decided to look into this case in a little more detail.

As I noted, collisions could occur if a) different paths get hashed to the same int or b) two different hashes lead to the same random number. Modifying the Fingerprinter code, I was able to generate the list of paths calculated for triphenylphosphine, the hash code for each path and the bit position that was generated for that hash code. The data (hash value, path, bit position) is given below.

-662168118  P-C:C-H 3
-1466409134 H-C:C-P-C:C:C   58
1279033458  C:C:C:C:C-P-C:C:C   78
1434821739  H-C:C:C:C:C:C   80
-429128489  C:C:C:C-P-C 85
-1779215129 C:C:C:C:C   95
-245205916  H-C:C:C-P-C:C:C:C   97
-475263438  C:C:C:C:C:C-P-C:C   111
-1466409532 H-C:C-P-C:C-H   114
1434821341  H-C:C:C:C:C-H   142
-428753296  C:C:C:C:C:C 161
-245206314  H-C:C:C-P-C:C:C-H   161
1730724873  H-C:C:C-P-C 167
43327460    H-C:C:C:C:C-P-C:C   180
1731099668  H-C:C:C:C-H 180
886716569   H-C:C:C:C   182
-1406488727 C:C:C:C:C-P-C:C 191
78342   P-C 193
1731100066  H-C:C:C:C:C 211
63670037    C:C:C   213
178815835   H-C:C:C:C:C-P-C 224
179190630   H-C:C:C:C:C:C-H 230
-469902821  H-C:C-P-C:C:C:C 244
1057365278  C:C:C:C 253
-181369445  H-C:C:C:C-P-C:C 266
1056990085  C:C-P-C 300
886341376   H-C:C-P-C   333
827737424   H-C:C:C 390
-912402715  P-C:C:C:C:C-H   406
1572927053  H-C:C:C-P-C:C-H 421
1434446546  H-C:C:C:C-P-C   440
403512740   H-C:C:C:C:C:C-P-C   442
827737026   H-C:C-H 448
1572927451  H-C:C:C-P-C:C:C 458
-645296786  P-C:C:C:C:C:C-H 493
-688017645  P-C:C:C-H   503
-763796814  C:C:C:C-P-C:C:C:C   512
66252   C:C 574
75288527    P-C:C   600
-605043036  H-C:C-P-C:C:C:C:C   604
1074261266  H-C:C:C-P-C:C   629
-789313769  C:C:C-P-C:C 639
-2139775602 C:C-P-C:C   675
1370539593  H-C:C-P-C:C 698
284569632   C:C:C:C:C-P-C   702
1797624356  H-C:C:C:C-P-C:C:C   710
63294844    C-P-C   719
-912402317  P-C:C:C:C:C:C   725
72  H   741
67  C   742
80  P   744
-1779590322 C:C:C-P-C   762
1797623958  H-C:C:C:C-P-C:C-H   774
347808169   C:C:C:C-P-C:C:C 788
-688017247  P-C:C:C:C   815
240390684   P-C:C:C:C-H 834
-75615648   C:C:C:C-P-C:C   859
240391082   P-C:C:C:C:C 866
886716171   H-C:C:C-H   888
67900359    H-C:C   951
-1046303447 C:C:C:C:C:C-P-C 957
-662167720  P-C:C:C 969
70654   H-C 971
1678681248  C:C:C-P-C:C:C   979
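The mapping behind each row above can be sketched as follows. This is a minimal reconstruction, not the CDK Fingerprinter source: it assumes the hash is Java's String.hashCode of the path string and that the bit position is the first nextInt(1024) from a java.util.Random seeded with that hash, which is consistent with the rows listed. The class and method names are made up for the example.

```java
import java.util.Random;

// Illustrative reconstruction; not taken from the CDK source.
class BitPositionSketch {

    static final int FINGERPRINT_SIZE = 1024;

    // Hash the path string, seed an RNG with the hash, take the first draw.
    static int bitPosition(String path) {
        int hash = path.hashCode();
        return new Random(hash).nextInt(FINGERPRINT_SIZE);
    }

    public static void main(String[] args) {
        String path = "P-C";
        // Prints the hash and the bit position; compare with the P-C row (78342, 193).
        System.out.println(path.hashCode() + " " + path + " " + bitPosition(path));
    }
}
```

Under this scheme a collision can come from either step: two paths with the same hash necessarily share a seed and hence a bit, while two different seeds can still produce the same first draw.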

First, all the hash codes are unique. So clearly the issue lies in the RNG and indeed, we see the following two paths being mapped to the same random integer.

-428753296    C:C:C:C:C:C       161
-245206314    H-C:C:C-P-C:C:C-H 161

Does this mean that the two hash values, when used as seeds to the RNG give the same sequence of random ints? Using the code below

// requires java.util.Random
Random rng1 = new Random(-428753296);
Random rng2 = new Random(-245206314);
for (int i = 0; i < 5; i++) {
    System.out.println(rng1.nextInt(1024) + " " + rng2.nextInt(1024));
}

we generate the first five random integers and we see that they match at the first value but then differ.

161 161
846 40
317 885
461 535
448 982

This suggests that instead of using the first random integer from the RNG seeded by a hash value, we could use the second random integer. Modifying the code to do this still gives collisions in two bits. Once again, looking at the paths, hashes and bit positions, we see that now two different paths get mapped to the same bit position.

886341376      H-C:C-P-C      686
1434821341     H-C:C:C:C:C-H  686

As before, we look at the sequence of random ints obtained from RNGs seeded using these hash values. The resultant sequence looks like:

333 142
686 686
905 1022
70 571
177 384

So now, the two sequences match at the second value. OK, so what happens if we take the third value from the sequence and use that as a bit position? We get exactly the same behavior (collisions at two bit positions), except that now, when we look at the sequences of random ints, they match at the third value.
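The experiment of switching to the second or third draw can be sketched as below. This is not the modified Fingerprinter code; it is a small stand-alone helper (all names are hypothetical) that assumes each path is hashed with String.hashCode, uses the k-th nextInt from a Random seeded with that hash as the bit position, and reports the positions shared by more than one path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative helper; not part of the CDK.
class DrawCollisionSketch {

    // The k-th value (1-based) of nextInt(size) from an RNG seeded with 'seed'.
    static int kthDraw(long seed, int k, int size) {
        if (k < 1) throw new IllegalArgumentException("k must be >= 1");
        Random rng = new Random(seed);
        int value = rng.nextInt(size);
        for (int i = 1; i < k; i++) {
            value = rng.nextInt(size);
        }
        return value;
    }

    // Group paths by the bit position from their k-th draw; keep only shared positions.
    static Map<Integer, List<String>> collisions(List<String> paths, int k, int size) {
        Map<Integer, List<String>> byBit = new TreeMap<>();
        for (String path : paths) {
            int bit = kthDraw(path.hashCode(), k, size);
            byBit.computeIfAbsent(bit, b -> new ArrayList<>()).add(path);
        }
        byBit.values().removeIf(group -> group.size() < 2);
        return byBit;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" have the same String.hashCode, so they collide for every k.
        List<String> paths = List.of("Aa", "BB", "P-C");
        System.out.println(collisions(paths, 1, 1024));
        System.out.println(collisions(paths, 2, 1024));
    }
}
```

Running it over the full list of paths for k = 1, 2, 3 would reproduce the comparison described above, one table of shared bit positions per choice of k.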

This behavior seems a little strange to me – as if there is a pair of seeds such that the “trajectory” of the sequences generated using those seeds will always (?) intersect at a certain point (where point actually corresponds to the n’th element of the sequences).

Maybe this is a property of random sequences? Or a feature of the Java RNG? I’d love to hear if anybody has insight into this behavior.
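One possible way to frame the question, offered as a back-of-the-envelope estimate rather than an answer: if each of the 64 paths listed earlier were assigned a bit position independently and uniformly over 1024 positions (an idealization, not a statement about the Java RNG), some collisions would be expected no matter which draw is used. The sketch below computes the expected number of colliding pairs, n(n-1)/(2m), and the probability of seeing no collision at all; the class and method names are made up for the example.

```java
// Illustrative estimate under an idealized uniform-draw assumption.
class CollisionEstimate {

    // Expected number of colliding pairs among n uniform draws over m positions.
    static double expectedCollidingPairs(int n, int m) {
        return n * (n - 1) / (2.0 * m);
    }

    // Probability that n uniform draws over m positions are all distinct.
    static double probabilityNoCollision(int n, int m) {
        double p = 1.0;
        for (int i = 0; i < n; i++) {
            p *= (m - i) / (double) m;
        }
        return p;
    }

    public static void main(String[] args) {
        int paths = 64;   // rows in the triphenylphosphine listing
        int bits = 1024;  // fingerprint length
        System.out.println(expectedCollidingPairs(paths, bits)); // 64 * 63 / 2048
        System.out.println(probabilityNoCollision(paths, bits));
    }
}
```

With 64 paths and 1024 bits the expected number of colliding pairs is just under 2, so under this idealization observing two collisions for the first, second or third draw alike would not be unusual; only the identity of the colliding pairs would change.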

Written by Rajarshi Guha

October 4th, 2010 at 7:30 am