In my previous post I looked at how many collisions in bit positions were observed when generating hashed fingerprints (using the CDK 1024-bit hashed fingerprint and the Java hashCode method). I summarized the results in the form of “bit collision plots” where I plotted the number of times a bit was set to 1 versus […]
Path Fingerprints and Hash Quality
Recently, on an email thread I was involved in, Egon mentioned that the CDK hashed fingerprints were probably being penalized by the poor hashing provided by Java’s hashCode method. Essentially, he suspected that the collision rate was high and so that the many bits were being set multiple times by different paths and that a fraction of bits were not […]
The CDK is 10 Years Old
As Egon has pointed out, the CDK project started 10 years ago today tomorrow – congratulations to everybody involved in the project. But also, Egon deserves a huge vote of thanks for keeping the project going – not only in terms of code contributions but also the “grunt” work such as releases, bug fixes, documentation and […]
New Versions of rcdk and rcdklibs
I’ve put released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplifies a number of cheminformatics tasks. These versions used CDK 1.3.6 and JCP 16, […]
Pig and Cheminformatics
Pig is a platform for analyzing large datasets. At its core is a high level language (called Pig Latin), that is focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure either in local or map/reduce modes (the latter making use of Hadoop). Previously I had […]