Recently, on an email thread I was involved in, Egon mentioned that the CDK hashed fingerprints were probably being penalized by the poor hashing provided by Java’s hashCode method. Essentially, he suspected that the collision rate was high, so that many bits were being set multiple times by different paths and that a fraction of bits were not […]
The CDK is 10 Years Old
As Egon has pointed out, the CDK project started 10 years ago tomorrow – congratulations to everybody involved in the project. But also, Egon deserves a huge vote of thanks for keeping the project going – not only in terms of code contributions but also the “grunt” work such as releases, bug fixes, documentation and […]
New Versions of rcdk and rcdklibs
I’ve released an update to rcdk and rcdklibs on CRAN – right now source packages are available, but binary ones should show up soon. Both packages should be updated together. These packages integrate the CDK into the R environment and simplify a number of cheminformatics tasks. These versions use CDK 1.3.6 and JCP 16, […]
Author Count Frequencies in PubMed
Earlier today, Emily Wixson posted a question on the CHMINF-L list asking … if there is any way to count the number of authors of papers with specific keywords in the title by year over a decade … Since I had some code compiling and databases loading I took a quick stab, using Python and […]
Pig and Cheminformatics
Pig is a platform for analyzing large datasets. At its core is a high-level language (called Pig Latin) that is focused on specifying a series of data transformations. Scripts written in Pig Latin are executed by the Pig infrastructure either in local or map/reduce modes (the latter making use of Hadoop). Previously I had […]