Cheminformatics in R – rcdk

Being an R aficionado, I do the bulk of my work in R, and having grown up with Emacs, I tend to dislike having to exit my environment to do “other” stuff. This was the motivation for integrating R and the CDK, so that I could access and manipulate chemical information from within my R session. The result was the rcdk package.

Since then there have been many improvements in the CDK; the latest version of rcdk (2.9.2) includes them and also exposes much more of the CDK through R idioms. As the original J. Stat. Soft. paper is now largely out of date, we have included a tutorial in the form of a vignette. The latest version of rcdk is also much smaller, since we have split the actual CDK libraries out into a separate package called rcdklibs. This lets us release new versions of rcdk without requiring a bulky download each time, since rcdklibs should change at a slower pace. I’d also like to thank Miguel Rojas Cherto for his contributions to this version of rcdk (as well as to rpubchem).

So what can you do with rcdk? Installation is pretty simple: just point your favorite interface to CRAN (or a mirror) and it should fetch the package along with all its dependencies. After loading the library, you can read in any file format that the CDK supports or directly parse a SMILES string:

mols <- load.molecules("mymols.sdf")
mol.smiles <- parse.smiles("CC(=O)Cc1cc(Cl)ccc1")

which gives you a list of molecule objects. Note that these objects are actually pointers to Java objects, so you can’t serialize them via R’s save command. This is a pain, so I’m planning to implement some code generators that will create S4 classes directly from the Java class definitions.
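Until something like that exists, one possible workaround is to save SMILES strings, which are plain R character vectors, instead of the Java references, and to re-parse them later. The sketch below assumes that connectivity is all you need to keep: coordinates and molecule properties are lost in the round trip, and the default SMILES output may not retain every detail of the original structure. The example molecules and the temporary file are purely illustrative.

```r
library(rcdk)

# Parse a couple of example molecules (the SMILES strings are illustrative)
mols <- parse.smiles(c("CC(=O)Cc1cc(Cl)ccc1", "CCO"))

# Convert the Java-backed molecule objects into plain character strings
smi <- vapply(mols, get.smiles, character(1))

# Character vectors serialize without any problems
tmp <- tempfile(fileext = ".rds")
saveRDS(smi, tmp)

# Later (for example in a new session): reload the strings and rebuild
# the molecule objects from them
smi.restored <- readRDS(tmp)
mols.restored <- parse.smiles(unname(smi.restored))
```

The same idea works with save and load; saveRDS is used here only because it stores a single object under a name of your choosing.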

Once you have a molecule object you can do a variety of things:

## view molecule depictions
view.molecule.2d(mols)

## evaluate fingerprints
fps <- lapply(mols, get.fingerprint, type="maccs")

## generate descriptors
dnames <- get.desc.names("topological")
descs <- eval.desc(mols, dnames)
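Before a descriptor table like this goes into a model it usually needs some cleaning, since a descriptor that cannot be computed for a molecule shows up as NA, and some columns carry no information for a given dataset. The following is a minimal sketch of that step, not a prescribed workflow; the three example molecules and the filtering rules are assumptions chosen for illustration.

```r
library(rcdk)

# A small illustrative set of molecules
mols <- parse.smiles(c("CCO", "c1ccccc1", "CC(=O)O"))

# Evaluate all topological descriptors; the result is a data.frame with
# one row per molecule and one column per descriptor value
dnames <- get.desc.names("topological")
descs <- eval.desc(mols, dnames)

# Drop descriptor columns that contain any missing values
keep <- colSums(is.na(descs)) == 0
descs.clean <- descs[, keep, drop = FALSE]

# Drop columns that are constant across all molecules
varies <- vapply(descs.clean, function(x) length(unique(x)) > 1, logical(1))
descs.clean <- descs.clean[, varies, drop = FALSE]
```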

One problem with the depiction code is that it does not work well on OS X, owing to interactions between rJava and the R event handling loop: depictions show up, but you can’t interact with the window. It works fine on Linux and Windows. To handle fingerprints easily, I suggest using the fingerprint package. There are also methods to access atoms, bonds, molecule properties and so on.
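As a rough illustration of the last two points, the sketch below computes MACCS fingerprints with rcdk, passes them to fp.sim.matrix from the fingerprint package to get pairwise Tanimoto similarities, and then reads the element symbols of one molecule through get.atoms and get.symbol. The three molecules are arbitrary examples, and the choice of Tanimoto is an assumption rather than a recommendation.

```r
library(rcdk)
library(fingerprint)

# Three illustrative molecules: two close analogs and ethanol
mols <- parse.smiles(c("CC(=O)Cc1cc(Cl)ccc1", "CC(=O)Cc1ccccc1", "CCO"))

# One MACCS fingerprint object per molecule
fps <- lapply(mols, get.fingerprint, type = "maccs")

# Pairwise Tanimoto similarity matrix (one row and column per molecule)
sim <- fp.sim.matrix(fps, method = "tanimoto")

# Access the atoms of the first molecule and read their element symbols
atoms <- get.atoms(mols[[1]])
symbols <- vapply(atoms, get.symbol, character(1))
```

A similarity matrix like sim can be turned into a distance matrix with as.dist(1 - sim) for use with hclust and similar functions.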

4 thoughts on “Cheminformatics in R – rcdk”

  1. Alex says:

    Hi, I recently installed rcdk in R 2.10.1 on Windows 7 (64 bit), however I cannot get view.molecule.2d to run. I get a “Currently disabled” message; if I comment that out, I get the following:

    Error in .jnew("org/guha/rcdk/view/ViewMolecule2DTable", array, as.integer(ncol), :
    java.lang.NoClassDefFoundError: org/guha/rcdk/view/ViewMolecule2DTable

    Would you have any idea what the problem is?

  2. Yes, that is disabled because I haven’t gotten round to updating it. You get the Java error since the underlying Java code is in flux and needs updating.

    BTW, you should use the latest version (2.9.21), which will require updating R to 2.11.

  3. Alireza says:

    I am wondering what the difference is between fingerprints and descriptors in practice. Why are their corresponding packages separated, and which one should be used under which conditions?

    As far as I understand from your nice tutorial, fingerprints are more appropriate for building a similarity matrix, while descriptors are for prediction. Is that true? (However, similarity and prediction sound like the same thing to me!)

    I would appreciate it if you could elaborate on this issue a bit more.

    • Fingerprints are in fact descriptors; they can be used in much the same manner. The fingerprint package focuses on manipulating fingerprint data, while the rcdk package calculates both fingerprints and descriptors.
