Inserting 2D Depictions into R Plots

Recent versions of rcdk allow you to insert images of chemical structures into R plots, via the view.image.2d and rasterImage functions. One problem with the latter function is that the 2D structure image must be positioned in plot (user) units, rather than pixel units. Paul Murrell suggested an easy way to insert the raster image into the plot region while maintaining the native resolution of the image:

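library(rcdk)
## Parse benzophenone and render it as a 200 x 200 pixel raster image
m <- parse.smiles("O=C(C1=CC=CC=C1)C1=CC=CC=C1")[[1]]
img <- view.image.2d(m, 200,200)
plot(10:1, pch=19)

## Position the depiction at the lower left corner
## dpi: device resolution, from the character size in pixels vs. inches
dpi <- (par("cra")/par("cin"))[1]
usr <- par("usr")
xl <- usr[1]
yb <- usr[3]
## Convert the image size from pixels to inches, then to plot units
xr <- xl + xinch(200/dpi)
yt <- yb + yinch(200/dpi)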

rasterImage(img, xl,yb, xr,yt)
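
The same arithmetic can be wrapped in a small helper so it can be reused for other image sizes. The sketch below is only an illustration: the name raster_corners is made up here (it is not part of rcdk), it assumes linear axes and an already open plot, and it uses a synthetic grey raster in place of a structure depiction so that it runs without rcdk or Java.

```r
## Hypothetical helper (not part of rcdk): corners, in plot units, of a
## width_px x height_px image anchored at the lower left of the plot region.
raster_corners <- function(width_px, height_px) {
  dpi <- (par("cra") / par("cin"))[1]   # device pixels per inch
  usr <- par("usr")                     # plot region limits: x1, x2, y1, y2
  c(xleft   = usr[1],
    ybottom = usr[3],
    xright  = usr[1] + xinch(width_px / dpi),
    ytop    = usr[3] + yinch(height_px / dpi))
}

plot(10:1, pch = 19)
## Stand-in image: a 200 x 200 grey gradient instead of a structure depiction
img <- as.raster(matrix(seq(0, 1, length.out = 200 * 200), nrow = 200))
co <- raster_corners(200, 200)
rasterImage(img, co["xleft"], co["ybottom"], co["xright"], co["ytop"])
```

Anchoring at a different corner would only change which entries of par("usr") are used as the starting point.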

PAINS Substructure Filters as SMARTS

Some time back, Baell et al. published an interesting paper describing a set of substructure filters to identify compounds that are promiscuous in high throughput biochemical screens. They termed these compounds Pan Assay Interference Compounds or PAINS. There are a variety of functional groups that are known to be problematic in HTS assays. The reasons for exclusion of molecules with these and other groups range from reactivity towards proteins to poor developmental potential or known toxicity. Derek Lowe has a nice summary of the paper.

The paper published the substructure filters as a collection of Sybyl Line Notation (SLN) patterns. Unfortunately, without access to Sybyl, it’s difficult to reuse the published patterns. Having them in SMARTS form would allow one to use them with many more (open source or commercial) tools. Luckily, Wolf Ihlenfeldt came to the rescue and provided me with access to a version of the CACTVS toolkit that was able to convert the SLN patterns to SMARTS.

There are three files, p_l15, p_l150 and p_m150 corresponding to tables S8, S7 and S6 from the supplementary information. The first column is the pattern and the second column is the name for that pattern taken from the original SLN files. While all patterns were converted to SMARTS, the conversion process is not perfect as I have not been able to reproduce (using the OEChem toolkit with the Tripos aromaticity model) all the hits that were obtained using the original SLN patterns.
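
One practical wrinkle when loading files like these into R is that SMARTS patterns use '#' (atomic numbers, triple bonds), which read.table treats as a comment character by default. The sketch below shows one way around that. It assumes the two columns are whitespace-separated with no header row, which is a guess about the layout, and the two sample lines are made up for illustration; they are not actual PAINS filters.

```r
## Two made-up lines in the pattern/name layout (not actual PAINS filters)
txt <- c("[#6]C(=O)Cl example_acyl_chloride",
         "c1ccccc1[N+](=O)[O-] example_nitro")

## comment.char = "" keeps '#' in the SMARTS; quote = "" stops any quote
## character from being treated as a string delimiter
pains <- read.table(text = txt, header = FALSE, sep = "",
                    comment.char = "", quote = "",
                    col.names = c("smarts", "name"),
                    stringsAsFactors = FALSE)
print(pains)
```

For a real file, the text argument would be replaced by the path to one of the downloaded files.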

(As a side note, the SMARTSViewer is a really handy tool to visualize a SMARTS pattern – which is great since many of the PAINS patterns are very complex)

Updates to R Packages

I’ve uploaded a new version of fingerprint (v 3.4) which now supports feature fingerprints – fingerprints that are represented as variable length vectors of numbers or strings. An example would be circular fingerprints. Now, when reading fingerprints you have to indicate whether you’re loading binary fingerprints or not (via the binary argument in fp.read). A new line parser function (ecfp.lf) is provided to load these types of files, though it’s trivial to write your own. Similarity can be evaluated between feature fingerprints in the usual manner, but the metrics are restricted to Tanimoto and Dice. A function is also available to convert a collection of feature fingerprints into a set of fixed length binary fingerprints (featvec.to.binaryfp) as described here.
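
To make the two supported metrics concrete, here is a minimal base R sketch that treats each feature fingerprint as a set of features. It is an illustration only, not the fingerprint package's implementation or API: the function name feature_similarity and the feature identifiers are made up for the example.

```r
## Set-based similarity between two feature vectors (base R only; an
## illustration, not the fingerprint package's implementation)
feature_similarity <- function(a, b, method = c("tanimoto", "dice")) {
  method <- match.arg(method)
  a <- unique(a)
  b <- unique(b)
  common <- length(intersect(a, b))
  switch(method,
         tanimoto = common / length(union(a, b)),
         dice     = 2 * common / (length(a) + length(b)))
}

fp1 <- c("1234", "-987", "555")   # made-up feature identifiers
fp2 <- c("-987", "555", "42")
feature_similarity(fp1, fp2, "tanimoto")   # 2 shared / 4 distinct = 0.5
feature_similarity(fp1, fp2, "dice")       # 2 * 2 / (3 + 3) = 2/3
```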

New versions of rcdk (v 3.0.4) and rcdklibs (v 1.3.6.3) have also been uploaded to CRAN. These releases are based on today’s CDK 1.4.x branch, resolve a number of bugs and add some new features:

  • Correct formula generation
  • Correct handling of SD tags whose values are just white space
  • Proper generation of Murcko frameworks when molecule objects are requested
  • 3 new descriptors – FMF, acidic group count, basic group count

Working with Sequences in R

I’ve been working on some RNAi projects, and part of that involved generating descriptors for sequences. It turns out that the Biostrings package is very handy and fast. Our database contains a catalog for an siRNA library with ~27,000 target DNA sequences. To get at the siRNA sequence, we need to convert the DNA to RNA and then take the complement of the RNA sequence. Obviously, you could write a function to do the transcription step and the complement step, but the Biostrings package already handles that. So I naively tried

library(Biostrings)
seqs <- get_sequences_from_db()
## Convert one sequence at a time: DNA -> RNA -> complement
seqs <- sapply(seqs, function(x) {
  as.character(complement(RNAString(DNAString(x))))
})

but for the 27,000 sequences it took longer than 5 minutes. I then came across the XStringSet class and its subclasses, DNAStringSet and RNAStringSet, which hold a whole collection of sequences in one object. Using these got me the siRNA sequences in less than a second.

seqs <- get_sequences_from_db()
seqs <- as.character(complement(RNAStringSet(DNAStringSet(seqs))))
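
As a sanity check on what the two steps do, the conversion can also be sketched in base R. This assumes upper-case sequences containing only A, C, G and T; the function name dna_to_sirna is made up here, and unlike Biostrings it does no alphabet validation and ignores IUPAC ambiguity codes.

```r
## Base R cross-check, assuming upper-case sequences with only A, C, G, T.
## DNA -> RNA replaces T with U; the RNA complement swaps A<->U and C<->G,
## so the net per-base mapping is A->U, T->A, C->G, G->C.
dna_to_sirna <- function(dna) chartr("ATCG", "UAGC", dna)

dna_to_sirna(c("ATCG", "GGTA"))   # "UAGC" "CCAU"
```

Since chartr is vectorized over its input, this handles the whole character vector in one call, much like the XStringSet approach.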

A slightly contrived example shows the performance improvement

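library(Biostrings)
## Generate 1000 random 21-mer DNA sequences
x <- sapply(1:1000, function(x) {
    paste(sample(c('A', 'T', 'C', 'G'), 21, replace=TRUE), collapse='')
})
## Vectorized: convert the whole set at once
system.time(y <- as.character(complement(RNAStringSet(DNAStringSet(x)))))
## Element-wise: convert one sequence at a time
system.time(y <- sapply(x, function(z) as.character(complement(RNAString(DNAString(z))) )))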

Ideally, my descriptor code would also operate directly on an RNAString object, rather than requiring a character object.

Call for Papers – ICCS, 2011

This has already been posted on some mailing lists, but one more place can’t hurt. The International Conference on Chemical Structures (ICCS) is coming up in June, 2011 at Noordwijkerhout, The Netherlands. I’m on the scientific advisory board and am planning to attend this meeting, as the topics being covered look pretty interesting, especially those focusing on ‘systems’ aspects of cheminformatics and bioinformatics. The abstract submission deadline is January 31, 2011.

C A L L   F O R   P A P E R S
9th International Conference on Chemical Structures
NH Leeuwenhorst Conference Hotel,
Noordwijkerhout, The Netherlands

5-9 June 2011

Visit the conference website at www.int-conf-chem-structures.org for
more information.

The 9th International Conference on Chemical Structures (ICCS) is
seeking presentations of novel research and emerging technologies for
the following plenary sessions:

o Cheminformatics
> advances in structure representation
> reaction handling and electronic lab notebooks (ELNs)
> molecular similarity and diversity
> chemical information visualization

o Structure-Activity and Structure-Property Prediction
> graphical methods for SAR analysis
> industrialized and large-scale model building
> multi-property prediction and multi-objective optimization

o Structure-Based Drug Design and Virtual Screening
> new docking and scoring approaches
> improved understanding of protein-ligand interactions
> pharmacophore definition and search
> modeling of challenging targets

o Analysis of Large Chemistry Spaces
> mining of chemical literature and patents
> design, profiling and comparison of compound collections and screening sets
> machine learning and knowledge extraction from databases

o Integrated Chemical Information
> advances in chemogenomics
> integration of medical and biological information
> semantic technologies as a driver of integration
> translational informatics

o Dealing with Biological Complexity
> analysis and prediction of poly-pharmacology
> in-silico analysis of toxicology, drug safety, and adverse events
> pathways and biological networks
> druggability of targets

Before and after the official conference program free workshops will be
offered by several companies including BioSolveIT (www.biosolveit.de)
and the Chemical Computing Group (www.chemcomp.com).

Joint Organizers:
o Division of Chemical Information of the American Chemical Society
(CINF)
o Chemical Structure Association Trust (CSA Trust)
o Division of Chemical Information and Computer Science of the
Chemical Society of Japan (CSJ)
o Chemistry-Information-Computer Division of the Society of German
Chemists (GDCh)
o Royal Netherlands Chemical Society (KNCV)
o Chemical Information Group of the Royal Society of Chemistry (RSC)
o Swiss Chemical Society (SCS)

We encourage the submission of papers on both applications and case
studies as well as on method development and algorithmic work. The final
program will be a balance of these two aspects.

From the submissions the program committee and the scientific advisory
board will select about 30 papers for the plenary sessions. All submissions
that cannot be included in the plenary sessions will automatically be
considered for the poster session.

Contributions can be submitted for any of the above and related areas,
but we also welcome contributions in any aspect of the computer handling
of chemical structure information, such as:

o automatic structure elucidation
o combinatorial chemistry, diversity analysis
o web technology and its effect on chemical information
o electronic publishing
o MM or QM/MM simulations
o practical free energy calculations
o modeling of ADME properties
o material sciences
o analysis and prediction of crystal structures
o grid and cloud computing in cheminformatics

Visit the conference website at http://www.int-conf-chem-structures.org for
more information, including details on procedures for online abstract
submission and conference registration.

The deadline for the submission of abstracts is 31 January 2011.

We hope to see you in Noordwijkerhout.

Keith T Taylor, ICCS Chair
Markus Wagener, ICCS Co-Chair