Archive for the ‘fingerprint’ tag
Sometime back John Van Drie and I had developed the Structure Activity Landscape Index (SALI), which is a way to quantify activity cliffs – pairs of compounds which are structurally very similar but have significantly different activities. In preparation for a talk on SALI at the Boston ACS, I was looking for SAR datasets that contained cliffs. It turns out that ChEMBL is a a great resource for SAR data. And with the EBI providing database dumps it’s very easy to query across the entire collection to find datasets of interest.
For the purposes of this talk, I wanted to see what the datasets looked like in terms of the presence (or absence of cliffs). Given that the idea of an activity cliff is only sensible for ligand receptor type interactions, I only considered compound sets associated with binding assays. Furthermore, I only considered those assays which involved human targets, had a confidence score greater than 8 and contained between 75 and 500 molecules. (If you have an Oracle installation of ChEMBL then this SQL snippet will get you the list of assays satisfying these constraints).
This gives us 31 assays, which we can now analyze. For the purposes of this note, I evaluated the CDK hashed fingerprints and used the standardized activities to generate the pairwise SALI values for each of the datasets (performing the appropriate log transformation of the activities when required). The matrices that represent the pairwise SALI values are plotted in the heatmap montage below (the ChEMBL assay ID is noted in each image) where black represents the minimum SALI value and white represents the maximum SALI value for that dataset. (See the original paper for more details on this representation.) Clearly, the “roughness” of the activity landscape differs from dataset to dataset.
At this point I haven’t looked in depth into each dataset to characterize the landscapes in more detail, but this is a quick summary of multiple datasets. (Though a few datasets contain cliffs which are derived from stereoiomers and hence may not actually be real cliffs – since their activity difference may be small, but will look structurally identical to the fingerprint).
An alternative and useful representation is to convert the SALI values for a dataset into an empirical cumulative distribution function to provide a more quantitative view of how cliffs are distributed within a landscape. I’ll leave those details for the talk.
This morning Egon reported that he had implemented a new fingerprinter for the CDK, which only considered hybridization rather than looking at aromaticity. As a result this approach does not require aromaticity perception. I took a quick look to see how it performs in a virtual screening benchmark. Firstly, it’s faster than the other CDK hashed fingerprints – 15,030 fingerprint calculations took ~ 60s with the hybridization only fingerprint. In contrast the extended fingerprint took 80s for the same set of molecules. To test the utility of the fingerprint in a virtual screening scenario I evaluated enrichment curves (see here for a comprehensive comparison of CDK fingerprints) using the AID 692 MUV benchmark dataset. The plots below show the enrichment curves for the first 5% of the database and the entire database. The red curve corresponds to random selections. (In this experiment the database consists of 15,000 decoys and 30 actives). The enrichment factor for the standard, extended and hybiridization only fingerprints were 0.94, 1.06 and 1.38 respectively.
Overall, the hybridization only fingerprint performs comparably to the extended fingerprint and better than the standard one. But at a small percentage of the database screened, it appears that this fingerprint outperforms both. Of course, this is only one dataset, and more MUV datasets should be analyzed to get a more comprehensive view.
I’ve just uploaded a new version of the fingerprint package (v3.3) to CRAN that implements some ideas described in Nisius and Bajorath. First, the balance method generates “balanced code” fingerprints, which given an input fingerprint of N bits, returns a new fingerprint of 2N bits, such that the bit density is exactly 50%. Second, bit.importance is a method to evaluate the importance of each bit in a fingerprint, in terms of the Kullback-Liebler divergence between a collection of actives and background molecules. In other words, the method ranks the bits in terms of their ability to discriminate between the actives and the background molecules.
A few days ago, GSK released an approximately 13,000 member compound library (using the CC0 license) that had been tested for activity against P. falciparum. The structures and data have been deposited into ChEMBL and a paper is available, that describes the screening project and results. Following this announcement there was a thread on FriendFeed, where Jean-Claude Bradley suggested that it might be useful to compare the GSK library with a virtual library of about 117,000 Ugi compounds that he’s been using in the Open Notebook malaria project.
There are many ways to do this type of comparison – ranging from a pairwise similarity search to looking at the overlap of the distribution of compound properties in some pre-defined descriptor space. Given the size of the datasets, I decided to look at a faster, but cruder option using the idea of bit spectra, which is essentially the normalized frequency of bits in a binary fingerprint across a dataset.
I evaluated the 881-bit PubChem fingerprints for the two datasets using the CDK and then evaluated the bit spectra using the fingerprint package in R. We can then compare the datasets (at least in terms of the PubChem fingerprint features) by plotting the bit spectra. The two spectra are pretty similar, suggesting very similar distributions of functional groups. However there are a number of differences. For example, for bit positions 145 – 155, the GSK library has a higher occurrence than the Ugi library. These features focus on various types of 5-member rings. Another region of difference occurs around bit position 300 and then around positions 350-375.
The static visualization shown here is a simple summary of the similarity of the datasets, but with appropriate interactive graphics one could easily focus on the specific regions of interest. Another way would be to evaluate the difference spectrum and quickly identify features that are more prevalent in the Ugi library compared to the GSK library (i.e., positive values in the plot shown here) and vice versa.
On Thursay I joined Antony Williams as a guest lecturer in Jean Claude-Bradleys‘ class on chemical information retrieval at Drexel University. Using a combination of WebEx and Skype, we were able to give our presentations – seamlessly joining three different locations. Technology is great! Tony gave an excellent talk on citizen science and ChemSpider and I spoke about similarity and searching. Jean Claude has also put up an audio version.