# So much to do, so little time

Trying to squeeze sense out of chemical data

## Maximally Bridging Rings (or, Doing What the Authors Should’ve Done)

with one comment

Recently I came across a paper from Marth et al that described a method based on network analysis to support retrosynthetic planning, particularly for complex natural products. I’m no synthetic chemist so I can’t comment on the relevance or importance of the targets or the significance of the proposed approach to planning a synthetic route. What caught my eye was the claim that

This work validates the utility of network analysis as a starting point for identifying strategies for the syntheses of architecturally complex secondary metabolites.

I was a little disappointed (hey, a Nature publication sets certain expectations!) that the network analysis was fundamentally walking the molecular graph to identify a certain type of ring, termed the maximally bridging ring. The algorithm is described in the SI and the authors make it available
as an online tool. Unfortunately they didn’t provide any source code for their algorithm, which was a bit irritating, given that the algorithm is a key component of the paper.

I put together an implementation using the CDK (1.5.12), available in a Github repo. It’s a quick hack, using the parameters specified in the paper, and hasn’t been extensively tested. However it seems to give the correct result for the first few test cases in the SI.

The tool will print out the hash code of the rings recognized as maximally bridging and also generate an SVG depiction with the first such ring highlighted in red, such as shown alongside. You can build a self-contained version of the tool as

 123 git clone git@github.com:rajarshi/maxbridgerings.git cd maxbridgerings mvn clean package

The tool can then be run (with the depiction output to Copaene.svg)

 12 java -jar target/MaximallyBridgingRings-1.0-jar-with-dependencies.jar \   "CC(C)C1CCC2(C3C1C2CC=C3C)C" Copaene

Written by Rajarshi Guha

December 24th, 2015 at 4:10 am

## Drug-Target Networks & Polypharmacology

I came across Takigawa et al where they address polypharmacology by investigating drug-target pairs.  Their approach is to simultaneously identify substructures from the ligand and subsequences from the target and combine this information to suggest drug-target pairs that represent some form of polypharmacology.  More specifically their hypothesis is that “polypharmacological principles” are embedded in a special set of paired fragments (substructures on the ligand side, subsequence on the target side). When you think about it, this is a more generalized (abstract?) version of a pharmacophore that makes the role of the target explicit.

Their approach originates from two assumptions

These results suggest that targets of promiscuous drugs can be dissimilar, implying that only a small part of each target is related with the principle of polypharmacology.

and

Similarly, recent research shows that smaller drugs in molecular weight are likely to be more promiscuous, suggesting that small fragments in each ligand would be a key to drug promiscuity

… that paired fragments significantly shared in drug-target pairs could be crucial factors behind polypharmacology.

Based on this idea they first apply a frequent itemset algorithm to identify pairs of subgraph (SG) and subsequences (SS), that occur frequently (more than 5%) in the drug-target pairs. After identifying about 10,000 such SS-SG pairs, they define a sparse fingerprint, where each bit corresponds to one such pair. Using these fingerprints they then cluster the drug-target pairs, ending up with a selection of clusters. They then propose that individual clusters represent distinct polypharmacologies.

Our significant substructure pairs partitioned drug-target pairs covering most of approved drugs into clusters, which were clearly separated from each other, implying that each cluster corresponds to a unique polypharmacology type

While the underlying algorithms to obtain their results are nice, a lot of things weren’t clear.

Foremost, given the above quote, it’s not exactly clear from the paper what is meant by “unique polypharmacology type? Given that a cluster will consist of multiple drugs and multiple targets, it is not apparent from the text that a cluster highlights either promiscuity of compounds or ligand preferences for a small number of targets. While I think this is a major issue there are some other lesser problems

• I get the impression that they consider promiscuity and polypharmacology as equivalent concepts. While there is a degree of similarity, I’d regard polypharmacology more as a rationally, controlled type of promiscuity
• Most fragments they highlight in Figure 2 are relatively trivial paths. Certainly, reactive groups can lead to promiscuity; none of the subgraphs list exhibit reactive functionality and their application of the frequent itemset method, using a support of 5% could easily filter these out
• Given they consider arbitrary subsequences of the target, the resulting associations could be meaningless. Again, it’d be interesting to note, in cases where crystal structure is available, how many of the subsequences, in the list of significant SS-SG pairs, lie in or around the binding site. A related question would be, of the SG-SS pairs associated with a cluster, how are individual subsequences distributed? Few unique subsequences could point towards a common binding site or active domain.
• Related to the previous point, it’d be interesting to see in how many of the SG-SS paired fragments, the members correspond to actual interacting motifs (again based on crystal structure data).
• One could argue that just using string subsequences to characterize the target misses information on important ligand-target interactions.

And while they may be the first to consider an analysis of drug-target pairs specifically, the idea of considering ligand and target simultaneously is not new. For example, the SiFT approach is quite similar and was described in 2004.

So, even though the paper seems pretty fuzzy on the supposed polypharmacology that they identify, it is overall an interesting paper (and one of the more interesting cheminformatics applications of frequent itemset methods).

Written by Rajarshi Guha

March 9th, 2011 at 6:07 am

## Getting the GO into a Graph Data Structure

Today while working on a project I needed to get access to the Gene Ontology hierarchy. While there a number of GO browsers such as Amigo, I needed access to the raw data to generate a graph that I could then slice and dice. A few minutes with Python led to a simple solution.

The program parses the OBO 1.2 formatted GO data file (either by directly downloading it or from a local file) and outputs a flat dictionary listing the term ID’s, names, namespace etc and a network representation of the GO hierarchy in ncol format. It uses a simple  (and relatively non-robust) class to represent the data as an undirected graph (not really correct), though it’d be easy to use something like igraph to start doing some real network analysis. It’s certainly not a comprehensive solution, but I thought I’d put it out there.

Written by Rajarshi Guha

January 31st, 2009 at 1:34 am

Posted in software

Tagged with , ,

## Annotating Bioassays

I’ve been working for some time with the PubChem Bioassay collection – a set of 1293 assays that cover a range of techniques (enzymatic, phenotypic etc.), targets and sizes (from 20 molecules to 200,000 molecules). In addition, some assays are primary, high-throughput assays whereas a number of them are smaller, confirmatory assays. While an extremely valuable collection, one of the drawbacks is the lack of curation. This has led to some people saying that the data is too noisy to be useful. Yes, the noise is a problem, but I think there’s still useful data to extract and model.

One of the problems that I have faced is that while one can perform a full text search for assays on PubChem, there is no form of annotations on the assays themselves. One effect of this is that it is difficult to link an assay to other biological resources (though for enzymatic assays, one can determine a Pubmed protein identifier). While working on my bioassay network project, I needed annotations and I didn’t want to do it manually.

Written by Rajarshi Guha

January 25th, 2009 at 5:03 pm