So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘graph’ tag

Maximally Bridging Rings (or, Doing What the Authors Should’ve Done)

with one comment

Recently I came across a paper from Marth et al that described a method based on network analysis to support retrosynthetic planning, particularly for complex natural products. I’m no synthetic chemist so I can’t comment on the relevance or importance of the targets or the significance of the proposed approach to planning a synthetic route. What caught my eye was the claim that

This work validates the utility of network analysis as a starting point for identifying strategies for the syntheses of architecturally complex secondary metabolites.

I was a little disappointed (hey, a Nature publication sets certain expectations!) that the network analysis was fundamentally walking the molecular graph to identify a certain type of ring, termed the maximally bridging ring. The algorithm is described in the SI and the authors make it availableCopaene
as an online tool. Unfortunately they didn’t provide any source code for their algorithm, which was a bit irritating, given that the algorithm is a key component of the paper.

I put together an implementation using the CDK (1.5.12), available in a Github repo. It’s a quick hack, using the parameters specified in the paper, and hasn’t been extensively tested. However it seems to give the correct result for the first few test cases in the SI.

The tool will print out the hash code of the rings recognized as maximally bridging and also generate an SVG depiction with the first such ring highlighted in red, such as shown alongside. You can build a self-contained version of the tool as

1
2
3
git clone git@github.com:rajarshi/maxbridgerings.git
cd maxbridgerings
mvn clean package

The tool can then be run (with the depiction output to Copaene.svg)

1
2
java -jar target/MaximallyBridgingRings-1.0-jar-with-dependencies.jar \
  "CC(C)C1CCC2(C3C1C2CC=C3C)C" Copaene

Written by Rajarshi Guha

December 24th, 2015 at 4:10 am

Exploring ChEMBL Targets with Neo4j

without comments

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditonal RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On it’s own it’s not particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded in to a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as (appropriately changing host name, user and password)

1
2
3
$ mvn package
$ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \
       -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, lets get the graph of the entire ChEMBL target classification hierarchy (and ensuring that we don’t include actual proteins)

1
2
MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})
  RETURN r

(The various annotations such as TargetType and TargetFamily are based on my code). When visualized we get

prolif-bp

Lets get more specific, and extract the kinase portion of the classification hierarchy

1
2
3
4
MATCH (n {TargetType:'TargetFamily'}),
      (m {TargetID:'Kinase'}),
      p = shortestPath( (n)-[:ChildOf*]->(m) )
  RETURN p

prolif-bp

Given that we’ve linked the protein themselves to the target classes, we can now ask for all proteins that are kinases

1
2
3
4
MATCH (m {TargetType:'MolecularTarget'}),
      (n {TargetID:'Kinase'}),
      p = shortestPath( (m)-[*]->(n) )
  RETURN m

Or identify the target classes that are linked to more than 25 proteins

1
2
3
4
MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})
  WITH m, COUNT(r1) AS relCount
  WHERE relCount > 25
  RETURN m

which gives us a table of target classes and counts, part of which is shown below

prolif-bp

Overall this seems to be a very powerful platform to integrate data sources and types and effectively query for relationships. The browser based view is useful to practice Cypher and answer questions of the dataset. But a REST API is available as well as other tools such as Gremlin that allow for much more flexible applications and sophisticated queries.

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm

Tree Widths and Chemical Graphs

without comments

A few days back, Aaron posted a question regarding the use of the tree width of a graph (intuitively, a measure of how tree like a graph is) in a chemical context. The paper that he pointed to was not very informative in terms of chemical applications. The discussion thread then expanded to asking about the utility of this descriptor – could it be used in a QSAR context as a descriptor of molecular structure? Or is it more suitable in a “filtering” scenario, since as Aaron pointed out “Some NP-complete problems become tractable when a graph has bounded treewidth … ” (with graph isomorphism given as an example).

I took a look at the first question – is it a useful descriptor? Yamaguchi et al, seems to indicate that this is a very degenerate descriptor (i.e., different structures give you the same value of the tree width). Luckily, someone had already done the hard work of implementing a variety of algorithms to evaluate tree widths. libtw is a Java library that provides a handy framework to experiment with tree width algorithms. I implemented a simple adapter to convert CDK molecule objects into the graph data structure used by libtw and a driver to process a SMILES file and report the tree width values as well as execution times. While libtw provides a number of tree width algorithms I just used a single one (arbitrarily). The code is available on Github and requires the CDK and libtw jar files to compile and run.

I took a random sample of 10,000 molecules from ChEMBL (also in the Github repository) and evaluated the upper bound of the tree width for each molecule. In addition, I evaluated a few well known topological descriptors for comparison purposes. The four plots summarize the results.

The calculation is certainly very fast, and, surprisingly, doesn’t seem to correlate with molecular size. Apparently, some relatively small molecules take the longest time – but even those are very fast. Unfortunately, the descriptor is indeed degenerate as shown in the top right – a given tree width value shows up for both small and large molecules (the R^2 between number of bonds and tree width is 0.03). The histogram in the lower left indicates that 60% of the molecules had the same value of tree width. In other words, the tree width does not really differentiate bewteen molecular structures (in terms of size or complexity). In contrast, if we consider the Weiner Path index, which has been used extensively in QSAR models, primarily as a measure of branching, we see that it exhibits a much closer relation with molecular size. Other topological measures focusing more specifically on structural complexity such as fragment complexity show similar correlations with molecular size (and with each other).

So in conclusion, I don’t think the tree width is a useful descriptor for modeling purposes.

Written by Rajarshi Guha

February 26th, 2011 at 4:39 pm

Posted in cheminformatics

Tagged with , , ,

Faster Substructure Search in the CDK

with 8 comments

The CDK uses the UniversalIsomorphismTester to perform graph and subgraph isomorphism. However it’s not very efficient and this shows when performing substructure searches over large collections. A quick test where I compared the CDK code to OpenBabel’s obgrep showed that the CDK is nearly forty times slower than OpenBabel. Improvements in this code will enhance SMARTS matching, pharmacophore searching, fingerprinting and descriptors.

The Ullman algorithm is a well known method to perform subgraph isomorphism and even though more than thirty years old, is still used in many applications. I implemented this algorithm, based on the C++ implementation in VFLib, to see whether it’d do better than the method currently used in the CDK.

Read the rest of this entry »

Written by Rajarshi Guha

September 19th, 2008 at 4:12 am