# So much to do, so little time

Trying to squeeze sense out of chemical data

## Exploring ChEMBL Targets with Neo4j

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditional RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On its own it’s not particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

### Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded into a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as follows (changing the host name, user and password as appropriate)

    $ mvn package
    $ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \
           -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, let’s get the graph of the entire ChEMBL target classification hierarchy (ensuring that we don’t include actual proteins)

    MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})
    RETURN r

(The various annotations such as TargetType and TargetFamily are based on my code). When visualized we get

Let’s get more specific, and extract the kinase portion of the classification hierarchy

    MATCH (n {TargetType:'TargetFamily'}),
          (m {TargetID:'Kinase'}),
          p = shortestPath( (n)-[:ChildOf*]->(m) )
    RETURN p

Given that we’ve linked the proteins themselves to the target classes, we can now ask for all proteins that are kinases

    MATCH (m {TargetType:'MolecularTarget'}),
          (n {TargetID:'Kinase'}),
          p = shortestPath( (m)-[*]->(n) )
    RETURN m

Or identify the target classes that are linked to more than 25 proteins

    MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})
    WITH m, COUNT(r1) AS relCount
    WHERE relCount > 25
    RETURN m

which gives us a table of target classes and counts, part of which is shown below

Overall this seems to be a very powerful platform for integrating data sources and types and querying effectively for relationships. The browser-based view is useful for practicing Cypher and asking questions of the dataset. But a REST API is available, as well as other tools such as Gremlin, that allow for much more flexible applications and sophisticated queries.
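
For readers without a Neo4j instance at hand, the logic behind the kinase query and the protein-count query above can be mimicked in plain Python. This is only a sketch on a made-up toy hierarchy: the node names, the `CHILD_OF` and `IS_A` dictionaries and the helper functions are all hypothetical and are not taken from ChEMBL or from the neo4jexpt code. A real application would run the Cypher queries against Neo4j instead.

```python
from collections import Counter

# Hypothetical toy data: target class -> parent class (ChildOf edges)
CHILD_OF = {
    "Kinase": "Enzyme",
    "Protein Kinase": "Kinase",
    "TK": "Protein Kinase",
    "Protease": "Enzyme",
}

# Hypothetical toy data: protein -> class it is directly attached to (IsA edges)
IS_A = {
    "EGFR": "TK",
    "ABL1": "TK",
    "PIK3CA": "Kinase",
    "THROMBIN": "Protease",
}


def ancestors(target_class):
    """Yield target_class and every class reachable by following ChildOf edges."""
    current = target_class
    while current is not None:
        yield current
        current = CHILD_OF.get(current)


def proteins_in_class(class_name):
    """Proteins that have a directed path to class_name."""
    return sorted(p for p, cls in IS_A.items() if class_name in ancestors(cls))


def classes_with_more_than(min_count):
    """Classes directly linked to more than min_count proteins, with counts."""
    counts = Counter(IS_A.values())
    return {cls: n for cls, n in counts.items() if n > min_count}


kinases = proteins_in_class("Kinase")
busy_classes = classes_with_more_than(1)
print(kinases)       # ['ABL1', 'EGFR', 'PIK3CA']
print(busy_classes)  # {'TK': 2}
```

The graph database earns its keep once the hierarchy is deep and other datasets hang off the targets, at which point hand-rolled traversals like this become tedious.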

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm

## rinchi – An R package to generate InChI’s and InChI Keys

While trying to update rcdk on CRAN it was pointed out to me that usage of the library resulted in modifications to the user’s home directory. Specifically, this occurred when generating InChIs. The CDK makes use of jni-inchi, which in turn depends on JNATI, which enables Java code to work with native libraries in a platform-independent fashion. As part of this, it creates $HOME/.jnati – which is a no-no for CRAN packages. To resolve this, the latest version of rcdklibs excludes the InChI module and its dependencies. Hopefully rcdk and rcdklibs will now pass CRAN QC.

To access InChI functionality in R you can use the rinchi package, which is hosted on GitHub. Since it will modify the user’s home directory, it cannot be hosted on CRAN. However, it’s easy enough to install

    library(devtools)
    install_github("cdkr", "rajarshi", subdir="rinchi")

Importantly, if all you need is to go from SMILES to InChI, there is no need to install rcdk as well. So the following works

    inchi <- get.inchi('CCC')
    inchik <- get.inchi.key('CCC')

But if you do have a molecule object obtained via rcdk, you can also pass that in to get an InChI or InChI key representation.

Written by Rajarshi Guha

August 30th, 2014 at 6:23 pm

Posted in software,cheminformatics

## Fingerprint Similarity Searches in MongoDB

A few of my recent projects have involved the use of MongoDB, primarily for the ease afforded by a schemaless environment. Some time back I had investigated the use of MongoDB to store chemical structure data, though those efforts did not actually query structures per se; instead they queried for precomputed numeric or text properties. So my interest was piqued when I came across a post from Datablend that described how to use the aggregation framework to perform similarity searching using fingerprints. Specifically, their approach employs an integer representation for fingerprints – these can represent bit positions or hash codes (for path-based fingerprints). Another blog post indicates they are able to perform similarity searches over 30M molecules in milliseconds. So I was interested in seeing what type of performance I could get on a local installation, albeit with a smaller set of molecules. All the data and code to regenerate these results are available in the mongosim repository (you’ll need to unzip fp.txt for the loading and profiling scripts).

I extracted 1M compounds from ChEMBL v17 and used the CDK to evaluate the Signature fingerprint. This resulted in 993,620 fingerprints. These were loaded into MongoDB (v2.4.9) using the simple Python script

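    import pymongo

    client = pymongo.MongoClient()
    db = client.sim
    coll = db.compounds

    docs = []
    with open('fp.txt', 'r') as infile:
        infile.readline()  # discard the first line of the file
        for line in infile:
            # skip records that have no fingerprint bits
            if line.strip().find(" ") == -1: continue
            molregno, bits = line.strip().split(" ")
            bits = [int(b) for b in bits.split(",")]
            doc = {"molregno":molregno,
                   "fp":bits,
                   "fpcount":len(bits),
                   "smi":""}
            docs.append(doc)
            # insert in batches of 5000 documents
            # (insert_many needs pymongo >= 3.0; older releases used coll.insert)
            if len(docs) == 5000:
                coll.insert_many(docs)
                docs = []

    # insert whatever is left over from the final partial batch
    if docs:
        coll.insert_many(docs)

    # index the bit count so it can serve as a cheap pre-filter
    coll.create_index([('fpcount', pymongo.ASCENDING)])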

I then used the first 1000 fingerprints as queries – each time looking for the compounds in the database that exhibited a Tanimoto score greater than 0.9 with the query fingerprint. The aggregation pipeline is shown in profile.py and is pretty much the same as described in the Datablend post. I specifically implement the bounds described by Swamidass and Baldi (which I think Datablend also uses, but the reference seems wrong), allowing me to first filter on bit counts before doing the heavy lifting. All of this was run on a MacBook Pro with 16GB RAM and a single core.
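
To make the filtering idea concrete, here is a minimal sketch of the pieces involved. It is not the contents of profile.py. The function names are mine, the pipeline is only my reading of the general approach (filter on bit count, count shared bits, compute the Tanimoto score), and the field names follow the loading script above. The bit-count window comes from the observation that a query with A bits can only reach a Tanimoto score of t against fingerprints whose bit count B satisfies A·t ≤ B ≤ A/t. Nothing here talks to a database; the pipeline is just assembled as a list of dictionaries.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as lists of integers."""
    a, b = set(fp_a), set(fp_b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def count_bounds(query_count, threshold, eps=1e-9):
    """Bit-count window outside of which Tanimoto >= threshold is impossible.

    eps guards against floating point rounding nudging a bound past an integer.
    """
    low = math.ceil(query_count * threshold - eps)
    high = math.floor(query_count / threshold + eps)
    return low, high


def build_pipeline(query_fp, threshold):
    """Assemble an aggregation pipeline as plain dicts (nothing is executed)."""
    qn = len(query_fp)
    low, high = count_bounds(qn, threshold)
    return [
        # cheap pre-filter on the indexed bit count
        {"$match": {"fpcount": {"$gte": low, "$lte": high}}},
        # one document per fingerprint bit, keeping only bits shared with the query
        {"$unwind": "$fp"},
        {"$match": {"fp": {"$in": list(query_fp)}}},
        # count the shared bits per compound
        {"$group": {"_id": "$molregno",
                    "common": {"$sum": 1},
                    "fpcount": {"$first": "$fpcount"}}},
        # Tanimoto = common / (query bits + compound bits - common)
        {"$project": {"tanimoto": {"$divide": [
            "$common",
            {"$subtract": [{"$add": [qn, "$fpcount"]}, "$common"]}]}}},
        {"$match": {"tanimoto": {"$gte": threshold}}},
    ]


query = [1, 5, 9, 12, 20, 33, 41, 57, 60, 88]
candidate = [1, 5, 9, 12, 20, 33, 41, 57, 60]
pipeline = build_pipeline(query, 0.9)
print(count_bounds(len(query), 0.9))  # (9, 11)
print(tanimoto(query, candidate))     # 0.9
```

With a collection loaded as above, such a list would be handed to `coll.aggregate(...)`; the first `$match` stage is what lets the index on fpcount do some of the work before the expensive `$unwind`.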

The performance was surprisingly slow. Over a thousand queries, the median query time was 6332 ms, with the 95th percentile query time being 7599 ms. The Datablend post describing this approach indicated that it got them very good performance, and their subsequent post about their Similr service indicates that they achieve millisecond query times on PubChem-sized (30M) collections. I assume there are memory tweaks along with sharding that could let one achieve this level of performance, but there don’t appear to be any details.
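
For anyone wanting to collect comparable numbers, timings like these can be gathered with a small harness along the following lines. This is a sketch rather than the profiling script from the repository: `time_queries` and `percentile` are hypothetical helpers, `run_query` stands in for whatever function executes one similarity search, and the percentile uses the simple nearest-rank definition.

```python
import math
import statistics
import time


def time_queries(run_query, queries):
    """Return wall-clock times in milliseconds, one per query."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def percentile(values, pct):
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Stand-in workload; a real run would pass a function that queries MongoDB.
timings = time_queries(lambda q: sum(q), [[1, 2], [3, 4], [5]])
print("median (ms):", statistics.median(timings))
print("95th percentile (ms):", percentile(timings, 95))
```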

I should point out that NCATS has already released code to allow fast similarity search using an in-memory fingerprint index that supports millisecond query times over PubChem-sized collections.

Written by Rajarshi Guha

July 23rd, 2014 at 2:44 pm

## Accessing Chemistry on the Web Using Firefox

With the profusion of chemical information on the web – in the form of chemical names, images of structures, specific codes (InChI etc.) – it’s sometimes very useful to be able to seamlessly retrieve some extra information while browsing a page that contains such entities. The usual way is to copy the InChI/SMILES/CAS/name string and paste it into PubChem, ChemSpider and so on.

However, a much smoother way is now available via a Firefox extension, called NCATSFind, developed by my colleague. It’s a one click install and once installed, automatically identifies a variety of chemical id codes (CAS number, InChI, UNII) and when such entities are identified uses a variety of backend services to provide context. In addition, it has a cool feature that lets you select an image and generate a structure (using OSRA in the background).

Check out his blog post for more details.

Written by Rajarshi Guha

July 23rd, 2014 at 1:35 pm