So much to do, so little time

Trying to squeeze sense out of chemical data

Elemental Words

without comments

Last night, my colleague Matthew Hall tweeted, asking about words that can be spelled out entirely using element symbols.

With the recent news of the 7th row of the periodic table being filled, I figured this would be a good time to follow up on Matthew’s request and identify such elemental words.

There are a lot of word lists available online. Being an ex-Scrabble addict, the OSPD came to mind, so using the SOWPODS word list of 267,751 words I put together a quick Python program to identify words that can be constructed from 1- and 2-letter element symbols. (The newly confirmed elements – Uut, Uuo, Uup & Uus – don’t occur in any English words.) Importantly, the two letters of a 2-letter element symbol must occur contiguously in the word. This means that a word like ABRI (a shelter) is not an elemental word: it contains Boron and Iodine, but the A and R are not contiguous and so cannot correspond to Argon. (It could also be parsed as Bromine and Iodine, but then the remaining A doesn’t match any element.)

The code below takes 2.0s (down from the original 4.1s) to process SOWPODS and identifies 40,989 “elemental words” (the initial, buggy version found only 19,698; see the updates below). Thanks to Noel O’Boyle for suggesting the use of a regex that directly extracts matches (avoiding a loop over individual words) and to Rich Lewis for generating output in element-case.

from __future__ import print_function
import sys
import re

if len(sys.argv) != 2:
    print('Usage: code.py WORD_LIST_FILE_NAME')
    sys.exit(1)

wordlist = sys.argv[1]
with open(wordlist, 'r') as f:
    words = f.read()
print('Dictionary has %d words' % len(re.findall('\n', words)))

# map lower-cased element symbols to their canonical case (e.g., 'na' -> 'Na')
with open('elements.txt', 'r') as eles:
    elems = {e.lower(): e for e in eles.read().split() if e != ''}

# a word is elemental if it can be written entirely as a sequence of element
# symbols; the multiline regex tests every word of the dictionary in one pass
valid_w = re.findall('(^(?:' + '|'.join(elems.keys()) + ')+?$)', words, re.I | re.M)
print('Found %d elemental words' % len(valid_w))

# rewrite each match in element-case, trying longer symbols first so that a
# 2-letter symbol is preferred over two 1-letter ones where both are possible
symbols = sorted(elems.keys(), key=len, reverse=True)
pattern = re.compile('|'.join(symbols), re.I)
elementify = lambda s: pattern.sub(lambda x: elems[x.group().lower()], s)

with open('elemental-%s' % wordlist, 'w') as o:
    for w in valid_w:
        o.write(elementify(w) + '\n')
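To try it out, run the script with the word list as the only argument (the file names here are just placeholders):

python code.py sowpods.txt

Per the output naming in the script, the element-cased matches end up in elemental-sowpods.txt.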

Just for fun I also extracted all the titles from Wiktionary, irrespective of language. That gives me a list of 2,726,436 words to examine. After 20s (down from 35s with the original code) I got 370,724 “elemental words” (the buggy version had found 148,211).
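If you want to reproduce this, one convenient source of such a title list – an assumption on my part, since I don’t know exactly which extract was used here – is the all-titles file from the Wikimedia dumps:

wget https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-all-titles-in-ns0.gz
gunzip enwiktionary-latest-all-titles-in-ns0.gz
python code.py enwiktionary-latest-all-titles-in-ns0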

You can find the code, along with the element symbol list and input files, in this repository.

Update: Thanks to Noel’s suggestion of a regex, I realized my initial implementation had a bug and did not identify all elemental words in a dictionary. The updated code now does, and does it 50% faster.

Update: Thanks to Rich Lewis for providing a patch to output matching words in element-case (e.g., AcOUSTiCAl).

Written by Rajarshi Guha

January 3rd, 2016 at 3:07 pm

Maximally Bridging Rings (or, Doing What the Authors Should’ve Done)

with one comment

Recently I came across a paper from Marth et al. that describes a method based on network analysis to support retrosynthetic planning, particularly for complex natural products. I’m no synthetic chemist, so I can’t comment on the relevance or importance of the targets or the significance of the proposed approach to planning a synthetic route. What caught my eye was the claim that

This work validates the utility of network analysis as a starting point for identifying strategies for the syntheses of architecturally complex secondary metabolites.

I was a little disappointed (hey, a Nature publication sets certain expectations!) that the network analysis was fundamentally a walk of the molecular graph to identify a certain type of ring, termed the maximally bridging ring. The algorithm is described in the SI and the authors make it available as an online tool. Unfortunately they didn’t provide any source code for their algorithm, which was a bit irritating given that the algorithm is a key component of the paper.

[Figure: copaene, with the maximally bridging ring highlighted in red]

I put together an implementation using the CDK (1.5.12), available in a GitHub repo. It’s a quick hack, using the parameters specified in the paper, and it hasn’t been extensively tested. However, it seems to give the correct result for the first few test cases in the SI.
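For the curious, the starting point of such an analysis is easy to prototype with the CDK. The sketch below is not the authors’ algorithm (their bridging definition is in the SI) – it simply parses a SMILES, enumerates all rings and counts, as a crude stand-in, the atoms each ring shares with other rings:

import org.openscience.cdk.graph.Cycles;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IRingSet;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

public class RingSketch {
    public static void main(String[] args) throws Exception {
        SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
        IAtomContainer mol = sp.parseSmiles("CC(C)C1CCC2(C3C1C2CC=C3C)C"); // copaene
        // exhaustive ring perception (can be intractable for very complex graphs)
        IRingSet rings = Cycles.all(mol).toRingSet();
        for (int i = 0; i < rings.getAtomContainerCount(); i++) {
            IAtomContainer ring = rings.getAtomContainer(i);
            // count atoms of this ring that also occur in some other ring
            int shared = 0;
            for (IAtom atom : ring.atoms()) {
                for (int j = 0; j < rings.getAtomContainerCount(); j++) {
                    if (j != i && rings.getAtomContainer(j).contains(atom)) {
                        shared++;
                        break;
                    }
                }
            }
            System.out.println("ring " + i + ": size=" + ring.getAtomCount()
                               + ", shared atoms=" + shared);
        }
    }
}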

The tool will print out the hash codes of the rings recognized as maximally bridging and also generate an SVG depiction with the first such ring highlighted in red, as in the copaene depiction above. You can build a self-contained version of the tool as

git clone git@github.com:rajarshi/maxbridgerings.git
cd maxbridgerings
mvn clean package

The tool can then be run (with the depiction output to Copaene.svg)

java -jar target/MaximallyBridgingRings-1.0-jar-with-dependencies.jar \
  "CC(C)C1CCC2(C3C1C2CC=C3C)C" Copaene

Written by Rajarshi Guha

December 24th, 2015 at 4:10 am

Cleaning up a WAR in an SBT Build

without comments

Recently some of our software projects have begun using SBT, with the builds defined in Build.scala (rather than build.sbt). One of the challenges (apart from learning Scala to build a Java project!) was removing some unneeded JAR files from a WAR build. I use xsbt-web-plugin to provide support for WAR packaging. It turns out that the plugin provides a hook, webappPostProcess, to manipulate the web application before the WAR is generated. Using this hook, we can write some Scala that lists files matching a regex and then deletes the matching files

import scala.util.matching.Regex

webappPostProcess := { webappDir: File =>
  // collect every file under f whose name matches the regex r
  def recursiveListFiles(f: File, r: Regex): Array[File] = {
    val these = f.listFiles
    val good = these.filter(f => r.findFirstIn(f.getName).isDefined)
    good ++ these.filter(_.isDirectory).flatMap(recursiveListFiles(_, r))
  }
  // drop the javax JARs from WEB-INF/lib before the WAR is packaged
  recursiveListFiles(webappDir / "WEB-INF/lib/", """javax.*""".r).foreach(IO.delete)
}

You could place this in your web app settings Seq to make it available to all WAR builds, as sketched below.
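A minimal sketch of that idea (the project names are made up, and the exact import for webappPostProcess depends on your xsbt-web-plugin version):

import sbt._
import Keys._

object MyBuild extends Build {
  // settings shared by every WAR-producing project
  lazy val warCleanupSettings: Seq[Setting[_]] = Seq(
    webappPostProcess := { webappDir: File =>
      // drop the top-level javax JARs (the recursive helper above
      // could be reused here instead)
      (webappDir / "WEB-INF/lib").listFiles
        .filter(f => """javax.*""".r.findFirstIn(f.getName).isDefined)
        .foreach(IO.delete)
    }
  )

  lazy val webapp1 = Project("webapp1", file("webapp1")).settings(warCleanupSettings: _*)
  lazy val webapp2 = Project("webapp2", file("webapp2")).settings(warCleanupSettings: _*)
}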

Written by Rajarshi Guha

December 22nd, 2015 at 6:00 pm

Posted in software


A Model Building IDE?

with 3 comments

Recently I came across a NIPS 2015 paper from Vartak et al. that describes a system (APIs + visual frontend) to support the iterative model building process. The problem they are addressing is a common one in most machine learning settings – building multiple models (of different types) using various features and identifying one or more optimal models to take into production. As they correctly point out, most tools such as scikit-learn, SparkML, etc. focus on providing methods and interfaces to build a single model – it’s up to the user to manage the multiple models and keep track of their performance metrics.

[Figure: the SHERLOCK visual interface]

My first reaction was, “why?”. Part of this stems from my use of the R environment, which allows me to build up an infrastructure for building multiple models (e.g., caret, e1071), storing them (lists of model objects, RData binary files or even PMML) and subsequently comparing and summarizing them. Naturally, this is specific to me (so not really a general solution) and essentially a series of R commands – I don’t have the ability to monitor model building progress and simultaneously inspect models that have been built.

In that sense, the authors’ statement,

For the most part, exploration of models takes place sequentially. Data scientists must write (repeated) imperative code to explore models one after another.

is correct (though maybe said data scientists should improve their programming skills?). So a system that allows one to organize an exploration of model space could be useful. However, other statements, such as

  • Without a history of previously trained models, each model must be trained from scratch, wasting computation and increasing response times.
  • In the absence of history, comparison between models becomes challenging and requires the data scientist to write special-purpose code.

seem to be relevant only when you are not already working in an environment that supports model objects and their associated infrastructure. In my workflow, the list of model objects represents my history!

Their proposed system, called SHERLOCK, is built on top of SparkML and provides an API for model building, a database for model storage and a graphical interface (see above) to all of this. While I’m not convinced of some of the design goals – e.g., training variations of models based on previously trained models, say with different feature sets; wouldn’t you need to retrain the model from scratch if you choose a new feature set? – it’ll be interesting to see how this develops. Certainly, the UI will be key: it’s pretty easy to generate a multitude of models with a multitude of features, but in the end a human needs to make sense of it all.

On a separate note, this sounds like something RStudio could take a stab at.

Written by Rajarshi Guha

December 7th, 2015 at 3:09 am

Exploring ChEMBL Targets with Neo4j

without comments

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditional RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On its own this isn’t particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded into a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as follows (appropriately changing the host name, user and password):

$ mvn package
$ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \
       -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, let’s get the graph of the entire ChEMBL target classification hierarchy (ensuring that we don’t include actual proteins):

MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})
  RETURN r

(The various annotations such as TargetType and TargetFamily are based on my code.) When visualized we get

[Figure: the full ChEMBL target classification hierarchy]

Let’s get more specific and extract the kinase portion of the classification hierarchy:

MATCH (n {TargetType:'TargetFamily'}),
      (m {TargetID:'Kinase'}),
      p = shortestPath( (n)-[:ChildOf*]->(m) )
  RETURN p

[Figure: the kinase portion of the classification hierarchy]

Given that we’ve linked the proteins themselves to the target classes, we can now ask for all proteins that are kinases:

MATCH (m {TargetType:'MolecularTarget'}),
      (n {TargetID:'Kinase'}),
      p = shortestPath( (m)-[*]->(n) )
  RETURN m

Or identify the target classes that are linked to more than 25 proteins:

MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})
  WITH m, COUNT(r1) AS relCount
  WHERE relCount > 25
  RETURN m

which gives us a table of target classes and counts, part of which is shown below

[Figure: table of target classes with more than 25 linked proteins]

Overall this seems to be a very powerful platform for integrating data sources and types and for querying relationships effectively. The browser-based view is useful for practicing Cypher and for answering questions of the dataset. But a REST API is also available, along with tools such as Gremlin, allowing for much more flexible applications and sophisticated queries.
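For instance, assuming a default Neo4j 2.x install on localhost (the endpoint path varies across Neo4j versions), the kinase query above can be issued programmatically via the transactional HTTP endpoint:

curl -X POST http://localhost:7474/db/data/transaction/commit \
     -H 'Content-Type: application/json' \
     -d '{"statements": [{"statement": "MATCH (m {TargetType:\"MolecularTarget\"}), (n {TargetID:\"Kinase\"}), p = shortestPath( (m)-[*]->(n) ) RETURN m"}]}'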

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm