# So much to do, so little time

Trying to squeeze sense out of chemical data

## Cryptography & Chemical Structure Search

Encryption of chemical information has not been a very common topic in cheminformatics. There was an ACS symposium in 2005 (summary) that had a number of presentations on the topic of “safe exchange” of chemical information – i.e., exchanging information on chemical structures without sharing the structures themselves. The common thread running through many presentations was to identify representations (a.k.a. descriptors) that can be used for useful computation (e.g., regression or classification models, or similarity searches) but do not allow one to (easily) regenerate the structure. Examples include the use of PASS descriptors and various topological indices. Non-descriptor based approaches included surrogate data (that is, structures of related molecules with similar properties) and, most recently, scaffold networks. In addition, Masek et al, JCIM, 2008 described a procedure to assess the risk of revealing structure information given a set of descriptors.

As indicated by Tetko et al, descriptor based approaches are liable to dictionary based attacks: in theory, if one fully enumerated all possible molecules and computed their descriptors, it would be trivial to obtain the structure of an obfuscated molecule. While this is not currently practical, Masek et al have already shown that an evolutionary algorithm can reconstruct the exact (or a closely related) structure from BCUT descriptors in a reasonable time frame, and Wong & Burkowski, JCheminf, 2009 described a kernel approach to generating structures from a set of descriptors (though they were considering the inverse QSAR problem rather than chemical privacy). Until now I wasn’t aware of approaches that were truly one way – impossible to regenerate the structure from the descriptors, yet still able to support useful computation.
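To make the dictionary attack concrete, here is a minimal sketch (assuming RDKit is available; the short candidate list and the toy descriptor triple are stand-ins for a full enumeration and a published descriptor set): once the attacker tabulates descriptor vectors for an enumerated library, recovering a structure from its descriptors is just a table lookup.

```python
# A toy dictionary attack on descriptor-based obfuscation (assumes RDKit).
# The candidate list stands in for a full enumeration of chemical space.
from rdkit import Chem
from rdkit.Chem import Descriptors

candidates = ['CCO', 'c1ccccc1', 'CC(=O)O', 'CCN']  # hypothetical enumerated library

def descriptor_key(smiles):
    # A toy descriptor vector; a real attack would use whatever
    # descriptors were published alongside the obfuscated data.
    mol = Chem.MolFromSmiles(smiles)
    return (round(Descriptors.MolWt(mol), 2),
            round(Descriptors.TPSA(mol), 2),
            round(Descriptors.MolLogP(mol), 2))

lookup = dict((descriptor_key(smi), smi) for smi in candidates)

# Given only the descriptor vector of an "obfuscated" molecule,
# the structure falls out of a hash lookup.
print(lookup.get(descriptor_key('CCO')))  # -> CCO
```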

Which brings me to an interesting paper by Shimizu et al which describes a cryptographic approach to chemical structure search, based on homomorphic encryption. A homomorphic encryption scheme allows one to perform computations on the encrypted (usually public-key encrypted) input, leading to an encrypted result which, when decrypted, is the same as if one had performed the computation on the clear (i.e., unencrypted) input. Now, a “computation” can involve a variety of operations – addition, multiplication and so on. Until recently, most homomorphic schemes were restricted to one or a few operations (and so are termed partially homomorphic). It was only in 2009 that a practical proposal for a fully homomorphic (i.e., supporting arbitrary computations) cryptosystem was described. See this excellent blog post for more details on homomorphic cryptosystems.
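As a concrete illustration of the additive case, here is a minimal sketch using the python-paillier (phe) package – not the authors’ implementation, just the core property such protocols rely on: sums can be computed directly on ciphertexts.

```python
# Additive homomorphism with Paillier (assumes the 'phe' package, i.e.
# python-paillier, is installed; this is not the paper's implementation).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The client encrypts two values and ships only the ciphertexts.
enc_a = public_key.encrypt(17)
enc_b = public_key.encrypt(25)

# The server adds the ciphertexts without ever seeing 17 or 25 ...
enc_sum = enc_a + enc_b

# ... and only the key holder can decrypt the result.
print(private_key.decrypt(enc_sum))  # -> 42
```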

The work by Shimizu et al addresses the specific case of a user trying to identify molecules from a database that are similar to a query structure. They consider a simplified situation where the user is only interested in the count of molecules above a similarity threshold. Two constraints are:

1. The database should not learn the actual query structure
2. The user should not gain information about the database contents (except for the number of similar molecules)

Their scheme is based on an additively homomorphic system (i.e., the only operation supported on the encrypted data is addition) and employs binary fingerprints and the Tversky similarity metric (which can be reduced to Tanimoto if required). I note that they used 166-bit MACCS keys. Since the fingerprint is small and each bit position is known, it seems that some information could leak out of the encrypted fingerprint or be subject to a dictionary attack, and I’d have expected that a larger hashed fingerprint would improve security (though I suspect that the encryption of the query fingerprint alleviates this issue). Another interesting feature, designed to prevent information about the database leaking back to the user, is the use of “dummies” – random, encrypted (by the user’s public key) integers that are mixed with the true (encrypted) query result. Their design allows the user to determine the sign of the query result (which indicates whether the database molecule is similar to the query, above the specified threshold), but does not let them obtain the actual similarity score. They show that as the number of dummies is increased, the chance of database information leaking out tends towards zero.
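For reference, here is a minimal sketch of the (plaintext) Tversky computation on binary fingerprints – the quantity the protocol evaluates under encryption – showing the reduction to Tanimoto at α = β = 1. The toy bit sets below are invented for illustration.

```python
# Tversky similarity on binary fingerprints, represented as sets of on-bits.
# With alpha = beta = 1 this reduces to the Tanimoto coefficient.
def tversky(fp_a, fp_b, alpha=1.0, beta=1.0):
    c = len(fp_a & fp_b)        # bits set in both fingerprints
    a_only = len(fp_a - fp_b)   # bits only in A
    b_only = len(fp_b - fp_a)   # bits only in B
    return c / (alpha * a_only + beta * b_only + c)

# Toy 'fingerprints' as sets of on-bit positions (MACCS keys would use 0..165)
query = {1, 5, 9, 42, 77}
target = {1, 5, 9, 42, 100, 120}
print(tversky(query, target))            # Tanimoto: 4/7 ~ 0.571
print(tversky(query, target, 0.9, 0.1))  # asymmetric Tversky
```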

Of course, one could argue that the limited usage of proprietary chemical information (in terms of the people who have it and the people who can make use of it) means that the effort put into obfuscation, cryptography and so on could simply be replaced by legal contracts. Certainly, a simple way to address the scenario discussed here (and noted by the authors) is to download the remote database locally – though this is not feasible if the remote database is meant to stay private (e.g., a competitor’s structure database).

But nonetheless, methods that rigorously guarantee the privacy of chemical information are interesting from an algorithmic standpoint. Even though Shimizu et al describe a very simplistic situation (and the more realistic scenario, in which the similar database molecules themselves are returned, would obviously negate constraint 2 above), it looks like a step forward in applying formal cryptanalysis to chemical problems and supporting truly safe exchange of chemical information.

Written by Rajarshi Guha

January 5th, 2016 at 3:17 am

## Exploring ChEMBL Targets with Neo4j

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditional RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On its own this is not particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

### Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded into a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as follows (appropriately changing host name, user and password):

```
$ mvn package
$ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \
       -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db
```

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, let’s get the graph of the entire ChEMBL target classification hierarchy (ensuring that we don’t include actual proteins):

```
MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})
RETURN r
```

(The various annotations such as TargetType and TargetFamily come from my loading code.) When visualized, we get the hierarchy shown below.

Let’s get more specific and extract the kinase portion of the classification hierarchy:

```
MATCH (n {TargetType:'TargetFamily'}),
      (m {TargetID:'Kinase'}),
      p = shortestPath( (n)-[:ChildOf*]->(m) )
RETURN p
```

Given that we’ve linked the proteins themselves to the target classes, we can now ask for all proteins that are kinases:

```
MATCH (m {TargetType:'MolecularTarget'}),
      (n {TargetID:'Kinase'}),
      p = shortestPath( (m)-[*]->(n) )
RETURN m
```

Or identify the target classes that are linked to more than 25 proteins:

```
MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})
WITH m, COUNT(r1) AS relCount
WHERE relCount > 25
RETURN m
```

which gives us a table of target classes and counts, part of which is shown below

Overall this seems to be a very powerful platform for integrating data sources and types and effectively querying for relationships. The browser-based view is useful for practicing Cypher and asking questions of the dataset, but a REST API is also available, along with other tools such as Gremlin, that allow for much more flexible applications and sophisticated queries.
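As a taste of programmatic access, here is a minimal sketch of running the kinase query from Python – assuming a recent Neo4j instance with the Bolt protocol enabled, the official neo4j Python driver, and the graph.db built above (the credentials and the TargetID return property are placeholders).

```python
# A minimal sketch using the official 'neo4j' Python driver; any REST or
# Bolt client would do equally well. Credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

query = """
MATCH (m {TargetType:'MolecularTarget'}),
      (n {TargetID:'Kinase'}),
      p = shortestPath( (m)-[*]->(n) )
RETURN m.TargetID AS target
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["target"])
driver.close()
```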

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm

## Substructure Searches – High Speed, Large Scale

My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the details of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).

It is blazingly fast.

I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000 structure subset is available here. For each query structure I recorded the number of hits, the time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.
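The harness itself was simple; the sketch below shows the general shape of it. Note that the endpoint URL, the query parameter name and the JSON response format are all hypothetical stand-ins – substitute whatever the service actually exposes.

```python
# A sketch of the benchmark loop; ENDPOINT and the response handling are
# hypothetical stand-ins for the actual REST interface.
import time
import requests

ENDPOINT = 'http://host.name/search'  # hypothetical URL

def benchmark(smiles_list):
    rows = []
    for smi in smiles_list:
        start = time.time()
        resp = requests.get(ENDPOINT, params={'q': smi})
        elapsed = time.time() - start
        nhits = len(resp.json())  # assumes the service returns a JSON list of hits
        rows.append((smi, nhits, elapsed))
    return rows
```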

The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.

The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops off with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while the relationship is not fully linear, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) or network transfer times.

Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these points are not meaningful (e.g., query structures with 35, 36, … atoms) as there was just a single query structure with that number of atoms.

Overall, very impressive. And it’s something you can play with yourself.

Written by Rajarshi Guha

November 23rd, 2011 at 1:09 am

## HTS and Message Queues

In my previous post I discussed how we’d like to automate some of our screens – starting from the primary screen, going through data processing and compound selection, and completing the secondary (follow-up) screen. A key feature of such a workflow is the asynchronous nature of the individual steps. Messaging and message queues (MQ) provide an excellent approach to handling this type of problem.

### Message queue systems

A number of such MQ systems are available, such as ActiveMQ, RabbitMQ and so on. See here for a comparison of different MQ systems. Given that we already use Oracle for our backend databases, we use Oracle Advanced Queuing (AQ). One advantage of this is that we can store the messages in the database, allowing us to keep a history of a screen as well as use SQL queries to retrieve messages if desired. Such storage can obviously slow things down, but our message throughput is low enough that it doesn’t matter for us.

In this post I’ll briefly describe how I set up a queue on the database side and show the code for a Java application to send a message to the queue and retrieve a message from the queue. The example will actually use the JMS API, which Oracle AQ implements. As a result, the code can trivially swap out AQ for any other JMS implementation.

### Creating queues & tables

The first step is to create a queue table and some queues in the database. The PL/SQL to generate these is:

```sql
BEGIN
  DBMS_AQADM.create_queue_table(
    queue_table => 'test_qt',
    queue_payload_type => 'SYS.AQ$_JMS_MESSAGE');

  DBMS_AQADM.create_queue(
    queue_table => 'test_qt',
    queue_name => 'input_q',
    retention_time => DBMS_AQADM.INFINITE);

  DBMS_AQADM.start_queue('input_q');
END;
/
quit
```

So we’ve created a queue table called test_qt which will hold a queue called input_q. The plan is that we’ll have a process listening on this queue, processing each message as it comes in, and another process that will send a specified number of messages to the queue. The queue_payload_type argument to the create call indicates that we can store any of the standard JMS message types (though we’ll be focusing on the text message type). We’ve also specified that for the input_q queue, messages will be retained in the database indefinitely. This is useful for debugging and auditing purposes.

### Message producers & consumers

OK, with the queues set up, we can now write some Java code to send messages and receive them. In this example, the receiving code will actually run continuously, blocking until messages are received. The example class extends TimerTask: the strategy is that when the listener receives a message, it will create a new instance of this task and schedule it immediately on a new thread. As a result, the message processing logic is contained within the run method. At this stage, we only consider messages that are of type TextMessage. If that’s the case, we simply extract the payload of the message and print it to STDOUT. You’ll note that we also create a unique listener ID and include that in the output. This is handy when we run multiple listeners and want to check that messages are being received by all of them.

```java
public class QueueExample extends TimerTask {
    static final String URL = "jdbc:oracle:thin:USER/PASSWD@HOST:PORT:SID";

    private Message mesg;

    /* Useful to differentiate between multiple instances of the listener */
    private static final String listenerID = UUID.randomUUID().toString();

    static final String schema = "wtc";
    static final String qTable = "test_qt";
    static final String qName = "input_q";

    static QueueConnection con = null;
    static QueueSession sess = null;
    static javax.jms.Queue q = null;

    public QueueExample(Message m) {
        mesg = m;
    }

    public void run() {
        try {
            if (!(mesg instanceof TextMessage)) return;
            String payload = ((TextMessage) mesg).getText();
            System.out.println(listenerID + ": Got msg: " + payload);
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
```

Before looking at sending and receiving messages, we need to initialize the connection to the message queue:

```java
private static void initializeQueue() throws JMSException {
    QueueConnectionFactory queue = AQjmsFactory.getQueueConnectionFactory(URL, new Properties());
    // assign to the static field (rather than a shadowing local variable)
    // so that listener() can close the connection later
    con = (QueueConnection) queue.createConnection();
    con.start();
    sess = (QueueSession) con.createSession(false, Session.AUTO_ACKNOWLEDGE);
    AQQueueTable qtab = ((AQjmsSession) sess).getQueueTable(schema, qTable);
    try {
        q = ((AQjmsSession) sess).getQueue(schema, qName);
    } catch (Exception ex) {
        AQjmsDestinationProperty props = new AQjmsDestinationProperty();
        q = ((AQjmsSession) sess).createQueue(qtab, qName, props);
    }
}
```

The next step is to listen for messages and dispatch them for processing. The method below initializes the queue if it isn’t already initialized. After creating a consumer object, we simply wait for messages to come in.
The receive method is blocking, so the program will wait for the next message. Once a message is received, it creates an instance of this class and schedules it – when the thread starts, the run method will execute to process the message.

```java
public static void listener() throws JMSException {
    if (q == null) initializeQueue();
    System.out.println(listenerID + ": Listening on queue " + q.getQueueName() + "...");
    MessageConsumer consumer = sess.createConsumer(q);

    // each time we get a message, start up the message handler in a new thread
    for (Message m; (m = consumer.receive()) != null;) {
        new Timer().schedule(new QueueExample(m), 0);
    }

    sess.close();
    con.close();
}
```

The final component is to send messages. For this simple example, it’s primarily boilerplate code. In this case, we specify how many messages to send. DeliveryMode.PERSISTENT indicates that the messages will be stored (in this case in the DB) until a consumer has received them. Note that after receipt by a consumer, the message may or may not be stored in the database. See here for more details. In the code below, we can set a variety of properties on the message. For example, we’ve set an “application id” (the JMSXAppID property) and a correlation id. Right now we ignore these, but they can be used to link messages or even link a message to an external resource (though that could also be done via the payload itself). Another useful property that could be set is the message type, via setJMSType. Using this, one can assign a MIME type to a message, allowing the message processing code to conditionally handle the message based on its type. For more details on the various properties that can be set, see the Message documentation.

```java
public static void sender(int n) throws JMSException {
    if (q == null) initializeQueue();
    MessageProducer producer = sess.createProducer(q);
    producer.setDeliveryMode(DeliveryMode.PERSISTENT);

    Message msg;
    for (int i = 0; i < n; i++) {
        msg = sess.createTextMessage();
        msg.setStringProperty("JMSXAppID", "QueueExample");
        msg.setJMSCorrelationID(UUID.randomUUID().toString());
        ((TextMessage) msg).setText("This is message number " + i);
        producer.send(msg);
    }

    producer.close();
    sess.close();
}
```

### Running

The complete source code can be found here. To compile it you’ll need an OJDBC jar file as well as the following jar files (that come with the Oracle installation):

- $ORACLE_HOME/rdbms/jlib/aqapi.jar
- $ORACLE_HOME/rdbms/jlib/jmscommon.jar
- $ORACLE_HOME/jlib/jndi.jar
- $ORACLE_HOME/jlib/jta.jar
- $ORACLE_HOME/rdbms/jlib/xdb.jar
- $ORACLE_HOME/lib/xmlparserv2.jar

Once the code has been compiled to a jar file, we first start the listener:

```
guhar$ java -jar dist/qex.jar listen
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Listening on queue input_q...
```

In another terminal, we send some messages:

```
guhar$ java -jar dist/qex.jar send 5
```

Switching to the previous terminal we should see something like

```
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 0
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 1
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 2
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 3
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 4
```

The fun starts when we instantiate multiple listeners (possibly on different machines). It’s simple enough to execute the first invocation above multiple times and watch the output as we send more messages. If you send 10 messages, you should see that some are handled by one listener, the remainder by another one, and so on. If the actual message processing is compute intensive, this allows you to easily distribute such loads.

### Next steps

The code discussed here is a minimalistic example of sending and receiving messages from a queue. In the next post, I’ll discuss how we can represent messages in the database using a custom message type (defined in terms of an Oracle ADT) and send and receive such messages using Java. Such custom message types allow the Java code to remain object oriented, with the AQ libraries handling serialization and deserialization of the messages between our code and the queue.

One of the downsides that I see with Oracle AQ is that the only clients supported are PL/SQL, C and Java. While AQ implements the JMS API, it employs its own wire protocol. The lack of support for AMQP means that a lot of client libraries in other languages cannot be used to send or retrieve messages from AQ. If anybody knows of Python packages that work with Oracle AQ I’d love to hear about them. (Looks like stomppy might support AQ?)

Written by Rajarshi Guha

July 11th, 2010 at 9:00 pm

## Molecules & MongoDB – Numbers and Thoughts

In my previous post I had mentioned that key/value or non-relational data stores could be useful in certain cheminformatics applications. I had started playing around with MongoDB and, following Rich’s example, I thought I’d put it through its paces using data from PubChem. Installing MongoDB was pretty trivial: I downloaded the 64-bit version for OS X, unpacked it and then simply started the server process:

```
$MONGO_HOME/bin/mongod --dbpath=$HOME/src/mdb/db
```

where $HOME/src/mdb/db is the directory in which the database will store the actual data. The simplicity is certainly nice. Next, I needed the Python bindings, which were quite painless to install with easy_install. At this point I had everything in hand to start playing with MongoDB.

### Getting data

The first step was to get some data from PubChem. This is pretty easy via their FTP site. I was a bit lazy, so I just made calls to wget rather than use ftplib. The code below will retrieve the first 80 PubChem SD files and uncompress them into the current directory.

```python
import glob, sys, os, time, random, urllib

def getfiles():
    n = 0
    nmax = 80
    base = 'ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/'
    # retrieve the FTP directory listing, one file per line
    for line in urllib.urlopen(base).read().split('\n'):
        if not line.strip(): continue
        fname = line.strip().split()[5]
        os.system('wget %s%s' % (base, fname))
        os.system('gzip -d %s' % (fname))
        n += 1
        sys.stdout.write('Got n = %d, %s\r' % (n, fname))
        sys.stdout.flush()
        if n == nmax: return
```

This gives us a total of 1,641,250 molecules.

With the MongoDB instance running, we’re ready to connect and insert records into it. For this test, I simply loop over each molecule in each SD file and create a record consisting of the PubChem CID and all the SD tags for that molecule. In this context a record is simply a Python dict, with the SD tags being the keys and the tag values being the values. Since I know the PubChem CID is unique in this collection, I set the special document key “_id” (essentially, the primary key) to the CID. The code to perform this uses the Python bindings to OpenBabel:

```python
from openbabel import *
import glob, sys, os
from pymongo import Connection
from pymongo import DESCENDING

def loadDB(recreate = True):
    conn = Connection()
    db = conn.chem
    if 'mol2d' in db.collection_names():
        if recreate:
            print 'Deleting mol2d collection'
            db.drop_collection('mol2d')
        else:
            print 'mol2d exists. Will not reload data'
            return
    coll = db.mol2d

    obconversion = OBConversion()
    obconversion.SetInFormat("sdf")
    obmol = OBMol()

    n = 0
    files = glob.glob("*.sdf")
    for f in files:
        notatend = obconversion.ReadFile(obmol, f)
        while notatend:
            # build a document from the SD tags of this molecule
            doc = {}
            sdd = [toPairData(x) for x in obmol.GetData() if x.GetDataType() == PairData]
            for entry in sdd:
                key = entry.GetAttribute()
                value = entry.GetValue()
                doc[key] = value
            doc['_id'] = obmol.GetTitle()
            coll.insert(doc)

            obmol = OBMol()
            notatend = obconversion.Read(obmol)

            n += 1
            if n % 100 == 0:
                sys.stdout.write('Processed %d\r' % (n))
                sys.stdout.flush()

    print 'Processed %d molecules' % (n)

    coll.create_index([ ('PUBCHEM_HEAVY_ATOM_COUNT', DESCENDING) ])
    coll.create_index([ ('PUBCHEM_MOLECULAR_WEIGHT', DESCENDING) ])
```

Note that this example loads each molecule on its own and takes a total of 2015.020 sec. It has been noted that bulk loading (i.e., inserting a list of documents rather than individual documents) can be more efficient. I tried this, loading 1000 molecules at a time, but this time round the load time was 2224.691 sec – certainly not an improvement!
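For reference, the batched version amounts to accumulating documents in a list and handing the whole list to insert(). A sketch against the same pymongo 1.x API used above – here `molecules` and `make_doc` are hypothetical stand-ins for the per-molecule loop and dict construction in the loader:

```python
# Sketch of batched loading: pymongo's insert() accepts a list of documents.
batch, batch_size = [], 1000
for mol in molecules:              # hypothetical iterable of parsed molecules
    batch.append(make_doc(mol))    # hypothetical per-molecule dict builder
    if len(batch) == batch_size:
        coll.insert(batch)
        batch = []
if batch:                          # don't forget the final partial batch
    coll.insert(batch)
```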

Note that the “_id” key is a primary key and thus queries on this field are extremely fast. MongoDB also supports secondary indexes, and the code above creates indexes on the PUBCHEM_HEAVY_ATOM_COUNT and PUBCHEM_MOLECULAR_WEIGHT fields.

### Queries

The simplest query is to pull up records based on CID. I selected 8000 CIDs randomly and evaluated how long it’d take to pull up the records from the database:

```python
from pymongo import Connection

def timeQueryByCID(cids):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    for cid in cids:
        result = coll.find( {'_id' : cid} ).explain()
```

The above code takes 2351.95 ms, averaged over 5 runs. This comes out to about 0.3 ms per query. Not bad!

Next, let’s look at queries that use the heavy atom count field that we had indexed. For this test I selected 30 heavy atom count values randomly, and for each value performed the query. I retrieved the query time as well as the number of hits via explain().

```python
from pymongo import Connection

def timeQueryByHeavyAtom(natom):
    conn = Connection()
    db = conn.chem
    coll = db.mol2d
    o = open('time-natom.txt', 'w')
    for i in natom:
        c = coll.find( {'PUBCHEM_HEAVY_ATOM_COUNT' : i} ).explain()
        nresult = c['n']
        elapse = c['millis']
        o.write('%d\t%d\t%f\n' % (i, nresult, elapse))
    o.close()
```

A summary of these queries is shown in the graphs below.

One of the queries is anomalous – there are 93K molecules with 24 heavy atoms, yet the query completes in 139 ms. This might be due to the cache having been primed while I was testing the code.

### Some thoughts

One thing that was apparent from the little I’ve played with MongoDB is that it’s extremely easy to use. I’m sure that larger installs (say on a cluster) could be more complex, but for single user apps, setup is really trivial. Furthermore, basic operations like insertion and querying are extremely easy. The idea of being able to dump any type of data (as a document) without worrying whether it will fit into a pre-defined schema is a lot of fun.

However, its advantages also seem to be its limitations (though this is not specific to MongoDB). This was also noted in a comment on my previous post. It seems that MongoDB is very efficient for simple queries. One of the things that I haven’t properly worked out is whether this type of system makes sense for a molecule-centric database. The primary reason is that molecules can be referred to by a variety of identifiers. For example, when searching PubChem, a query by CID is just one of the ways one might pull up data. As a result, any database holding this type of data will likely require multiple indices – so why not stay with an RDBMS? Furthermore, in my previous post I had mentioned that a cool feature would be the ability to dump molecules from arbitrary sources into the DB without worrying about fields. While very handy when loading data, this does present some complexities at query time: how does one perform a query over all molecules? This can be addressed in multiple ways (registration etc.) but is essentially what must be done in an RDBMS scenario.

Another thing that became apparent is that MongoDB and its ilk don’t support JOINs. While the current example doesn’t really highlight this, it is easy to imagine adding, say, bioassay data and then wanting to query across both collections with a JOIN. In contrast, the NoSQL approach is to perform multiple queries and then do the join in your own code, as sketched below. This seems inelegant and a bit painful (at least for the types of applications that I work with).
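In other words, what a single SQL JOIN would express becomes two round trips and a merge in application code – something like this hypothetical sketch, assuming an `assay` collection whose documents carry `cid` and `outcome` fields:

```python
# Hypothetical application-side join: find active CIDs in an 'assay'
# collection, then fetch the corresponding molecule documents.
active_cids = [doc['cid'] for doc in db.assay.find({'outcome': 'active'})]
molecules = db.mol2d.find({'_id': {'$in': active_cids}})
for mol in molecules:
    print mol['_id'], mol.get('PUBCHEM_MOLECULAR_WEIGHT')
```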

Finally, one of my interests was to make use of the map/reduce functionality in MongoDB. However, it appears that such queries must be implemented in JavaScript. As a result, performing cheminformatics operations (using some other language or external libraries) within map or reduce functions is not currently possible.
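For completeness, here is what that looks like – a sketch counting molecules per heavy atom count, with the map and reduce steps necessarily written as JavaScript strings (assuming a pymongo version that exposes map_reduce; `hac_counts` is a hypothetical output collection name):

```python
# Map/reduce from pymongo: the map and reduce functions are JavaScript
# strings, which is exactly the limitation described above.
from bson.code import Code

mapf = Code("function() { emit(this.PUBCHEM_HEAVY_ATOM_COUNT, 1); }")
reducef = Code("""function(key, values) {
    var total = 0;
    values.forEach(function(v) { total += v; });
    return total;
}""")

result = coll.map_reduce(mapf, reducef, "hac_counts")
for doc in result.find().limit(5):
    print doc
```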

But of course, NoSQL DBs were not designed to replace RDBMSs. Both technologies have their place, and I don’t believe that one is better than the other – just that one might be better suited to a given application than the other.

Written by Rajarshi Guha

February 8th, 2010 at 2:18 am