So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘database’ tag

Database Licensing & Sustainability

without comments

A few days back I came across, via my Twitter network, the news that DrugBank had changed their licensing policy to CC BY-NC-SA 4.0. By itself this is not a remarkable change (though one could argue about the NC clause, since, as John Overington points out, the distinction between commercial and non-commercial usage can be murky). However, on top of this license, the EULA listed a number of more restrictive conditions on reuse of the data. See this thread on ThinkLab for a more detailed discussion and breakdown.

This led to discussion amongst a variety of people regarding the sustainability of data resources. In this case, while DrugBank was (and is) funded by federal grants, these are not guaranteed in perpetuity. Thus DrugBank, and indeed any resource, needs a plan to sustain itself. Charging for commercial access is one such approach. While it can be problematic for reuse and for other Open projects, one cannot fault the developers if they choose a path that enables them to continue building upon their work.

Interestingly, the Guide to Pharmacology resource posted a response to the DrugBank license change, in which they don’t comment on the DrugBank decision but do point out that

The British Pharmacological Society (BPS) has committed support for GtoPdb until 2020 and the Wellcome Trust support for GtoImmuPdb until 2018. Needless to say the management team (between, IUPHAR, BPS and the University of Edinburgh) are engaged in sustainability planning beyond those dates. We have also just applied for UK ELIXIR Node consideration.

So it’s nice to see that the resource is completely free of any onerous restrictions until 2020. I have no doubt that the management team will be working hard to secure funding beyond that date. But in case they don’t, will their licensing also change to support some form of commercialization? Certainly, other resources are going down that path. John Overington pointed to BioCyc switching to a subscription model.

So the sustainability of data resources is an ongoing problem, and will become a bigger issue as the links between resources grow over time. Economic considerations would suggest that permanent funding of every database cannot happen.

So clearly, some resources will win and some will lose, and the winners will not stay winners forever.

Open source software & transferring leadership

However, in contrast to databases, many Open Source software projects do continue development over long time periods. Some of these projects receive public funding and also provide dual licensing options, allowing for income from industrial users.

However there are others which are not heavily funded, yet continue to develop. My favorite example is Jmol, which has been in existence for more than 15 years and has remained completely Open Source. One of the key features of this project is that the leadership has passed from one individual to another over the years – starting, I think, with Dan Gezelter, then Bradley Smith, Egon Willighagen, Miguel Rojas and currently Bob Hanson.

Comparing Open software to Open databases is not an exact parallel. But this notion of leadership transition is something that could play a useful role in sustaining databases. Thus, if group X cannot raise funding for continued development, maybe group Y, which obviously benefits from the database and has funding of its own, could take over development and maintenance.

There are obvious reasons that this won’t work – maybe the expertise resides only in group X? I doubt this is really an issue, at least for non-niche databases. One could also argue that this approach is a sort of proto-crowdsourcing approach. While crowdsourcing did come up in the Twitter thread, I’m not convinced this is a scalable approach to sustainability. The “diffuse motivation” of a crowd is quite distinct from the “focused motivation” of a dedicated group. And on top of that, many databases are specialized and the relevant crowd is rather small.

One ultimate solution is that governments host databases in perpetuity. This raises a myriad of issues. Does it imply storage and no development? Is this for all publicly funded databases? Or a subset? Who are the chosen ones? And of course, how long will the government pay for it? The NIH Commons, while not designed for database persistence, is one prototypical infrastructure that could start addressing these questions.

In conclusion, the issue of database sustainability is problematic and unsolved, and the problem is only going to get worse. While unfortunate for Open science (and science in general), the commercialization of databases will always be a possibility. One hopes that in such cases a balance will be struck between income and free (re)usage of these valuable resources.

Written by Rajarshi Guha

May 14th, 2016 at 7:26 pm

Cryptography & Chemical Structure Search

without comments

Encryption of chemical information has not been a very common topic in cheminformatics. There was an ACS symposium in 2005 (summary) that had a number of presentations on the topic of “safe exchange” of chemical information – i.e., exchanging information on chemical structures without sharing the structures themselves. The common thread running through many presentations was to identify representations (a.k.a. descriptors) that can be used for useful computation (e.g., regression or classification models or similarity searches) but do not allow one to (easily) regenerate the structure. Examples include the use of PASS descriptors and various topological indices. Non-descriptor based approaches included surrogate data (that is, structures of related molecules with similar properties) and, most recently, scaffold networks. Also, Masek et al, JCIM, 2008 described a procedure to assess the risk of revealing structure information given a set of descriptors.

As indicated by Tetko et al, descriptor based approaches are liable to dictionary attacks: if one could fully enumerate all possible molecules and compute their descriptors, it would be trivial to obtain the structure of an obfuscated molecule. While this is not currently practical, Masek et al have already shown that an evolutionary algorithm can reconstruct the exact (or a closely related) structure from BCUT descriptors in a reasonable time frame, and Wong & Burkowski, JCheminf, 2009 described a kernel approach to generating structures from a set of descriptors (though they were considering the inverse QSAR problem rather than chemical privacy). Until now I wasn’t aware of approaches that were truly one-way – impossible to regenerate the structure from the descriptors, yet still able to perform useful computations.

Which brings me to an interesting paper by Shimizu et al which describes a cryptographic approach to chemical structure search, based on homomorphic encryption. A homomorphic encryption scheme allows one to perform computations on the encrypted (usually public-key encrypted) input, leading to an encrypted result which, when decrypted, gives the same result as if one had performed the computation on the clear (i.e., unencrypted) input. Now, a “computation” can involve a variety of operations – addition, multiplication etc. Until recently, most homomorphic schemes were restricted to one or a few operations (and so are termed partially homomorphic). It was only in 2009 that a practical proposal for a fully homomorphic (i.e., supporting arbitrary computations) cryptosystem was described. See this excellent blog post for more details on homomorphic cryptosystems.

The work by Shimizu et al addresses the specific case of a user trying to identify molecules from a database that are similar to a query structure. They consider a simplified situation where the user is only interested in the count of molecules above a similarity threshold. The two constraints are:

  1. Ensure that the database does not know the actual query structure
  2. The user should not gain information about the database contents (except for the number of similar molecules)

Their scheme is based on an additive homomorphic system (i.e., the only operation supported on the encrypted data is addition) and employs binary fingerprints and the Tversky similarity metric (which can be reduced to Tanimoto if required). I note that they used 166-bit MACCS keys. Since the fingerprint is small and each bit position is known, it seems that some information could leak out of the encrypted fingerprint or be subject to a dictionary attack; I’d have expected that using a larger hashed fingerprint would improve the security. (Though I suspect that the encryption of the query fingerprint alleviates this issue.) Another interesting feature, designed to prevent information about the database leaking back to the user, is the use of “dummies” – random integers, encrypted by the user’s public key, that are mixed with the true (encrypted) query result. Their design allows the user to determine the sign of the query result (which indicates whether the database molecule is similar to the query, above the specified threshold), but does not let them get the actual similarity score. They show that as the number of dummies is increased, the chances of database information leaking out tend towards zero.
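
To make the additive homomorphic building block concrete, here is a minimal sketch in Java. It uses textbook Paillier encryption with toy key sizes – not the actual cryptosystem from the paper – and shows how a server can compute an encryption of the intersection count |A ∩ B| (the quantity at the heart of the Tversky and Tanimoto expressions) from a bitwise-encrypted query fingerprint, without ever seeing the query bits.

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierFingerprint {
    static final SecureRandom rnd = new SecureRandom();
    static BigInteger n, n2, g, lambda, mu;

    // Toy key generation; a real implementation needs proper parameter checks
    static void keyGen(int bits) {
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        n2 = n.multiply(n);
        g = n.add(BigInteger.ONE);                       // standard choice g = n + 1
        lambda = p.subtract(BigInteger.ONE)
                  .multiply(q.subtract(BigInteger.ONE)); // phi(n)
        mu = lambda.modInverse(n);
    }

    // Enc(m) = g^m * r^n mod n^2; multiplying ciphertexts adds plaintexts
    static BigInteger encrypt(BigInteger m) {
        BigInteger r = new BigInteger(n.bitLength() - 1, rnd); // should also check gcd(r, n) = 1
        return g.modPow(m, n2).multiply(r.modPow(n, n2)).mod(n2);
    }

    static BigInteger decrypt(BigInteger c) {
        BigInteger x = c.modPow(lambda, n2).subtract(BigInteger.ONE).divide(n);
        return x.multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        keyGen(512);
        int[] userFp = {1, 0, 1, 1, 0, 1, 0, 0}; // user's secret query fingerprint
        int[] dbFp   = {1, 1, 1, 0, 0, 1, 1, 0}; // database molecule, in the clear

        // The user encrypts each bit and ships the ciphertexts to the server
        BigInteger[] encBits = new BigInteger[userFp.length];
        for (int i = 0; i < userFp.length; i++)
            encBits[i] = encrypt(BigInteger.valueOf(userFp[i]));

        // Server side: multiplying the ciphertexts at the database molecule's
        // on-bits yields Enc(sum of query bits at those positions) = Enc(|A ∩ B|)
        BigInteger encCount = encrypt(BigInteger.ZERO);
        for (int i = 0; i < dbFp.length; i++)
            if (dbFp[i] == 1)
                encCount = encCount.multiply(encBits[i]).mod(n2);

        System.out.println("|A ∩ B| = " + decrypt(encCount)); // prints 3
    }
}

Since |A| is known to the user and |B| to the database, the remaining terms of the Tversky expression can be assembled from this encrypted count in a similar fashion.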

Of course, one could argue that the limited usage of proprietary chemical information (in terms of people who have it and people who can make use of it) means that the efforts put into obfuscation, cryptography etc. could simply be replaced by legal contracts. Certainly, a simple way to address the scenario discussed here (and noted by the authors) is to download the remote database locally. Of course, this is not feasible if the remote database is meant to stay private (e.g., a competitor’s structure database).

But nonetheless, methods that rigorously guarantee privacy of chemical information are interesting from an algorithmic standpoint. Even though Shimizu et al describe a very simplistic situation (and the more realistic scenario, where the similar database molecules themselves are returned, would obviously negate constraint 2 above), it looks like a step forward in terms of applying formal cryptanalysis to chemical problems and supporting truly safe exchange of chemical information.

Written by Rajarshi Guha

January 5th, 2016 at 3:17 am

Exploring ChEMBL Targets with Neo4j

without comments

As part of an internal project I’ve recently started working with Neo4j for representing and querying relationships between entities (targets, compounds, etc.). What has really caught my attention is the Cypher graph query language – by allowing you to construct queries using graph notation, many tasks that would be complex or tedious in a traditional RDBMS become much easier.

As an example, I loaded the ChEMBL target hierarchy and the targets as a graph. On its own it’s not particularly useful – the real utility arises when other datasets (and datatypes) are linked to the targets. But even at this stage, one can easily ask questions such as

Find all kinase proteins

which is simply a matter of identifying proteins that have a direct path to the Kinase target class.

Assuming you have ChEMBL loaded into a MySQL database, you can generate a Neo4j graph database containing the targets and classification hierarchy using code from the neo4jexpt repository. Simply compile and run as follows (appropriately changing host name, user and password)

$ mvn package
$ java -Djdbc.url="jdbc:mysql://host.name/chembl_20?user=USER&password=PASS" \
       -jar target/neo4j-ctl-1.0-SNAPSHOT.jar graph.db

Once complete, you should see a folder named graph.db. Using the Neo4j application you can then explore the graph in your browser by executing Cypher queries. For example, let’s get the graph of the entire ChEMBL target classification hierarchy (ensuring that we don’t include actual proteins)

MATCH (n {TargetType:'TargetFamily'})-[r]-(m {TargetType:'TargetFamily'})
  RETURN r

(The various annotations such as TargetType and TargetFamily are based on my code.) When visualized we get

[Figure: the full ChEMBL target classification hierarchy, visualized in the Neo4j browser]

Let’s get more specific and extract the kinase portion of the classification hierarchy

MATCH (n {TargetType:'TargetFamily'}),
      (m {TargetID:'Kinase'}),
      p = shortestPath( (n)-[:ChildOf*]->(m) )
  RETURN p

[Figure: the kinase portion of the target classification hierarchy]

Given that we’ve linked the proteins themselves to the target classes, we can now ask for all proteins that are kinases

MATCH (m {TargetType:'MolecularTarget'}),
      (n {TargetID:'Kinase'}),
      p = shortestPath( (m)-[*]->(n) )
  RETURN m

Or identify the target classes that are linked to more than 25 proteins

MATCH ()-[r1:IsA]-(m:TargetBiology {TargetType:"TargetFamily"})
  WITH m, COUNT(r1) AS relCount
  WHERE relCount > 25
  RETURN m

which gives us a table of target classes and counts, part of which is shown below

[Figure: table of target classes and their protein counts]

Overall this seems to be a very powerful platform to integrate data sources and types and effectively query for relationships. The browser-based view is useful for practicing Cypher and answering questions of the dataset. But a REST API is also available, as well as other tools such as Gremlin, that allow for much more flexible applications and sophisticated queries.
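
As a taste of the REST route, here’s a minimal sketch that posts a Cypher statement to the server from Java. I’m assuming a Neo4j 2.x server on localhost:7474 with authentication disabled; the transactional endpoint path and JSON payload format are those documented for the 2.x releases.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CypherRest {
    public static void main(String[] args) throws Exception {
        // A Cypher statement wrapped in the transactional endpoint's JSON format
        String payload = "{\"statements\": [{\"statement\": " +
            "\"MATCH (m {TargetType:'MolecularTarget'}) RETURN count(m)\"}]}";

        URL url = new URL("http://localhost:7474/db/data/transaction/commit");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);

        OutputStream os = conn.getOutputStream();
        os.write(payload.getBytes("UTF-8"));
        os.close();

        // Dump the raw JSON response; a real client would parse it
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        for (String line; (line = in.readLine()) != null; )
            System.out.println(line);
        in.close();
    }
}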

Written by Rajarshi Guha

November 14th, 2015 at 6:10 pm

Substructure Searches – High Speed, Large Scale

with 5 comments

My NCTT colleague, Trung Nguyen, recently announced a prototype chemical substructure search system based on fingerprint pre-screening and an efficient in-memory indexing scheme. I won’t go into the details of the underlying pre-screen and indexing methodology (though the sources are available here). He’s provided a web interface allowing one to draw in substructure queries or specify SMILES or SMARTS patterns, and then search for substructures across a snapshot of PubChem (more than 30M structures).

It is blazingly fast.
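
Part of the speed comes from the pre-screen, which is a standard trick and easy to sketch in Java (with made-up fingerprints; this is the general idea, not necessarily Trung’s exact scheme): a molecule can contain the query as a substructure only if every bit set in the query fingerprint is also set in the molecule’s fingerprint, so a cheap bit test eliminates the bulk of the database before any expensive graph matching is performed.

import java.util.BitSet;

public class PreScreen {
    // True if the molecule passes the screen, i.e. might contain the query;
    // only these candidates go on to the full subgraph isomorphism test
    static boolean maybeContains(BitSet queryFp, BitSet moleculeFp) {
        BitSet missing = (BitSet) queryFp.clone();
        missing.andNot(moleculeFp);    // bits set in the query but not the molecule
        return missing.isEmpty();      // no missing bits => candidate
    }

    public static void main(String[] args) {
        BitSet query = new BitSet(); query.set(3); query.set(17);
        BitSet mol1 = new BitSet(); mol1.set(3); mol1.set(17); mol1.set(42);
        BitSet mol2 = new BitSet(); mol2.set(3);
        System.out.println(maybeContains(query, mol1)); // true: run the full match
        System.out.println(maybeContains(query, mol2)); // false: skip entirely
    }
}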

I decided to run some benchmarks via the REST interface that he provided, using a set of 1000 SMILES derived from an in-house fragmentation of the MLSMR. The 1000-structure subset is available here. For each query structure I recorded the number of hits, the time required for the query and the number of atoms in the query structure. The number of atoms in the query structures ranged from 8 to 132, with a median of 16 atoms.
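
The benchmark loop itself was straightforward – something along the following lines, where the endpoint URL and the one-hit-per-line response format are hypothetical stand-ins for the actual REST interface (and note that a client-side timer like this measures round-trip rather than server-side time):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SSSBenchmark {
    public static void main(String[] args) throws Exception {
        String[] queries = {"c1ccccc1", "C1CCNCC1"}; // stand-ins for the 1000 SMILES
        for (String smiles : queries) {
            String url = "http://server/search?smarts=" +
                         URLEncoder.encode(smiles, "UTF-8"); // hypothetical endpoint

            long start = System.currentTimeMillis();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
            int nhits = 0;
            while (in.readLine() != null) nhits++; // assume one hit per line
            in.close();
            long elapsed = System.currentTimeMillis() - start;

            System.out.println(smiles + "\t" + nhits + "\t" + elapsed + " ms");
        }
    }
}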

The figure below shows the distribution of hits matching the query and the time required to perform the query (on the server) for the 1000 substructures. Clearly, the bulk of the queries take less than 1 sec, even though the result set can contain more than 10,000 hits.

The figures below provide another look. On the left, I plot the number of hits versus the size of the query. As expected, the number of matches drops off with the size of the query. We also observe the expected trend between query times and the size of the result sets. Interestingly, while not a fully linear relationship, the slope of the curve is quite low. Of course, these times do not include retrieval times (the structures themselves are stored in an Oracle database and must be retrieved from there) or network transfer times.

Finally, I was also interested in getting an idea of the number of hits returned for a given size of query structure. The figure below summarizes this data, highlighting the variation in result set size for a given number of query atoms. Some of these are not valid (e.g., query structures with 35, 36, … atoms) as there was just a single query structure with that number of atoms.

Overall, very impressive. And it’s something you can play with yourself.

Written by Rajarshi Guha

November 23rd, 2011 at 1:09 am

HTS and Message Queues

without comments

In my previous post I discussed how we’d like to automate some of our screens – starting from the primary screen, going through data processing and compound selection, and completing the secondary (follow-up) screen. A key feature of such a workflow is the asynchronous nature of the individual steps. Messaging and message queues (MQ) provide an excellent approach to handling this type of problem.

Message queue systems

A number of MQ systems are available, such as ActiveMQ, RabbitMQ and so on. See here for a comparison of different MQ systems. Given that we already use Oracle for our backend databases, we use Oracle Advanced Queuing (AQ). One advantage of this is that we can store the messages in the database, allowing us to keep a history of a screen as well as use SQL queries to retrieve messages if desired. Such storage can obviously slow things down, but our message throughput is low enough that it doesn’t matter for us.
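
As an example of that history, retained messages can be pulled back with plain SQL via the AQ$ view that Oracle creates alongside the queue table. A quick JDBC sketch (assuming the test_qt queue table created below and a few of the view’s documented columns):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BrowseQueue {
    public static void main(String[] args) throws Exception {
        // Adjust the connection string for your instance
        Connection con = DriverManager.getConnection(
            "jdbc:oracle:thin:USER/PASSWD@HOST:PORT:SID");
        Statement st = con.createStatement();

        // AQ$TEST_QT is the view Oracle creates over the test_qt queue table
        ResultSet rs = st.executeQuery(
            "SELECT msg_id, corr_id, msg_state, enq_time " +
            "FROM aq$test_qt ORDER BY enq_time");
        while (rs.next())
            System.out.println(rs.getString("corr_id") + "\t" +
                               rs.getString("msg_state") + "\t" +
                               rs.getTimestamp("enq_time"));
        con.close();
    }
}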

In this post I’ll briefly describe how I set up a queue on the database side and show the code for a Java application to send a message to the queue and retrieve a message from the queue. The example will actually use the JMS API, which Oracle AQ implements. As a result, the code can trivially swap out AQ for any other JMS implementation.
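
For instance, moving to a different JMS provider should only require changing how the connection factory is constructed – the rest of the send/receive code stays the same. A sketch against ActiveMQ (assuming its client jar is on the classpath and a broker is running at tcp://localhost:61616):

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class SwapProvider {
    public static void main(String[] args) throws JMSException {
        // Only this factory construction is provider-specific
        QueueConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://localhost:61616");
        QueueConnection con = factory.createQueueConnection();
        con.start();

        QueueSession sess = con.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
        javax.jms.Queue q = sess.createQueue("input_q");
        MessageProducer producer = sess.createProducer(q);
        producer.send(sess.createTextMessage("hello from a different provider"));
        con.close();
    }
}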

Creating queues & tables

The first step is to create a queue table and some queues in the database. The PL/SQL to generate these is

BEGIN
  DBMS_AQADM.create_queue_table(
    queue_table        => 'test_qt',
    queue_payload_type => 'SYS.AQ$_JMS_MESSAGE');

  DBMS_AQADM.create_queue(
    queue_table    => 'test_qt',
    queue_name     => 'input_q',
    retention_time => DBMS_AQADM.INFINITE);

  DBMS_AQADM.start_queue('input_q');
END;
/
quit

So we’ve created a queue table called test_qt which will hold a queue called input_q. The plan is that we’ll have a process listening on this queue, processing each message as it comes in, and another process that will send a specified number of messages to the queue. The queue_payload_type argument to the create call indicates that we can store any of the standard JMS message types (though we’ll be focusing on the text message type). We’ve also specified that for the input_q queue, messages will be retained in the database indefinitely. This is useful for debugging and auditing purposes.

Message producers & consumers

OK, with the queues set up, we can now write some Java code to send messages and receive them. In this example, the receiving code will actually run continuously, blocking until messages are received.

This example extends TimerTask. The strategy is that when the listener receives a message, it will create a new instance of this task and schedule it immediately on a new thread. As a result the message processing logic is contained within the run method. At this stage, we only consider messages that are of type TextMessage. If that’s the case we simply extract the payload of the message and print it to STDOUT.

You’ll note that we also create a unique listener ID and include that in the output. This is handy when we run multiple listeners and want to check that messages are being received by all of them.

// Standard JMS imports plus the Oracle AQ extension classes (from aqapi.jar)
import java.util.Properties;
import java.util.Timer;
import java.util.TimerTask;
import java.util.UUID;

import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.jms.TextMessage;

import oracle.AQ.AQQueueTable;
import oracle.jms.AQjmsDestinationProperty;
import oracle.jms.AQjmsFactory;
import oracle.jms.AQjmsSession;

public class QueueExample extends TimerTask {
    static final String URL = "jdbc:oracle:thin:USER/PASSWD@HOST:PORT:SID";
    private Message mesg;

    /* Useful to differentiate between multiple instances of the listener */
    private static final String listenerID = UUID.randomUUID().toString();

    static final String schema = "wtc";
    static final String qTable = "test_qt";
    static final String qName = "input_q";

    static QueueConnection con = null;
    static QueueSession sess = null;
    static javax.jms.Queue q = null;

    public QueueExample(Message m) {
        mesg = m;
    }

    public void run() {
        try {
            if (!(mesg instanceof TextMessage)) return;

            String payload = ((TextMessage) mesg).getText();
            System.out.println(listenerID + ": Got msg: " + payload);
        }
        catch (JMSException e) {
            e.printStackTrace();
        }
    }

Before looking at sending and receiving messages, we need to initialize the connection to the message queue

    private static void initializeQueue() throws JMSException {
        QueueConnectionFactory queue = AQjmsFactory.getQueueConnectionFactory(URL, new Properties());
        con = (QueueConnection) queue.createConnection(); // assign the static field; a local here would leave con null when we close it later
        con.start();

        sess = (QueueSession) con.createSession(false, Session.AUTO_ACKNOWLEDGE);
        AQQueueTable qtab = ((AQjmsSession) sess).getQueueTable(schema, qTable);

        try {
            q = ((AQjmsSession) sess).getQueue(schema, qName);
        }
        catch (Exception ex) {
            AQjmsDestinationProperty props = new AQjmsDestinationProperty();
            q = ((AQjmsSession) sess).createQueue(qtab, qName, props);
        }
    }

The next step is to listen for messages and dispatch them for processing. The method below initializes the queue if it isn’t already initialized. After creating a consumer object, we simply wait for messages to come in. The receive method is blocking, so the program will wait for the next message. Once a message is received, it creates an instance of this class and schedules it – when the thread starts, the run method will execute to process the message.

    public static void listener() throws JMSException {
        if (q == null) initializeQueue();
        System.out.println(listenerID + ": Listening on queue " + q.getQueueName() + "...");
        MessageConsumer consumer = sess.createConsumer(q);

        // each time we get a message, start up the message handler in a new thread
        for (Message m; (m = consumer.receive()) != null;) {
            new Timer().schedule(new QueueExample(m), 0);
        }

        sess.close();
        con.close();
    }

The final component is to send messages. For this simple example, it’s primarily boilerplate code. In this case, we specify how many messages to send. The DeliveryMode.PERSISTENT setting indicates that the messages will be stored (in this case in the DB) until a consumer has received them. Note that after receipt by a consumer the message may or may not be stored in the database. See here for more details.

In the code below, we can set a variety of properties on the message. For example, we’ve set an “application id” (the JMSXAppID property) and a correlation id. Right now, we ignore these, but they can be used to link messages or even link a message to an external resource (though that could also be done via the payload itself). Another useful property that could be set is the message type, via setJMSType. Using this one can assign a MIME type to a message, allowing the message processing code to conditionally handle the message based on the type. For more details on the various properties that can be set, see the Message documentation.

    public static void sender(int n) throws JMSException {
        if (q == null) initializeQueue();

        MessageProducer producer = sess.createProducer(q);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);

        Message msg;
        for (int i = 0; i < n; i++) {
            msg = sess.createTextMessage();
            msg.setStringProperty("JMSXAppID", "QueueExample");
            msg.setJMSCorrelationID(UUID.randomUUID().toString());
            ((TextMessage) msg).setText("This is message number " + i);
            producer.send(msg);
        }
        producer.close();
        sess.close();
    }

Running

The complete source code can be found here. To compile it you’ll need an OJDBC jar file as well as the following jar files (that come with the Oracle installation)

  • $ORACLE_HOME/rdbms/jlib/aqapi.jar
  • $ORACLE_HOME/rdbms/jlib/jmscommon.jar
  • $ORACLE_HOME/jlib/jndi.jar
  • $ORACLE_HOME/jlib/jta.jar
  • $ORACLE_HOME/rdbms/jlib/xdb.jar
  • $ORACLE_HOME/lib/xmlparserv2.jar

Once the code has been compiled to a jar file, we first start the listener:

guhar$ java -jar dist/qex.jar listen
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Listening on queue input_q...

In another terminal, we send some messages:

guhar$ java -jar dist/qex.jar send 5

Switching to the previous terminal we should see something like

8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 0
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 1
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 2
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 3
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 4

The fun starts when we instantiate multiple listeners (possibly on different machines). It’s simple enough to execute the first invocation above multiple times and watch the output as we send more messages. If you send 10 messages, you should see that some are handled by one listener, some by another and so on. If the actual message processing is compute intensive, this allows you to easily distribute such loads across multiple listeners.

Next steps

The code discussed here is a minimalistic example of sending and receiving messages from a queue. In the next post, I’ll discuss how we can represent messages in the database using a custom message type (defined in terms of an Oracle ADT) and send and receive such messages using Java. Such custom message types allow the Java code to remain object oriented, with the AQ libraries handling serialization and deserialization of the messages between our code and the queue.

One of the downsides that I see with Oracle AQ is that the only clients supported are PL/SQL, C and Java. While AQ implements the JMS API, it employs its own wire protocol. The lack of support for AMQP means that a lot of client libraries in other languages cannot be used to send or retrieve messages from AQ. If anybody knows of Python packages that work with Oracle AQ I’d love to hear about them. (Looks like stomppy might support AQ?)

Written by Rajarshi Guha

July 11th, 2010 at 9:00 pm

Posted in software
