# So much to do, so little time

Trying to squeeze sense out of chemical data

## Another Oracle Structure Search Cartridge

I came across an ASAP paper today describing substructure searching in Oracle databases. The paper comes from the folks at J & J and is part of their series of papers on the ABCD platform. Performing substructure searches in databases is certainly not a new topic, and various products out there support this in Oracle (as well as other RDBMSs). The paper describes how the ABCD system does this using a combination of structure-derived hash keys and an inverted bitset-based index, and discusses their implementation as an Oracle cartridge. The authors provide an interesting discussion of how their implementation supports Cost Based Optimization of SQL queries involving substructure search. They also run a number of benchmarks. In terms of comparative benchmarks, they compare the performance (i.e., screening efficiency) of their hashed keys against MACCS keys, CACTVS keys and OpenBabel FP2 fingerprints. Their results indicate that the screening step is a key bottleneck in the query process and that their hash key is generally more selective than the others.
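To make the idea of an inverted bitset index a bit more concrete, here is a minimal Java sketch. It is not the ABCD implementation: the paper's hash keys and index layout are not reproduced here, and the class name `InvertedKeyIndex`, the integer key IDs and the toy molecules are all made up for illustration. The assumption is simply that every molecule has been reduced to a set of structure-derived keys, that each key points to a `BitSet` of the molecules containing it, and that screening a query means intersecting the bitsets of the query's keys.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class InvertedKeyIndex {
    // key ID -> bitset of molecule IDs whose structure produced that key
    private final Map<Integer, BitSet> postings = new HashMap<>();
    private int nMolecules = 0;

    /** Register a molecule by its (hypothetical) integer keys; returns its ID. */
    public int addMolecule(int... keys) {
        int molId = nMolecules++;
        for (int key : keys) {
            postings.computeIfAbsent(key, k -> new BitSet()).set(molId);
        }
        return molId;
    }

    /** Screening step: IDs of molecules that contain every key of the query. */
    public BitSet screen(int... queryKeys) {
        BitSet candidates = new BitSet(nMolecules);
        candidates.set(0, nMolecules);      // start with every molecule
        for (int key : queryKeys) {
            BitSet hits = postings.get(key);
            if (hits == null) {             // key never indexed: nothing can match
                candidates.clear();
                break;
            }
            candidates.and(hits);           // intersect with this key's molecules
        }
        return candidates;
    }

    public static void main(String[] args) {
        InvertedKeyIndex index = new InvertedKeyIndex();
        index.addMolecule(1, 2, 3);   // molecule 0
        index.addMolecule(2, 3);      // molecule 1
        index.addMolecule(1, 4);      // molecule 2
        System.out.println(index.screen(2, 3));   // prints {0, 1}
        System.out.println(index.screen(1));      // prints {0, 2}
        System.out.println(index.screen(5));      // prints {}
    }
}
```

The candidates that survive screening would still need a full atom-by-atom match; the more selective the keys, the fewer candidates reach that expensive step.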

Unfortunately, what would have been interesting but was not provided is a comparison of performance at the Oracle query level with other products such as JChem Cartridge and OrChem. Furthermore, the test case is just under a million molecules from Golovin & Henrick; the entire dataset (not just the keys) could probably reside in memory on today's servers. How does the system perform when faced with, say, PubChem (34 million molecules)? The paper mentions a command line implementation of their search procedure, but as far as I can tell, the Oracle cartridge is not available.

The ABCD system has many useful and interesting features. But as with the other publications on this system, this paper is one more in the line of “Papers About Systems You Can’t Use or Buy”. Unfortunate.

Written by Rajarshi Guha

November 10th, 2011 at 11:00 pm

Posted in Literature,cheminformatics


## HTS and Message Queues

In my previous post I discussed how we’d like to automate some of our screens – starting from the primary screen, going through data processing and compound selection and completing the secondary (follow up) screen. A key feature of such a workflow is the asynchronous nature of the individual steps. Messaging and Message queues (MQ) provide an excellent approach to handling this type of problem.

## Message queue systems

A number of such MQ systems are available, such as ActiveMQ, RabbitMQ and so on. See here for a comparison of different MQ systems. Given that we already use Oracle for our backend databases, we use Oracle Advanced Queuing (AQ). One advantage of this is that we can store the messages in the database, allowing us to keep a history of a screen as well as use SQL queries to retrieve messages if desired. Such storage can obviously slow things down, but our message throughput is low enough that it doesn’t matter for us.

In this post I’ll briefly describe how I set up a queue on the database side and show the code for a Java application to send a message to the queue and retrieve a message from the queue. The example will actually use the JMS API, which Oracle AQ implements. As a result, the code can trivially swap out AQ for any other JMS implementation.

## Creating queues & tables

The first step is to create a queue table and some queues in the database. The PL/SQL to generate these is

```sql
BEGIN
  DBMS_AQADM.create_queue_table(
    queue_table        => 'test_qt',
    queue_payload_type => 'SYS.AQ$_JMS_MESSAGE');

  DBMS_AQADM.create_queue(
    queue_table    => 'test_qt',
    queue_name     => 'input_q',
    retention_time => DBMS_AQADM.INFINITE);

  DBMS_AQADM.start_queue('input_q');
END;
/
quit
```

So we’ve created a queue table called test_qt which will hold a queue called input_q. The plan is that we’ll have one process listening on this queue and processing each message as it arrives, and another process that sends a specified number of messages to the queue. The queue_payload_type argument to the create_queue_table call indicates that we can store any of the standard JMS message types (though we’ll be focusing on the text message type). We’ve also specified that for the input_q queue, messages will be retained in the database indefinitely. This is useful for debugging and auditing purposes.

## Message producers & consumers

OK, with the queues set up, we can now write some Java code to send messages and receive them. In this example, the receiving code will actually run continuously, blocking until messages are received. The example class extends TimerTask. The strategy is that when the listener receives a message, it will create a new instance of this task and schedule it immediately on a new thread. As a result the message processing logic is contained within the run method. At this stage, we only consider messages that are of type TextMessage. If that’s the case we simply extract the payload of the message and print it to STDOUT. You’ll note that we also create a unique listener ID and include that in the output. This is handy when we run multiple listeners and want to check that messages are being received by all of them.
```java
// Imports needed by the complete class (this and the following excerpts)
import java.util.Properties;
import java.util.Timer;
import java.util.TimerTask;
import java.util.UUID;

import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSession;
import javax.jms.Session;
import javax.jms.TextMessage;

import oracle.AQ.AQQueueTable;
import oracle.jms.AQjmsDestinationProperty;
import oracle.jms.AQjmsFactory;
import oracle.jms.AQjmsSession;

public class QueueExample extends TimerTask {
    static final String URL = "jdbc:oracle:thin:USER/PASSWD@HOST:PORT:SID";

    private Message mesg;

    /* Useful to differentiate between multiple instances of the listener */
    private static final String listenerID = UUID.randomUUID().toString();

    static final String schema = "wtc";
    static final String qTable = "test_qt";
    static final String qName = "input_q";

    static QueueConnection con = null;
    static QueueSession sess = null;
    static javax.jms.Queue q = null;

    public QueueExample(Message m) {
        mesg = m;
    }

    /* Process a single message: print the payload of text messages, ignore the rest */
    public void run() {
        try {
            if (!(mesg instanceof TextMessage)) return;
            String payload = ((TextMessage) mesg).getText();
            System.out.println(listenerID + ": Got msg: " + payload);
        } catch (JMSException e) {
            e.printStackTrace();
        }
    }
```

Before looking at sending and receiving messages, we need to initialize the connection to the message queue:

```java
    private static void initializeQueue() throws JMSException {
        QueueConnectionFactory queue = AQjmsFactory.getQueueConnectionFactory(URL, new Properties());
        // assign to the static field (a local variable here would leave the field null)
        con = (QueueConnection) queue.createConnection();
        con.start();
        sess = (QueueSession) con.createSession(false, Session.AUTO_ACKNOWLEDGE);
        AQQueueTable qtab = ((AQjmsSession) sess).getQueueTable(schema, qTable);

        try {
            // look up the existing queue ...
            q = ((AQjmsSession) sess).getQueue(schema, qName);
        } catch (Exception ex) {
            // ... and create it in the queue table if the lookup fails
            AQjmsDestinationProperty props = new AQjmsDestinationProperty();
            q = ((AQjmsSession) sess).createQueue(qtab, qName, props);
        }
    }
```

The next step is to listen for messages and dispatch them for processing. The method below initializes the queue if it isn’t already initialized. After creating a consumer object, we simply wait for messages to come in. The receive method is blocking, so the program will wait for the next message.
Once a message is received, the listener creates an instance of this class and schedules it; when the thread starts, the run method will execute to process the message.

```java
    public static void listener() throws JMSException {
        if (q == null) initializeQueue();
        System.out.println(listenerID + ": Listening on queue " + q.getQueueName() + "...");
        MessageConsumer consumer = sess.createConsumer(q);

        // each time we get a message, start up the message handler in a new thread
        for (Message m; (m = consumer.receive()) != null;) {
            new Timer().schedule(new QueueExample(m), 0);
        }

        sess.close();
        con.close();
    }
```

The final component is to send messages. For this simple example, it’s primarily boilerplate code. In this case, we specify how many messages to send. The DeliveryMode.PERSISTENT setting indicates that the messages will be stored (in this case in the DB) until a consumer has received them. Note that after receipt by a consumer the message may or may not be stored in the database. See here for more details.

In the code below, we can set a variety of properties on the message. For example, we’ve set an “application id” (the JMSXAppID property) and a correlation id. Right now, we ignore these, but they can be used to link messages or even link a message to an external resource (though that could also be done via the payload itself). Another useful property that could be set is the message type, via setJMSType. Using this, one can assign a MIME type to a message, allowing the message processing code to conditionally handle the message based on the type. For more details on the various properties that can be set, see the Message documentation.
```java
    public static void sender(int n) throws JMSException {
        if (q == null) initializeQueue();
        MessageProducer producer = sess.createProducer(q);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);

        Message msg;
        for (int i = 0; i < n; i++) {
            msg = sess.createTextMessage();
            msg.setStringProperty("JMSXAppID", "QueueExample");
            msg.setJMSCorrelationID(UUID.randomUUID().toString());
            ((TextMessage) msg).setText("This is message number " + i);
            producer.send(msg);
        }

        producer.close();
        sess.close();
    }
```

## Running

The complete source code can be found here. To compile it you’ll need an OJDBC jar file as well as the following jar files (that come with the Oracle installation):

- $ORACLE_HOME/rdbms/jlib/aqapi.jar
- $ORACLE_HOME/rdbms/jlib/jmscommon.jar
- $ORACLE_HOME/jlib/jndi.jar
- $ORACLE_HOME/jlib/jta.jar
- $ORACLE_HOME/rdbms/jlib/xdb.jar
- $ORACLE_HOME/lib/xmlparserv2.jar

Once the code has been compiled to a jar file, we first start the listener:

```
guhar$ java -jar dist/qex.jar listen
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Listening on queue input_q...
```

In another terminal we send some messages

```
guhar$ java -jar dist/qex.jar send 5
```

Switching to the previous terminal we should see something like

```
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 0
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 1
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 2
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 3
8b9fc2a2-533c-4426-a368-3e6ddfb41587: Got msg: This is message number 4
```

The fun starts when we instantiate multiple listeners (possibly on different machines). It’s simple enough to execute the first invocation above multiple times and watch the output as we send more messages. If you send 10 messages, you should see that some are handled by one listener, the remainder by another one, and so on. If the actual message processing is compute intensive, this allows you to easily distribute the load.

## Next steps

The code discussed here is a minimalistic example of sending and receiving messages from a queue. In the next post, I’ll discuss how we can represent messages in the database using a custom message type (defined in terms of an Oracle ADT) and send and receive such messages using Java. Such custom message types allow the Java code to remain object oriented, with the AQ libraries handling serialization and deserialization of the messages between our code and the queue.

One of the downsides that I see with Oracle AQ is that the only clients supported are PL/SQL, C and Java. While AQ implements the JMS API, it employs its own wire protocol. The lack of support for AMQP means that a lot of client libraries in other languages cannot be used to send or retrieve messages from AQ. If anybody knows of Python packages that work with Oracle AQ I’d love to hear about them. (Looks like stomppy might support AQ?)
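The way several listeners share the work can be tried out without a database. The sketch below is an in-memory stand-in and not AQ or JMS: a `java.util.concurrent.BlockingQueue` plays the role of input_q and a few pool threads play the listeners. The class name `CompetingConsumersDemo`, the `run` helper and the `POISON` stop marker are all invented for this illustration. What it is meant to show is only that each message is picked up by exactly one of the competing consumers.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class CompetingConsumersDemo {
    // Marker telling a worker to stop (hypothetical; not a JMS or AQ convention)
    private static final String POISON = "__STOP__";

    /** Sends nMessages to nListeners workers; returns message -> ID of the worker that handled it. */
    public static Map<String, String> run(int nListeners, int nMessages) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Map<String, String> handledBy = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(nListeners);

        for (int i = 0; i < nListeners; i++) {
            String listenerId = "listener-" + i;
            pool.submit(() -> {
                try {
                    // take() blocks until a message is available, much like consumer.receive()
                    for (String msg; !(msg = queue.take()).equals(POISON);) {
                        handledBy.put(msg, listenerId);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (int i = 0; i < nMessages; i++) {
            queue.add("This is message number " + i);
        }
        for (int i = 0; i < nListeners; i++) {
            queue.add(POISON);   // one stop marker per worker
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return handledBy;
    }

    public static void main(String[] args) {
        run(3, 10).forEach((msg, id) -> System.out.println(id + ": Got msg: " + msg));
    }
}
```

Which worker gets which message varies from run to run, just as with real listeners. The fixed-size pool is a design choice of this sketch: each `java.util.Timer` owns its own background thread, so for compute-intensive handlers a bounded pool may be a gentler way to hand off messages than a new Timer per message.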
Written by Rajarshi Guha

July 11th, 2010 at 9:00 pm

Posted in software

## A GPL3 Oracle Cheminformatics Cartridge

Sometime back I had mentioned a new cheminformatics toolkit, Indigo. Recently, Dmitry from SciTouch let me know that they had also developed Bingo, an Oracle cartridge based on Indigo, to perform cheminformatics operations in the database. This expands the current ecosystem of Open Source database cartridges (PGChem, MyChem, OrChem), which pretty much covers all the main RDBMSs (Postgres, MySQL and Oracle).

SciTouch have also provided a live instance of their database and associated cartridge, so you can play with it without requiring a local Oracle install. (It’d be useful to provide some details of the hardware that the DB is running on, so that timing numbers get some context.)

Written by Rajarshi Guha

January 24th, 2010 at 2:35 pm

Posted in software

## Oracle Notes

Some handy settings when running a query from the command line via sqlplus

```sql
set echo off
set heading on
set linesize 1024
set pagesize 0
set tab on
set trims on
set wrap off
-- might want to set column formats here
-- e.g.: column foo format A10
-- dump results to stats.lst
spool stats
-- SQL query here
spool off
exit
```

Written by Rajarshi Guha

October 6th, 2009 at 2:26 pm

Posted in Uncategorized

## R and Oracle

It’s been a while since my last post, but I’m getting up to speed at work. It’s been less than a month, but there’s already a ton of cool stuff going on. One of the first things I’ve been getting to grips with is the data infrastructure at the NCGC, which is based around Oracle. One of my main projects is handling informatics for RNAi screening. As the data come out of the pilots, they get loaded into the Oracle infrastructure. Being an R aficionado, I’m doing the initial, exploratory analyses (normalization, hit selection, annotation etc.) using R.
Thus I needed to have a way to access an Oracle DB from R. This is supported by the ROracle package. But it turns out that the installation is a little non-obvious and I figured I’d describe the procedure (on OS X 10.5) for posterity.

The first thing to do is to get Oracle from here. Note that this is the full Oracle installation and while it comes with 32 bit and 64 bit libraries, some of the binaries that are required during the R install are 64 bit only. After getting the zip file, extract the installation files and run the installation script. Since I just needed the libraries (as opposed to running an actual Oracle DB), I just went with the defaults and opted out of the actual DB creation step. After installation is done, it’s useful to set the following environment variables:

```
export ORACLE_HOME=/Users/foo/oracle
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH
export DYLD_LIBRARY_PATH=$ORACLE_HOME/lib:$DYLD_LIBRARY_PATH
```

With Oracle installed, execute the following

```
$ORACLE_HOME/bin/genclntst
```

This will link a variety of object files into a library, which is required by the R package, but doesn’t come in the default Oracle installation.

The next thing is to get a 64 bit version of R from here and simply install as usual. Note that this will require you to reinstall all your packages, if you had a previous version of R around. Specifically, before installing ROracle, make sure to install the DBI package.

After installing R, get the ROracle 0.5-9 source package. Since there’s no binary build for OS X, we have to compile it ourselves. Before building, I like to CHECK the package to make sure that all is OK. Thus, the sequence of commands is

```
tar -zxvf ROracle_0.5-9.tar.gz
R --arch x86_64 CMD CHECK ROracle
R --arch x86_64 CMD BUILD ROracle
R --arch x86_64 CMD INSTALL -l /Users/guhar/Library/R/2.9/library ROracle_0.5-9.tar.gz
```

When I ran the CHECK, I did get some warnings, but it seems to be safe to ignore them.

At this stage, the ROracle package should be installed and you can start R and load the package. Remember to start R with the --arch x86_64 argument, since the ROracle package will have been built for the 64 bit version of R.

Written by Rajarshi Guha

June 17th, 2009 at 3:19 am

Posted in software
