
Multi-threaded Database Access with Python


Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching, and to achieve that we've been employing the Ballester and Graham-Richards method. In this post I'm going to talk about performance: how we went from a single monolithic database with long query times to multiple databases and significantly faster, multi-threaded queries.

Indexing Blues: The method allows us to represent each molecule as a 12-D vector. We can then identify molecules similar in shape to a query by finding its nearest neighbors (within a radius R) in 12-D space. To do this quickly, we employ an R-tree index, which lets us perform such nearest-neighbor queries very efficiently. However, we face a problem. The goal of an index is to let us avoid scanning the whole table; ideally, an index is a compact representation of a table column, and one generally expects it to fit in RAM. For Pub3D, the R-tree index is approximately 5GB, so it cannot be held in RAM. As a result, simply reading the index took a significant amount of time, and even increasing the amount of RAM available to the Postgres server (shared_buffers) didn't improve things a whole lot. Furthermore, if we wanted to include multiple conformers, the table could expand by a factor of 10 at a minimum. So the initial approach of a single database was untenable.
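
A rough sketch of how such a 12-D shape descriptor can be computed from a conformer's atomic coordinates is shown below. It loosely follows the published description of the method (distances to four reference points, summarized by three moments each); the exact reference points and moment definitions used in Pub3D are assumptions here, so treat this as an illustration rather than the actual implementation.

    import numpy as np

    def shape_descriptor(coords):
        # coords: an (N, 3) numpy array of atomic coordinates for one conformer
        ctd = coords.mean(axis=0)                     # molecular centroid
        d_ctd = np.linalg.norm(coords - ctd, axis=1)  # atom distances to the centroid
        cst = coords[d_ctd.argmin()]                  # atom closest to the centroid
        fct = coords[d_ctd.argmax()]                  # atom farthest from the centroid
        d_fct = np.linalg.norm(coords - fct, axis=1)
        ftf = coords[d_fct.argmax()]                  # atom farthest from fct

        desc = []
        for ref in (ctd, cst, fct, ftf):
            d = np.linalg.norm(coords - ref, axis=1)
            # three moments of the distance distribution: mean, variance, third central moment
            desc.extend([d.mean(), d.var(), ((d - d.mean()) ** 3).mean()])
        return np.array(desc)                         # the 12-D descriptor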

Solution: The simple solution was to partition the table into six separate tables and place them on six different machines (each using its own local disk rather than something like NFS). This leads to multiple benefits. First, the R-tree index is much smaller for each individual database, and all of it (or a large fraction) can be stored in RAM. As a result, queries are significantly faster. Second, each database can be queried independently; since each one uses its own disk, the queries can be truly asynchronous.
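
To make the partitioning concrete, one simple scheme is to assign each compound to one of the six databases by taking its ID modulo six; the scheme and names below are hypothetical, purely to illustrate the idea.

    NUM_PARTITIONS = 6

    def partitionFor(cid):
        # map a compound ID to one of the six databases (hypothetical scheme)
        return cid % NUM_PARTITIONS

    # e.g. partitionFor(12345) == 3, so that row would be loaded into database 3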

The next step is to query these databases. The front end to Pub3D is a PHP page that retrieves results via a REST-like interface (example) to the actual databases. The interface is implemented in Python, using mod_python and psycopg2. Given that we have six databases that can be queried simultaneously, it's natural for the Python code to be multi-threaded. But before describing that, let's consider the performance of the serial approach, that is, querying each database one by one. We use a simple query:

    select count(*) from pubchem_3d

The code to perform the queries serially would look something like this:

    # condetails is a list of (host, port) pairs, one per database server
    allRows = []
    for host, port in condetails:
        con = _getConnection(host, port)
        if not con:
            raise Exception("could not connect to %s:%s" % (host, port))
        cursor = con.cursor()
        cursor.execute(query)
        for row in cursor.fetchall():
            allRows.append(row)
        con.close()

The Python code connects to the databases remotely, retrieves the results of the query, and places them into a list. The time taken for the above code is approximately 17.4s (the database servers run Postgres 8.2 and have 4GB of RAM, with 2.5GB allocated to shared_buffers).
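
The _getConnection helper referenced above isn't shown; a minimal sketch using psycopg2 might look like the following, where the database name and user are assumptions.

    import psycopg2

    def _getConnection(host, port):
        # return a connection to the given server, or None if the connection fails
        # (the database name and user below are assumptions)
        try:
            return psycopg2.connect(host=host, port=port,
                                    database='pub3d', user='pub3d')
        except psycopg2.Error:
            return None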

But 17.4s is not fast enough. So we next consider how we can query the databases using multiple threads. To do this we create a class derived from threading.Thread, whose job is to query a single database.

    import threading

    class DBThread(threading.Thread):
        """Query a single remote database and append the resulting rows to a global list."""

        def __init__(self, host, port, query):
            self.host = host
            self.port = port
            self.query = query
            threading.Thread.__init__(self)

        def run(self):
            global allRows
            con = _getConnection(self.host, self.port)
            if not con: return
            cursor = con.cursor()
            cursor.execute(self.query)
            for row in cursor.fetchall():
                allRows.append(row)
            con.close()

The class is initialized with the host name and port number of the remote database, along with the query string. When each instance of the class is started, it performs the query and puts the rows of the result into a global list. So we can easily start up six threads and then wait for them all to finish before proceeding:

    import time

    query = "select count(*) from pubchem_3d"
    allRows = []
    threadList = []
    for host, port in condetails:
        dbt = DBThread(host, port, query)
        threadList.append(dbt)
        dbt.start()

    # wait until the main thread is the only one left running
    while threading.activeCount() > 1: time.sleep(1)

When the above code is timed, we get all the results back in 4s, which is much nicer! Now, this is quite a simple problem, as there are no deadlock issues or other monsters of multi-threaded programming to address.
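
As an aside, an alternative to polling threading.activeCount() is to join() each worker thread explicitly; the effect is the same, but it doesn't assume the main thread is the only other thread around.

    # wait for each worker to finish, rather than polling the active thread count
    for dbt in threadList:
        dbt.join()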

Caveats: The code above is certainly not optimal. First, since there is overhead to starting a thread, one should really use a thread pool; in this case the number of threads is always fixed, so I didn't bother. Second, and probably most importantly, the threading module does not really provide true parallelism. There has been much discussion surrounding the Global Interpreter Lock (GIL), and I'm not enough of an expert to comment on it. However, the limitation of the threading module can be seen in the fact that all the threads run on the same core of a multi-core CPU. Supposedly, Python 3000 aims to address this issue and provide true threading support. An alternative is the multiprocessing module, available in Python 2.6 and above, which uses subprocesses rather than threads and so provides concurrency that avoids the GIL. As with the threading module, it's quite easy to use.
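
For completeness, here is a minimal sketch of how the same fan-out could be written with multiprocessing.Pool instead of threads. The queryOne helper is hypothetical, and it assumes _getConnection and condetails are defined at module level; error handling is omitted.

    from multiprocessing import Pool

    def queryOne(args):
        # run the query against a single database and return its rows
        host, port, query = args
        con = _getConnection(host, port)
        if not con: return []
        cursor = con.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
        con.close()
        return rows

    if __name__ == '__main__':
        query = "select count(*) from pubchem_3d"
        pool = Pool(processes=len(condetails))
        results = pool.map(queryOne, [(h, p, query) for h, p in condetails])
        allRows = [row for rows in results for row in rows]

Here pool.map blocks until all the worker processes have returned, so no explicit polling or joining is needed.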

But the conclusion is that for data-parallel tasks such as this, the Python threading module lets us easily write code that achieves nice speedups.

Written by Rajarshi Guha

November 14th, 2008 at 4:46 pm

8 Responses to 'Multi-threaded Database Access with Python'


  1. Python 3.0 does NOT solve this issue. It doesn’t even touch it. The alternative of multiprocessing is valid though, if it fits your use case.

    Your application presumably spends most of its time waiting for the database, thus the GIL had minimal impact. However, on a multicore box, if you kept adding threads you would eventually hit limits.

    Rhamphoryncus

    14 Nov 08 at 5:57 pm

  2. Aah, thanks for the pointer. Yes, you're right about the code waiting on the DB.

    Rajarshi Guha

    14 Nov 08 at 6:08 pm

  3. Or use IPy or Jython if possible. AFAI understand, these don’t have a GIL.

    baoilleach

    15 Nov 08 at 2:39 pm

  4. I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .

    “”"Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”"”

    Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.

    Have you done any timing comparisons between these? You’re in Postgres so I don’t know how easy it is to do that comparison.

    Andrew Dalke

    20 Nov 08 at 4:33 am

  5. [...] 20, 2008 by Rajarshi Guha A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned [...]

  6. Rajarshi Guha

    20 Nov 08 at 5:44 pm

  7. I noticed that the time dropped from 17s to 4s after you added 5 more computers (6 in total). But 4s is more than 17s/6.

    I don't know Postgresql, but it looks to me like the DB was not in RAM for the first query. The 17M-row database couldn't have been, and if I/O wait is the worst part, then I would expect that 6 servers, each with its DB in RAM, would give you better results than a single server (better than 17s/6 because of the RAM).

    It would be nice if you could do the timing by running the tests several times.

    roman

    28 Dec 08 at 10:31 pm

  8. Well, timing results will depend on what is being timed :) The 4 seconds that I reported included performing the query, collecting the results from each sub-DB across the network into a Python list, and then collecting all the results into a single list. So that would explain part of the extra time over 2.8 s (which is what one might expect if the query time scaled linearly, i.e., 17 s / 6).

    But also, I noted that we end up doing a linear scan, so it's possible that the entire DB is never completely in RAM. Also, these are not dedicated servers, so it's possible that other processes cause part of the DB to be paged out. I'm not very familiar with the RAM usage and caching strategies that Postgres employs.

    To specifically answer your question, I ran the same query 9 times and the mean query time (for the procedure described above) was 4006 ms (sd = 2.34 ms). On the other hand, if I time just the query for each individual database (i.e., ignoring the retrieval of results from the DB and all later steps), the average times range from 2276 ms to 3101 ms (averaged over 7 runs). Thus, the actual query time for a single DB is much closer to what is expected.

    Of course, this should all be done a little more rigorously, and if we get around to writing the paper I probably will :)

    Rajarshi Guha

    29 Dec 08 at 12:45 am
