Multi-threaded Database Access with Python

Pub3D contains about 17.3 million 3D structures for PubChem compounds, stored in a Postgres database. One of the things we wanted to do was 3D similarity searching, and to achieve that we’ve been employing the method of Ballester and Richards. In this post I’m going to talk about performance: how we went from a single monolithic database with long query times to multiple databases and significantly faster, multi-threaded queries.

Indexing Blues: The method allows us to represent each molecule as a 12-D vector. We can then identify molecules similar in shape to a query by finding its nearest neighbors (within a radius R) in 12-D space. To do this quickly, we employ an R-tree index, which lets us perform such nearest-neighbor queries very efficiently. However, we face a problem. The goal of an index is to let us avoid scanning a whole table; ideally an index is a compact representation of a table column, and in general one expects it to be held in RAM. For Pub3D the R-tree index is approximately 5GB, so we cannot store it in RAM. As a result, simply reading the index took a significant amount of time, and even increasing the amount of RAM available to the Postgres server (shared_buffers) didn’t improve things a whole lot. Furthermore, if we wanted to include multiple conformers, the table size could expand by a factor of 10 at a minimum. So the initial approach of a single database was untenable.
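For background, the 12-D vector in the Ballester and Richards approach is built from the first three moments of four atomic distance distributions. The sketch below is purely illustrative (it is not the Pub3D code); it assumes an N x 3 NumPy array of atomic coordinates, and the exact moment definitions vary between implementations.

    import numpy as np

    def shape_descriptor(coords):
        # coords: an N x 3 NumPy array of atomic coordinates (illustrative only)
        def dists(ref):
            # Euclidean distance from every atom to the reference point
            return np.sqrt(((coords - ref) ** 2).sum(axis=1))

        ctd = coords.mean(axis=0)              # molecular centroid
        d_ctd = dists(ctd)
        cst = coords[d_ctd.argmin()]           # atom closest to the centroid
        fct = coords[d_ctd.argmax()]           # atom farthest from the centroid
        ftf = coords[dists(fct).argmax()]      # atom farthest from fct

        desc = []
        for ref in (ctd, cst, fct, ftf):
            d = dists(ref)
            mu = d.mean()
            desc += [mu,                          # first moment: mean distance
                     d.std(),                     # second moment: spread
                     ((d - mu) ** 3).mean()]      # third central moment: skew
        return desc                               # 12 numbers per conformer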

Solution: The simple solution was to partition the table into six separate tables and place them on six different machines (using separate disks rather than something like NFS). This leads to multiple benefits. First, the R-tree index for each individual database is much smaller, so all of it (or at least a large fraction) can be held in RAM. As a result, queries are significantly faster. Second, each database can be queried independently; since each one uses its own disk, the queries can be truly asynchronous.
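The post doesn’t spell out how rows are assigned to the six tables; one simple, purely illustrative scheme (not necessarily what Pub3D actually does) is to take an integer compound ID modulo the number of shards.

    NSHARDS = 6

    def shard_for(cid):
        # map a (hypothetical) integer compound ID to one of the six databases
        return cid % NSHARDS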

The next step is to query these databases. The front end to Pub3D is a PHP page that retrieves results via a REST-like interface (example) to the actual databases. The interface is implemented in Python, using mod_python and psycopg2. Given that we have six databases that can be queried simultaneously, it’s natural that the Python code should be multi-threaded. But before describing that, let’s consider the performance of serial code, that is, querying each database one by one. We use a simple query

    select count(*) from pubchem_3d
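
For the snippets that follow, assume condetails is a list of (host, port) pairs, one per database server, and _getConnection is a small helper around psycopg2. Neither was shown explicitly in the original post, so the definitions below are only a plausible sketch; the host names, database name and user are invented.

    import psycopg2

    # hypothetical connection details for the six database servers
    condetails = [('db1.example.org', 5432), ('db2.example.org', 5432),
                  ('db3.example.org', 5432), ('db4.example.org', 5432),
                  ('db5.example.org', 5432), ('db6.example.org', 5432)]

    def _getConnection(host, port):
        # return a psycopg2 connection, or None if the connection attempt fails
        try:
            return psycopg2.connect(host=host, port=port,
                                    database='pub3d', user='reader')
        except psycopg2.Error:
            return None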

The code to perform the queries would look something like this

    query = "select count(*) from pubchem_3d"
    allRows = []
    for host, port in condetails:
        con = _getConnection(host, port)
        if not con: raise Exception("could not connect to %s:%s" % (host, port))
        cursor = con.cursor()
        cursor.execute(query)
        # collect the rows from this database into the global result list
        for row in cursor.fetchall():
            allRows.append(row)
        con.close()

The Python code connects to the databases remotely, simply retrieves the result of the query, and places it into a global list. The time taken for the above code is approximately 17.4s (the database servers run Postgres 8.2 and have 4GB RAM, with 2.5GB allocated to shared_buffers).

But 17.4s is not fast enough. So we next consider how we can query the databases using multiple threads. To do this we create a class derived from threading.Thread, whose job is to query a single database.

    import threading

    class DBThread(threading.Thread):
        """Query a single remote database in a separate thread."""
        def __init__(self, host, port, query):
            threading.Thread.__init__(self)
            self.host = host
            self.port = port
            self.query = query

        def run(self):
            global allRows
            con = _getConnection(self.host, self.port)
            if not con: return
            cursor = con.cursor()
            cursor.execute(self.query)
            # append this database's rows to the shared global list
            # (list.append is atomic in CPython, so no explicit lock is needed)
            for row in cursor.fetchall():
                allRows.append(row)
            con.close()

The class is initialized with the host name and port number of the remote database, along with the query string. When each instance of the class is started, it performs the query and puts the rows of the result into a global list. So we can easily start up six threads, and then wait for them all to finish before proceeding, as follows

    query = "select count(*) from pubchem_3d"
    for host, port in condetails:
        dbt = DBThread(host, port, query)
        threadList.append(dbt)
        dbt.start()

    while threading.activeCount() > 1: time.sleep(1)

When the above code is timed, we get all the results back in 4s – much nicer! Now, this is quite a simple problem, as there are no deadlock issues or other monsters of multi-threaded programming to address.
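As an aside, an equivalent (and arguably cleaner) way to wait for the workers, rather than polling threading.activeCount(), is to join() each thread; a small sketch using the same DBThread class:

    threadList = [DBThread(host, port, query) for host, port in condetails]
    for dbt in threadList: dbt.start()
    for dbt in threadList: dbt.join()   # block until every worker has finished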

Caveats: The code above is certainly not optimal. First, there is overhead to starting a thread, so one should really be using a thread pool; in this case the number of threads is always fixed, so I didn’t bother. Second, and probably most importantly, the threading module does not really provide true parallelism. There has been much discussion surrounding the Global Interpreter Lock (GIL), and I’m not enough of an expert to comment on it. However, the problem with the threading module can be seen from the fact that all the threads run on the same core of a multi-core CPU. Supposedly, Python3000 aims to solve this issue and provide true threading support. An alternative is the multiprocessing module, available in Python 2.6 and above. This module uses subprocesses rather than threads and so provides concurrency that avoids the GIL. As with the threading module, it’s also quite easy to use.
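For completeness, here is a minimal sketch of the same fan-out using multiprocessing.Pool (Python 2.6+). It reuses the hypothetical condetails list and _getConnection helper from above, and each worker process opens its own connection, so nothing has to be shared between processes.

    from multiprocessing import Pool

    def queryOne(hostport):
        # run the query against a single database and return its rows
        host, port = hostport
        con = _getConnection(host, port)
        cursor = con.cursor()
        cursor.execute("select count(*) from pubchem_3d")
        rows = cursor.fetchall()
        con.close()
        return rows

    if __name__ == '__main__':
        pool = Pool(processes=len(condetails))
        results = pool.map(queryOne, condetails)   # one task per database
        allRows = [row for rows in results for row in rows]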

But the conclusion is that for data-parallel tasks, the Python threading module allows us to easily write code that achieves nice speedups.

8 thoughts on “Multi-threaded Database Access with Python”

  1. Python 3.0 does NOT solve this issue. It doesn’t even touch it. The alternative of multiprocessing is valid though, if it fits your use case.

    Your application presumably spends most of its time waiting for the database, thus the GIL had minimal impact. However, on a multicore box, if you kept adding threads you would eventually hit limits.

  2. Aah, thanks for the pointer. Yes, you’re right about the code waiting on the DB.

  3. baoilleach says:

    Or use IPy or Jython if possible. AFAI understand, these don’t have a GIL.

  4. Andrew Dalke says:

    I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/.

    “””Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”””

    Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.

    Have you done any timing comparisons between these? You’re in Postgres so I don’t know how easy it is to do that comparison.

  5. […] 20, 2008 by Rajarshi Guha A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned […]

  6. roman says:

    I noticed that the time dropped from 17s to 4s after you added 5 more computers (6 in total), but 4s is more than 17s/6.

    I don’t know Postgresql, but it looks to me that for the first query the DB was not in RAM. The 17M-row database couldn’t have been, and if I/O wait is the worst thing, then I would expect that 6 servers, each with the DB in RAM, would give you better results than a single server (better than 17s/6 because of the RAM).

    It would be nice if you could do the timing by running the tests several times.
