A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.
Andrew Dalke posted a comment:
I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .
“Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”
Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.
This is a very good point. The problem with high dimensional spaces is that the points tend to be located in the corners and the bulk of the space is empty. What this means for a spatial index is that the database will end up looking at big chunks of the table. What makes this worse is that the entire index will have to be scanned, likely followed by a scan over a large portion of the table. So, why use the index in the first place?
When we started of with the monolithic database of 17M rows, it turned out that index scans were generally faster than a linear scan. However, since the machine was shared with other processes and did not have particularly fast disks, it’s not entirely clear whether the time advantage was derived purely from the index. After we went to the partitioned database, I did not do any performance comparisons. Based on Andrews comment I took a look at the performance on our latest database.
Firstly, the partitioned version consists of 6 databases, each with about 3M rows. Each machine has a relatively low load average and also has 10K RPM disks. I performed my queries on just one of the partitions, using a 100 random molecules with a query radius of 2 (which gives a result set of about 99K rows). First I ran the query with a linear scan, followed by an indexed scan. The summary of the timings are
Linear | Indexed | |
---|---|---|
Min (ms) | 1582 | 757 |
1st Quartile (ms) | 1865 | 3028 |
Median (ms) | 2145 | 3970 |
Mean (ms) | 2160 | 5459 |
3rd Quartile (ms) | 2405 | 4498 |
Max (ms) | 2987 | 58782 |
It’s clear that the linear scan beats the indexed version. The variance in the timings is also understandable, since the indexed scan is very dependent on the distribution of points in the space. The nice thing about the linear scan is that whatever the query molecule, the time is bounded above by the time required to scan the whole database, which seems to be about 3s for 3M rows. However, will the performance advantage of the linear scan be retained if we were to with 17M rows on these machines? Can’t say till we test it.
One of the lessons I learnt from this is that benchmarking performance is always better than assuming performance! Furthermore, while brute force is inelegant, it can be useful. Thanks to Andrew, Pub3D now uses linear scan queries rather than indexed queries.
Postscript
I should also note that R-tree indexes are generally useful for 2D datasets (such as in GIS applications). My use of it for 12D data is certainly not the best use of this type of index. We had initially investigated the M-tree index, which is supposed to be better suited for high dimensional (10D to 20D) data than the R-tree. Our initial tests on the monolithic database indicated that it outperformed R-tree and linear scans. But we haven’t tested it’s performance on the partitioned databases.
Another optimization that might help a spatial indexing scheme is a clustered version of the database rather than a partitioned. That is, the partitioned database is simply a chunked form of the 17M rows, without regard to locality of the points. Clustering should maintain locality to some extent – the downside is that updates to the database will require reclustering on some sort of schedule.
Finally, one could also consider approximate nearest neighbor methods. Andrew already pointed out ANN. I had investigated locality sensitive hashing sometime back which seemed to be useful – since these are not available as Postgres GiST indexes, one would have to forgo the advantages of the database infrastructure.
Thanks for doing the test and reporting the results!