So much to do, so little time

Trying to squeeze sense out of chemical data

Archive for the ‘benchmark’ tag

Performance of AtomContainerSet versus ArrayList

with 6 comments

In my previous post, I had asked whether we really need AtomContainerSet and other related specialized container classes as opposed to using parametrized List objects. Gilleain had mentioned some issues that might require these specialized classes. But at this point it’s not clear to me what the intended goal for these classes was.

For now, assuming that they are used purely as storage classes, I decided to do some performance testing, comparing AtomContainerSet to ArrayList<IAtomContainer>. Currently these results are based on System.currentTimeMillis() as my YourKit licence has expired. So the results are not as fine grained as I’d like.

To perform the tests I read in 2000 3D structures taken from Pub3D and loaded them into an IAtomContainer[]. I then considered four operations

  • Given an empty AtomContainerSet, we simply loop over the 2000-element array and add each molecule to the empty container using addAtomContainer. We do the same thing with an empty ArrayList<IAtomContainer>, but use add. The operation is repeated 100 times and the mean time for the addition of 2000 molecules to the empty containers is taken.
  • Randomly access elements of the AtomContainerSet and ArrayList, 50,000 times
  • Access each element serially (using for (IAtomContainer x : container) idiom). Averaged 100 times
  • Remove 500 molecules by reference, randomly from each container structure
  • Remove 500 molecules by index, randomly from each container structure

The summary view of the results is shown below. For each pair of bars, the time taken for the operation on the AtomContainerSet is normalized to 1.0 and the time with the ArrayList is appropriately converted.

The raw timings are shown below:

Operation AtomContainerSet (sec) ArrayList (sec)
Add 0.0036 0.0001
Indexed Random Access 0.006 0.007
Indexed Serially 0.007 0.012
Remove (by reference) 68.119 0.1
Remove (by index) 0.695 0.0

As you can see, the ArrayList is significantly faster in all except random and serial access. In that case, if you consider the raw numbers, the run times are essentially equivalent for random access, though serial access of an array in AtomContainerSet gives it an edge over ArrayList.

So if AtomContainerSet and related classes are purely for storage purposes, they could easily be replaced by ArrayList’s. However the fact that they include support for notifications suggests somebody is using it for more complex scenarios. In either case, a discussion of the design and role of these classes would be useful

Written by Rajarshi Guha

April 15th, 2009 at 5:27 pm

The Speedups Keep on Coming

with 7 comments

A while back I wrote about some updates I had made to the CDK fingerprinting code to improve performance. Recently Egon and Jonathan Alvarsson (Uppsala) had made even more improvements. Some of them are simple fixes (making a String[] final, using Set rather than List) while others are more significant (efficient caching of paths). In combination, they have improved performance by over 50%, compared to my last update. Egon has put up a nice summary of performance runs here. Excellent work guys!

Written by Rajarshi Guha

December 4th, 2008 at 5:41 pm

Brute Force – Inelegant, But Sometimes Useful

with one comment

A few days back I posted on improving query times in Pub3D by going from a monolithic database (17M rows), to a partitioned version (~ 3M rows in 6 separate databases) and then performing queries in parallel. I also noted that we were improving query times by making use of an R-tree spatial index.

Andrew Dalke posted a comment:

I’ve wondered about this quote from the ANN page at http://www.cs.umd.edu/~mount/ANN/ .

Computing exact nearest neighbors in dimensions much higher than 8 seems to be a very difficult task. Few methods seem to be significantly better than a brute-force computation of all distances.”

Since you’re in 12-D space, this suggests that a linear search would be faster. The times I’ve done searches for near neighbors in higher dimensional property space have been with a few thousand molecules at most, so I’ve never worried about more complicated data structures.

Read the rest of this entry »

Written by Rajarshi Guha

November 20th, 2008 at 5:42 pm

Java Port of VFLib Works and it’s Blazing

with 8 comments

Sometime back I described how I was porting the VFLib algorithms to Java, so that we could use it for substructure search, since the current UniversalIsomorphismTester is pretty slow for this task, in general. While I had translated the Ullman algorithm implementation of VFLib and shown that it outperformed the CDK method, it turned out that didn’t work for certain cases such as finding CCC in C1CC1. This was due to a different definition of isomorphism that VFLib used. Instead, I tried to convert the VF2 implementation to Java. The motivation was that it does indeed perform substructure matching as is usually understood in cheminformatics, and was also reported to be extremely fast. Unfortunately, my translation was buggy and I put it on hold.

Read the rest of this entry »

Written by Rajarshi Guha

November 18th, 2008 at 11:17 pm

Do the CDK Fingerprints Work?

with 12 comments

In a previous post, I dicussed virtual screening benchmarks and some new public datasets for this purpose. I recently improved the performance of the CDK hashed fingerprints and the next question that arose is whether the CDK fingerprints are any good. With these new datasets, I decided to quantitatively measure how the CDK fingerprints compare to some other well known fingerprints.

Update – there was a small bug in the calculations used to generate the enrichment curves in this post. The bug is now fixed. The conclusions don’t change in a significant way. To get the latest (and more) results you should take a look here.

Read the rest of this entry »

Written by Rajarshi Guha

October 11th, 2008 at 5:47 am