Performance of AtomContainerSet versus ArrayList

In my previous post, I had asked whether we really need AtomContainerSet and other related specialized container classes as opposed to using parametrized List objects. Gilleain had mentioned some issues that might require these specialized classes. But at this point it’s not clear to me what the intended goal for these classes was.

For now, assuming that they are used purely as storage classes, I decided to do some performance testing, comparing AtomContainerSet to ArrayList<IAtomContainer>. Currently these results are based on System.currentTimeMillis() as my YourKit licence has expired. So the results are not as fine grained as I’d like.

To perform the tests I read in 2000 3D structures taken from Pub3D and loaded them into an IAtomContainer[]. I then considered four operations

  • Given an empty AtomContainerSet, we simply loop over the 2000-element array and add each molecule to the empty container using addAtomContainer. We do the same thing with an empty ArrayList<IAtomContainer>, but use add. The operation is repeated 100 times and the mean time for the addition of 2000 molecules to the empty containers is taken.
  • Randomly access elements of the AtomContainerSet and ArrayList, 50,000 times
  • Access each element serially (using for (IAtomContainer x : container) idiom). Averaged 100 times
  • Remove 500 molecules by reference, randomly from each container structure
  • Remove 500 molecules by index, randomly from each container structure

The summary view of the results is shown below. For each pair of bars, the time taken for the operation on the AtomContainerSet is normalized to 1.0 and the time with the ArrayList is appropriately converted.

The raw timings are shown below:

Operation AtomContainerSet (sec) ArrayList (sec)
Add 0.0036 0.0001
Indexed Random Access 0.006 0.007
Indexed Serially 0.007 0.012
Remove (by reference) 68.119 0.1
Remove (by index) 0.695 0.0

As you can see, the ArrayList is significantly faster in all except random and serial access. In that case, if you consider the raw numbers, the run times are essentially equivalent for random access, though serial access of an array in AtomContainerSet gives it an edge over ArrayList.

So if AtomContainerSet and related classes are purely for storage purposes, they could easily be replaced by ArrayList’s. However the fact that they include support for notifications suggests somebody is using it for more complex scenarios. In either case, a discussion of the design and role of these classes would be useful

More on the CDK Design Discussion

Based on my previous post, Gilleain had posted an extensive comment on the issues I had raised. I thought it’d be useful to bring those comments to the top level and add some of my own.

1) API Cleanup.

I would prefer if all access modifiers were explicit – especially package, since that is the default level and it is important to show that it is intentional, not just missing.

In general, I do agree that internal-only classes/methods/members should be private so that it is clear what should be used, and what should not.

graph stuff/jgrapt : I agree that some of the basic graph algorithms (like BFS) should be improved. I would favour upgrading jgrapht, despite the pain (version 0.6->0.8!).

geometry tools : hmmm. A difficult one. They may only be used in the 2D layout, but the CDK is a library, and it seems cruel to deny users the use of these methods.

atom tools : the methods calculate3DCoordinates[1,2,3] are really internal to that class and only used from add3dCoordinates1 – which is not used anywhere in the cdk. It looks like useful code, but…

bond tools: again, useful code, but needs better method naming and maybe methods moved to more appropriate classes. Both these classes are a kind of anti-pattern of ‘bunch of methods class’ – or whatever it is called.

Gilleains’ point regarding GeometryTools is valid – it’s useful to have a good repertoire of geometry handling methods. But it might be useful to try and pare down the class and just make public the really general methods.

2) IMolecule.

I totally agree. The only reason I can see for this class is for the psychological comfort of having a class called ‘Molecule’

More seriously, I think Christoph justified Molecule’s existence by making a distinction between disconnected atom containers and fully connected ‘molecules’.

I strongly believe that this kind of distinction should be made using a method; internal, like atomContainer.isFullyConnected(), or external, like AtomContainerManipulator.isConnected(atomContainer). Use of a type to enforce/distinguish this is a bad idea, as it leads to all sorts of horrible situations.

I think Gilleains suggestion of a specific method to determine whether a molecule is fully connected or not definitely makes sense. Furthermore, if the original distinction was indeed based on fully connected molecules, it seems that the container sets would be a better structure to store them in.

3) Container classes vs actual containers.

I used to totally agree with what you say here, but now I’m not so sure. Can we still have the situation of having noisy/silent data hierarchies (NoNotificationAtomContainerSet, etc) if the collection classes are implementations of java.util classes.

I do think that classes called ‘Set’ should behave like a set, or – if they actually behave like a list – they should have ‘List’ in the name. In other words, clases names should reflect behaviour.

On the subject of extending IChemObject; this could be more about having a base ChemObject rather than a base Object class. In other words, it is not so much a semantic issue, but a question of extending the java language in a very limited way.

I think that noisy / silent hierarchies are still possible via a List implementation. Of the top of my head, a subclass of List could call a notify or changed method before calling the appropriate method in the super class.

However, I must admit that when I raised this issue, I wasn’t entirely sure of the scope of usage of these classes. I suppose it boils down to asking whether the intended usage is simply as a convenient storage structure. Or are there “chemical” issues that are solved by using these specialized containers. (If these are closely tied to JChemPaint issues I have no idea about them). A later post will show some performance comparisons with ArrayList based storage.

4) Biopolymers

Excellent timing, as I have been looking at this yesterday. Biojava has a good model for biopolymers, but it could benefit greatly from using the CDK model for atoms, which is better.

As much as I favour reducing code duplication, I can’t really endorse abandoning CDK support for biopolymers in favour of biojava. Ultimately it would be better if biojava relied on the CDK…

Other improvements to the PDBPolymer and PDBReader include:

– Reading PDB files into PDBPolymer objects, rather than ChemFiles.
– Better support for ligands and waters.
– Renaming ‘Strand’ to ‘Chain’.
– Support for disordered atoms?

Gilleains suggestion sounds better than dropping PDB support completely from the CDK. Given the ubiquity of PDB files, it makes sense to directly support it in the CDK. However, the current situation is not ideal from a maintenance point of view. Of course if BioJava could be persuaded of the greatness of the CDK, this issue would be solved :)

So much to do, so little time, eh?

As always!

Discussion Topics for the CDK Workshop (from an absentee)

The CDK workshop is coming up at the EBI next week and it’s very frustrating to not be able to attend because of stupid US visa issues. While the workshop is not very long and already has an excellent program, I think it’d be useful to have a discussion on larger and broader issues regarding the design of the library. In that vein, I’ll list a number of issues that could be discussed. Some have bitten me, whereas others are from a more cursory point of view (since I’m not sufficiently expert in that area of the code).

In any case, all the best for the workshop and hope you guys have fun!

API Cleanup

A number of classes are used internal to the library, but are still public (mainly for testing purposes). These should ideally be package private so that they do not show in the public API, but are still testable by a separate test class.

An example is ChiIndexUtils which was recently converted to package private. Another example is the classes to read in and handle periodic table data. This patch proposes a refactoring that addresses this issue. The result is that from a users perspective, they need only track one class, PeriodicTable, to access periodic element data.

Other areas that could be addressed in this context include

  • Graph theory classes – there seem to be multiple implementations of BFS across the library. If the one in PathTools is not good enough, it should be revamped to be sufficiently general. I think the graph theory classes could do with some reworking to make it more compact. Ideally we’d also move to a new version of JGraphT, but this will require some major changes on the CDK side.
  • Are all the methods in GeometryTools really meant for public consumption? It seems many are specific to JChemPaint internals. Similarly in AtomTools, are the methods generally used? Or are they there for some specific reason?
  • It seems that certain classes in the dict package could be made package private
  • In BondTools, a number of methods seem to refer to atoms rather than bonds. Maybe they should be relocated?
  • In the rebond package, it seems that only RebondTool need be publicly available

Consistency

There are a number of cases, where the API is unintuitive or ambiguous. Examples include

IAtomContainer versus IMolecule

I know this has been discussed before, but it still seems that IMolecule is a syntactic sugar class – it adds nothing to IAtomContainer from which it derives. I just don’t see why it has to exist. Rather, it’s inclusion in the API leads to many cases where one cannot supply an IAtomContainer, but where it would be logical to do so. I think Stefans recent mail also addressed this issue.

Container classes versus actual containers

We have a number of container classes such as AtomContainerSet, RingSet and so on. Their semantics are essentially identical to that of List (rather than Set). Why do these classes not implement the List interface, rather than creating a container class from scratch? In this case they should be renamed to AtomContainerList etc to be consistent. Alternatively they should implement the Set interface if they truly are considered as sets.

Fundamentally, we should decide whether we want these container clases to behave as sets or lists. One might also argue as to the need for these specialized container classes, since the specialized containers don’t appear to do much (any?) more than standard Java lists. Thus it appears that in many cases (I haven’t looked at all of them) one could replace AtomContainerSet with List<IAtomContainer>.  Such an approach would also simplify the API.

Another question – why do they extend IChemObject? They appear to be containers for molecules, rings etc, rather than actual chemical objects.

Biopolymer handling

Given that handling PDB files is an intricate task, does the CDK really need one, when BioJava does it very well? It seems that some bridging code (possibly conditionally compiled) between the CDK and BioJava would be useful and more maintanble.

The BioPolymer hierarchy design needs some thoughtful redesign as evidenced by bug 2549813

Preprocessing ONS Solubility Data

With the semester winding up and preparing to move to Rockville, things have been a bit hectic. However, I’ve been trying to keep track of the ONS solubility modeling project and one of the irritating things is that each time I want to build a model (due to new data being submitted), I need to extract and clean the data from the Google spreadsheet. So, finally put together some Python code to get the solubility data, clean it up, filter out invalid rows (as noted by the DONOTUSE string) and optionally filter rows based on a specified string. This allows me to get the whole dataset at one go, or just the data for methanol etc. Note that it doesn’t dump all the columns from the original spreadsheet – just the columns I need for modeling. A very simplistic script that dumps the final data in tab-delimited format to STDOUT.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
## Rajarshi Guha
## 04/14/2009
## Update 04/20/2009 - include solute state column

import urllib
import csv
import getopt
import sys

def usage():
    print """
Usage: solsum.py [OPTIONS]

Retrieves the SolubilitiesSum spreadsheet from Google and
processes Sheet 1 to extract solubility data. Right now it just
pulls out the solute name and smiles, solvent name and solubility.
It will filter out entries that are marked as DONOTUSE. If desired
you can aggregate solubility values for multiple instances of the
same compound using a variety of functions.

The final data table is output as tab separated onto STDOUT.

OPTIONS can be

-h, --help            This message
-a, --agg FUNC        Aggregate function. Valid values can be
                      'mean', 'median', 'min', 'max'.
                      Currently this is not supported
-f, --filter STRING   Only include rows that have a column that exactly
                      matches the specified STRING
-d, --dry             Don't output data, just report some stats
"""


url = 'http://spreadsheets.google.com/pub?key=plwwufp30hfq0udnEmRD1aQ&output=csv&gid=0'

idx_solute_name = 3
idx_solute_smiles = 4
idx_solvent_name = 5
idx_conc = 7
idx_state = 24

if __name__ == '__main__':

    fstring = None
    agg = None
    dry = False

    solubilities = []

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hdf:a:", ["help", "dry", "filter=", "agg="])
    except getopt.GetoptError:
        usage()
        sys.exit(-1)

    for opt, arg in opts:
        if opt in ('-h', '--help'):
            usage()
            sys.exit(1)
        elif opt in ('-f', '--filter'):
            fstring = arg
        elif opt in ('-d', '--dry'):
            dry = True

    data = csv.reader(urllib.urlopen(url))
    data.next()
    c = 2
    for row in data:
        line = [x.strip() for x in row]
        if len(line) != 25:
            print 'ERROR (Line %d) %s' % (c, ','.join(line))
            continue
        solubilities.append( (line[idx_solute_name],
                              line[idx_solute_smiles],
                              line[idx_solvent_name],
                              line[idx_conc],
                              line[idx_state]) )
        c += 1

    if dry:
        print 'Got %d entries' % (len(solubilities))
    solubilities = [ x for x in solubilities if x[0].find("DONOTUSE") == -1]
    if dry:
        print 'Got %d entries after filtering invalid entries' % (len(solubilities))

    if not dry:
        for row in solubilities:
            if fstring:
                if any(map(lambda x: x == fstring, row)):
                    print '\t'.join([str(x) for x in row])
                    continue
            else:
                print '\t'.join([str(x) for x in row])

Back from the Mountains and the Desert

Got back from the ACS meeting in Salt Lake City. As usual, quite a hectic week, more so this time since this was the first meeting in which I was Program Chair for CINF. With the exception of a few glitches, I think it went well – especially our first talk given via Skype! Met up with lots of people, old friends and new, and got lots of input for future programming – very interesting stuff coming up in the next few meetings.

Blue Obelisk Dinner

Blue Obelisk Dinner

I must recommend the Hotel Monaco –  a boutique hotel, slightly expensive but very good value (excellent rooms, free wireless access, complimentary wine tasting). Also across the road was Seigfrieds Deli, which served very good German cuisine. We also had a Blue Obelisk meeting at Martine – which served great tapas (and their dipping olive oil was the best I’ve ever had).

After the meeting, I met up with my wife in Las Vegas (not a place I’d like to visit again) and then headed to Death Valley for two days. Such a short time is not enough to do justice to the place. The landscape is  both breathtaking and humbling. Luckily it wasn’t scorching yet, but we worked up a good sweat on some hikes. It still amazes me that life will find a way to survive in extreme conditions – I had always thought of Death Valley as desolate, but in fact, it teems with life. I suppose I’ll have to visit the Gobi or Atacama to experience real desolation.