Archive for April, 2009
Today was the final class of the graduate course I taught this semester. What with traveling and job hunting, it wasn’t as thorough as I would’ve liked (really want to cover Hadoop, EC2 etc). In any case, I’ve put up the slides from the class for posterity.
Sometime back I had described some work on the automated annotation of PubChem bioassays. The lack of annotations on the assays can make it difficult to integrate with other biological resources. Ideally, the bioassays would be manually annotated – however, it’s not a very exciting job. So, collaborating with Patrick Ruch and Julien Gobeill, we used their tool, GOCat, to automatically annotate the PubChem bioassay collection with GO terms. They recently presented a poster on this work at the 3rd International Biocuration Conference in Berlin.
Obviously, automated annotation will not be as good as expert, manual annotations. However it does a decent job and I think it’s in line with a recent post by Duncan Hull, where he quotes a paper from Google
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available
While we’re not using the PubChem assay data directly for learning, the automated approach to annotations means that we can move on to stuff that can make use of them, rather than waiting on a full manual curation of the assay collection (which will likely supercede automated annotations, when it becomes available).
In my previous post, I had asked whether we really need AtomContainerSet and other related specialized container classes as opposed to using parametrized List objects. Gilleain had mentioned some issues that might require these specialized classes. But at this point it’s not clear to me what the intended goal for these classes was.
For now, assuming that they are used purely as storage classes, I decided to do some performance testing, comparing AtomContainerSet to ArrayList<IAtomContainer>. Currently these results are based on System.currentTimeMillis() as my YourKit licence has expired. So the results are not as fine grained as I’d like.
To perform the tests I read in 2000 3D structures taken from Pub3D and loaded them into an IAtomContainer. I then considered four operations
- Given an empty AtomContainerSet, we simply loop over the 2000-element array and add each molecule to the empty container using addAtomContainer. We do the same thing with an empty ArrayList<IAtomContainer>, but use add. The operation is repeated 100 times and the mean time for the addition of 2000 molecules to the empty containers is taken.
- Randomly access elements of the AtomContainerSet and ArrayList, 50,000 times
- Access each element serially (using for (IAtomContainer x : container) idiom). Averaged 100 times
- Remove 500 molecules by reference, randomly from each container structure
- Remove 500 molecules by index, randomly from each container structure
The summary view of the results is shown below. For each pair of bars, the time taken for the operation on the AtomContainerSet is normalized to 1.0 and the time with the ArrayList is appropriately converted.
The raw timings are shown below:
|Operation||AtomContainerSet (sec)||ArrayList (sec)|
|Indexed Random Access||0.006||0.007|
|Remove (by reference)||68.119||0.1|
|Remove (by index)||0.695||0.0|
As you can see, the ArrayList is significantly faster in all except random and serial access. In that case, if you consider the raw numbers, the run times are essentially equivalent for random access, though serial access of an array in AtomContainerSet gives it an edge over ArrayList.
So if AtomContainerSet and related classes are purely for storage purposes, they could easily be replaced by ArrayList’s. However the fact that they include support for notifications suggests somebody is using it for more complex scenarios. In either case, a discussion of the design and role of these classes would be useful
Based on my previous post, Gilleain had posted an extensive comment on the issues I had raised. I thought it’d be useful to bring those comments to the top level and add some of my own.
1) API Cleanup.
I would prefer if all access modifiers were explicit – especially package, since that is the default level and it is important to show that it is intentional, not just missing.
In general, I do agree that internal-only classes/methods/members should be private so that it is clear what should be used, and what should not.
graph stuff/jgrapt : I agree that some of the basic graph algorithms (like BFS) should be improved. I would favour upgrading jgrapht, despite the pain (version 0.6->0.8!).
geometry tools : hmmm. A difficult one. They may only be used in the 2D layout, but the CDK is a library, and it seems cruel to deny users the use of these methods.
atom tools : the methods calculate3DCoordinates[1,2,3] are really internal to that class and only used from add3dCoordinates1 – which is not used anywhere in the cdk. It looks like useful code, but…
bond tools: again, useful code, but needs better method naming and maybe methods moved to more appropriate classes. Both these classes are a kind of anti-pattern of ‘bunch of methods class’ – or whatever it is called.
Gilleains’ point regarding GeometryTools is valid – it’s useful to have a good repertoire of geometry handling methods. But it might be useful to try and pare down the class and just make public the really general methods.
I totally agree. The only reason I can see for this class is for the psychological comfort of having a class called ‘Molecule’
More seriously, I think Christoph justified Molecule’s existence by making a distinction between disconnected atom containers and fully connected ‘molecules’.
I strongly believe that this kind of distinction should be made using a method; internal, like atomContainer.isFullyConnected(), or external, like AtomContainerManipulator.isConnected(atomContainer). Use of a type to enforce/distinguish this is a bad idea, as it leads to all sorts of horrible situations.
I think Gilleains suggestion of a specific method to determine whether a molecule is fully connected or not definitely makes sense. Furthermore, if the original distinction was indeed based on fully connected molecules, it seems that the container sets would be a better structure to store them in.
3) Container classes vs actual containers.
I used to totally agree with what you say here, but now I’m not so sure. Can we still have the situation of having noisy/silent data hierarchies (NoNotificationAtomContainerSet, etc) if the collection classes are implementations of java.util classes.
I do think that classes called ‘Set’ should behave like a set, or – if they actually behave like a list – they should have ‘List’ in the name. In other words, clases names should reflect behaviour.
On the subject of extending IChemObject; this could be more about having a base ChemObject rather than a base Object class. In other words, it is not so much a semantic issue, but a question of extending the java language in a very limited way.
I think that noisy / silent hierarchies are still possible via a List implementation. Of the top of my head, a subclass of List could call a notify or changed method before calling the appropriate method in the super class.
However, I must admit that when I raised this issue, I wasn’t entirely sure of the scope of usage of these classes. I suppose it boils down to asking whether the intended usage is simply as a convenient storage structure. Or are there “chemical” issues that are solved by using these specialized containers. (If these are closely tied to JChemPaint issues I have no idea about them). A later post will show some performance comparisons with ArrayList based storage.
Excellent timing, as I have been looking at this yesterday. Biojava has a good model for biopolymers, but it could benefit greatly from using the CDK model for atoms, which is better.
As much as I favour reducing code duplication, I can’t really endorse abandoning CDK support for biopolymers in favour of biojava. Ultimately it would be better if biojava relied on the CDK…
Other improvements to the PDBPolymer and PDBReader include:
- Reading PDB files into PDBPolymer objects, rather than ChemFiles.
- Better support for ligands and waters.
- Renaming ‘Strand’ to ‘Chain’.
- Support for disordered atoms?
Gilleains suggestion sounds better than dropping PDB support completely from the CDK. Given the ubiquity of PDB files, it makes sense to directly support it in the CDK. However, the current situation is not ideal from a maintenance point of view. Of course if BioJava could be persuaded of the greatness of the CDK, this issue would be solved
So much to do, so little time, eh?
The CDK workshop is coming up at the EBI next week and it’s very frustrating to not be able to attend because of stupid US visa issues. While the workshop is not very long and already has an excellent program, I think it’d be useful to have a discussion on larger and broader issues regarding the design of the library. In that vein, I’ll list a number of issues that could be discussed. Some have bitten me, whereas others are from a more cursory point of view (since I’m not sufficiently expert in that area of the code).
In any case, all the best for the workshop and hope you guys have fun!
A number of classes are used internal to the library, but are still public (mainly for testing purposes). These should ideally be package private so that they do not show in the public API, but are still testable by a separate test class.
An example is ChiIndexUtils which was recently converted to package private. Another example is the classes to read in and handle periodic table data. This patch proposes a refactoring that addresses this issue. The result is that from a users perspective, they need only track one class, PeriodicTable, to access periodic element data.
Other areas that could be addressed in this context include
- Graph theory classes – there seem to be multiple implementations of BFS across the library. If the one in PathTools is not good enough, it should be revamped to be sufficiently general. I think the graph theory classes could do with some reworking to make it more compact. Ideally we’d also move to a new version of JGraphT, but this will require some major changes on the CDK side.
- Are all the methods in GeometryTools really meant for public consumption? It seems many are specific to JChemPaint internals. Similarly in AtomTools, are the methods generally used? Or are they there for some specific reason?
- It seems that certain classes in the dict package could be made package private
- In BondTools, a number of methods seem to refer to atoms rather than bonds. Maybe they should be relocated?
- In the rebond package, it seems that only RebondTool need be publicly available
There are a number of cases, where the API is unintuitive or ambiguous. Examples include
- Bug 2101109 describes an inconsistency in the IRingSet API
- The BondTools case noted above is another example
IAtomContainer versus IMolecule
I know this has been discussed before, but it still seems that IMolecule is a syntactic sugar class – it adds nothing to IAtomContainer from which it derives. I just don’t see why it has to exist. Rather, it’s inclusion in the API leads to many cases where one cannot supply an IAtomContainer, but where it would be logical to do so. I think Stefans recent mail also addressed this issue.
Container classes versus actual containers
We have a number of container classes such as AtomContainerSet, RingSet and so on. Their semantics are essentially identical to that of List (rather than Set). Why do these classes not implement the List interface, rather than creating a container class from scratch? In this case they should be renamed to AtomContainerList etc to be consistent. Alternatively they should implement the Set interface if they truly are considered as sets.
Fundamentally, we should decide whether we want these container clases to behave as sets or lists. One might also argue as to the need for these specialized container classes, since the specialized containers don’t appear to do much (any?) more than standard Java lists. Thus it appears that in many cases (I haven’t looked at all of them) one could replace AtomContainerSet with List<IAtomContainer>. Such an approach would also simplify the API.
Another question – why do they extend IChemObject? They appear to be containers for molecules, rings etc, rather than actual chemical objects.
Given that handling PDB files is an intricate task, does the CDK really need one, when BioJava does it very well? It seems that some bridging code (possibly conditionally compiled) between the CDK and BioJava would be useful and more maintanble.