More on the CDK Design Discussion

Based on my previous post, Gilleain had posted an extensive comment on the issues I had raised. I thought it’d be useful to bring those comments to the top level and add some of my own.

1) API Cleanup.

I would prefer if all access modifiers were explicit – especially package, since that is the default level and it is important to show that it is intentional, not just missing.

In general, I do agree that internal-only classes/methods/members should be private so that it is clear what should be used, and what should not.

graph stuff/jgrapt : I agree that some of the basic graph algorithms (like BFS) should be improved. I would favour upgrading jgrapht, despite the pain (version 0.6->0.8!).

geometry tools : hmmm. A difficult one. They may only be used in the 2D layout, but the CDK is a library, and it seems cruel to deny users the use of these methods.

atom tools : the methods calculate3DCoordinates[1,2,3] are really internal to that class and only used from add3dCoordinates1 – which is not used anywhere in the cdk. It looks like useful code, but…

bond tools: again, useful code, but needs better method naming and maybe methods moved to more appropriate classes. Both these classes are a kind of anti-pattern of ‘bunch of methods class’ – or whatever it is called.

Gilleains’ point regarding GeometryTools is valid – it’s useful to have a good repertoire of geometry handling methods. But it might be useful to try and pare down the class and just make public the really general methods.

2) IMolecule.

I totally agree. The only reason I can see for this class is for the psychological comfort of having a class called ‘Molecule’

More seriously, I think Christoph justified Molecule’s existence by making a distinction between disconnected atom containers and fully connected ‘molecules’.

I strongly believe that this kind of distinction should be made using a method; internal, like atomContainer.isFullyConnected(), or external, like AtomContainerManipulator.isConnected(atomContainer). Use of a type to enforce/distinguish this is a bad idea, as it leads to all sorts of horrible situations.

I think Gilleains suggestion of a specific method to determine whether a molecule is fully connected or not definitely makes sense. Furthermore, if the original distinction was indeed based on fully connected molecules, it seems that the container sets would be a better structure to store them in.

3) Container classes vs actual containers.

I used to totally agree with what you say here, but now I’m not so sure. Can we still have the situation of having noisy/silent data hierarchies (NoNotificationAtomContainerSet, etc) if the collection classes are implementations of java.util classes.

I do think that classes called ‘Set’ should behave like a set, or – if they actually behave like a list – they should have ‘List’ in the name. In other words, clases names should reflect behaviour.

On the subject of extending IChemObject; this could be more about having a base ChemObject rather than a base Object class. In other words, it is not so much a semantic issue, but a question of extending the java language in a very limited way.

I think that noisy / silent hierarchies are still possible via a List implementation. Of the top of my head, a subclass of List could call a notify or changed method before calling the appropriate method in the super class.

However, I must admit that when I raised this issue, I wasn’t entirely sure of the scope of usage of these classes. I suppose it boils down to asking whether the intended usage is simply as a convenient storage structure. Or are there “chemical” issues that are solved by using these specialized containers. (If these are closely tied to JChemPaint issues I have no idea about them). A later post will show some performance comparisons with ArrayList based storage.

4) Biopolymers

Excellent timing, as I have been looking at this yesterday. Biojava has a good model for biopolymers, but it could benefit greatly from using the CDK model for atoms, which is better.

As much as I favour reducing code duplication, I can’t really endorse abandoning CDK support for biopolymers in favour of biojava. Ultimately it would be better if biojava relied on the CDK…

Other improvements to the PDBPolymer and PDBReader include:

– Reading PDB files into PDBPolymer objects, rather than ChemFiles.
– Better support for ligands and waters.
– Renaming ‘Strand’ to ‘Chain’.
– Support for disordered atoms?

Gilleains suggestion sounds better than dropping PDB support completely from the CDK. Given the ubiquity of PDB files, it makes sense to directly support it in the CDK. However, the current situation is not ideal from a maintenance point of view. Of course if BioJava could be persuaded of the greatness of the CDK, this issue would be solved

So much to do, so little time, eh?

As always!

One thought on “More on the CDK Design Discussion”

Performance of AtomContainerSet versus ArrayList at So much to do, so little time says:

April 15, 2009 at 5:29 pm

[…] and other related specialized container classes as opposed to using parametrized List objects. Gilleain had mentioned some issues that might require these specialized classes. But at this point […]

So much to do, so little time

Trying to squeeze sense out of chemical data

One thought on “More on the CDK Design Discussion”

Leave a Reply Cancel reply