The CDK workshop is coming up at the EBI next week and it’s very frustrating to not be able to attend because of stupid US visa issues. While the workshop is not very long and already has an excellent program, I think it’d be useful to have a discussion on larger and broader issues regarding the design of the library. In that vein, I’ll list a number of issues that could be discussed. Some have bitten me, whereas others are from a more cursory point of view (since I’m not sufficiently expert in that area of the code).
In any case, all the best for the workshop and hope you guys have fun!
A number of classes are used internal to the library, but are still public (mainly for testing purposes). These should ideally be package private so that they do not show in the public API, but are still testable by a separate test class.
An example is ChiIndexUtils which was recently converted to package private. Another example is the classes to read in and handle periodic table data. This patch proposes a refactoring that addresses this issue. The result is that from a users perspective, they need only track one class, PeriodicTable, to access periodic element data.
Other areas that could be addressed in this context include
- Graph theory classes – there seem to be multiple implementations of BFS across the library. If the one in PathTools is not good enough, it should be revamped to be sufficiently general. I think the graph theory classes could do with some reworking to make it more compact. Ideally we’d also move to a new version of JGraphT, but this will require some major changes on the CDK side.
- Are all the methods in GeometryTools really meant for public consumption? It seems many are specific to JChemPaint internals. Similarly in AtomTools, are the methods generally used? Or are they there for some specific reason?
- It seems that certain classes in the dict package could be made package private
- In BondTools, a number of methods seem to refer to atoms rather than bonds. Maybe they should be relocated?
- In the rebond package, it seems that only RebondTool need be publicly available
There are a number of cases, where the API is unintuitive or ambiguous. Examples include
- Bug 2101109 describes an inconsistency in the IRingSet API
- The BondTools case noted above is another example
IAtomContainer versus IMolecule
I know this has been discussed before, but it still seems that IMolecule is a syntactic sugar class – it adds nothing to IAtomContainer from which it derives. I just don’t see why it has to exist. Rather, it’s inclusion in the API leads to many cases where one cannot supply an IAtomContainer, but where it would be logical to do so. I think Stefans recent mail also addressed this issue.
Container classes versus actual containers
We have a number of container classes such as AtomContainerSet, RingSet and so on. Their semantics are essentially identical to that of List (rather than Set). Why do these classes not implement the List interface, rather than creating a container class from scratch? In this case they should be renamed to AtomContainerList etc to be consistent. Alternatively they should implement the Set interface if they truly are considered as sets.
Fundamentally, we should decide whether we want these container clases to behave as sets or lists. One might also argue as to the need for these specialized container classes, since the specialized containers don’t appear to do much (any?) more than standard Java lists. Thus it appears that in many cases (I haven’t looked at all of them) one could replace AtomContainerSet with List<IAtomContainer>. Such an approach would also simplify the API.
Another question – why do they extend IChemObject? They appear to be containers for molecules, rings etc, rather than actual chemical objects.
Given that handling PDB files is an intricate task, does the CDK really need one, when BioJava does it very well? It seems that some bridging code (possibly conditionally compiled) between the CDK and BioJava would be useful and more maintanble.