Discussion Topics for the CDK Workshop (from an absentee)

The CDK workshop is coming up at the EBI next week and it’s very frustrating to not be able to attend because of stupid US visa issues. While the workshop is not very long and already has an excellent program, I think it’d be useful to have a discussion on larger and broader issues regarding the design of the library. In that vein, I’ll list a number of issues that could be discussed. Some have bitten me, whereas others are from a more cursory point of view (since I’m not sufficiently expert in that area of the code).

In any case, all the best for the workshop and hope you guys have fun!

API Cleanup

A number of classes are used internal to the library, but are still public (mainly for testing purposes). These should ideally be package private so that they do not show in the public API, but are still testable by a separate test class.

An example is ChiIndexUtils which was recently converted to package private. Another example is the classes to read in and handle periodic table data. This patch proposes a refactoring that addresses this issue. The result is that from a users perspective, they need only track one class, PeriodicTable, to access periodic element data.

Other areas that could be addressed in this context include

  • Graph theory classes – there seem to be multiple implementations of BFS across the library. If the one in PathTools is not good enough, it should be revamped to be sufficiently general. I think the graph theory classes could do with some reworking to make it more compact. Ideally we’d also move to a new version of JGraphT, but this will require some major changes on the CDK side.
  • Are all the methods in GeometryTools really meant for public consumption? It seems many are specific to JChemPaint internals. Similarly in AtomTools, are the methods generally used? Or are they there for some specific reason?
  • It seems that certain classes in the dict package could be made package private
  • In BondTools, a number of methods seem to refer to atoms rather than bonds. Maybe they should be relocated?
  • In the rebond package, it seems that only RebondTool need be publicly available

Consistency

There are a number of cases, where the API is unintuitive or ambiguous. Examples include

IAtomContainer versus IMolecule

I know this has been discussed before, but it still seems that IMolecule is a syntactic sugar class – it adds nothing to IAtomContainer from which it derives. I just don’t see why it has to exist. Rather, it’s inclusion in the API leads to many cases where one cannot supply an IAtomContainer, but where it would be logical to do so. I think Stefans recent mail also addressed this issue.

Container classes versus actual containers

We have a number of container classes such as AtomContainerSet, RingSet and so on. Their semantics are essentially identical to that of List (rather than Set). Why do these classes not implement the List interface, rather than creating a container class from scratch? In this case they should be renamed to AtomContainerList etc to be consistent. Alternatively they should implement the Set interface if they truly are considered as sets.

Fundamentally, we should decide whether we want these container clases to behave as sets or lists. One might also argue as to the need for these specialized container classes, since the specialized containers don’t appear to do much (any?) more than standard Java lists. Thus it appears that in many cases (I haven’t looked at all of them) one could replace AtomContainerSet with List<IAtomContainer>.  Such an approach would also simplify the API.

Another question – why do they extend IChemObject? They appear to be containers for molecules, rings etc, rather than actual chemical objects.

Biopolymer handling

Given that handling PDB files is an intricate task, does the CDK really need one, when BioJava does it very well? It seems that some bridging code (possibly conditionally compiled) between the CDK and BioJava would be useful and more maintanble.

The BioPolymer hierarchy design needs some thoughtful redesign as evidenced by bug 2549813

6 thoughts on “Discussion Topics for the CDK Workshop (from an absentee)

  1. gilleain says:

    Oddly, I have much the same list of topics :)

    1) API Cleanup.

    I would prefer if all access modifiers were explicit – especially package, since that is the default level and it is important to show that it is intentional, not just missing.

    In general, I do agree that internal-only classes/methods/members should be private so that it is clear what should be used, and what should not.

    graph stuff/jgrapt : I agree that some of the basic graph algorithms (like BFS) should be improved. I would favour upgrading jgrapht, despite the pain (version 0.6->0.8!).

    geometry tools : hmmm. A difficult one. They may only be used in the 2D layout, but the CDK is a library, and it seems cruel to deny users the use of these methods.

    atom tools : the methods calculate3DCoordinates[1,2,3] are really internal to that class and only used from add3dCoordinates1 – which is not used anywhere in the cdk. It looks like useful code, but…

    bond tools: again, useful code, but needs better method naming and maybe methods moved to more appropriate classes. Both these classes are a kind of anti-pattern of ‘bunch of methods class’ – or whatever it is called.

    2) IMolecule.

    I totally agree. The only reason I can see for this class is for the psychological comfort of having a class called ‘Molecule’ :)

    More seriously, I think Christoph justified Molecule’s existence by making a distinction between disconnected atom containers and fully connected ‘molecules’.

    I strongly believe that this kind of distinction should be made using a method; internal, like atomContainer.isFullyConnected(), or external, like AtomContainerManipulator.isConnected(atomContainer). Use of a type to enforce/distinguish this is a bad idea, as it leads to all sorts of horrible situations.

    3) Container classes vs actual containers.

    I used to totally agree with what you say here, but now I’m not so sure. Can we still have the situation of having noisy/silent data hierarchies (NoNotificationAtomContainerSet, etc) if the collection classes are implementations of java.util classes.

    I do think that classes called ‘Set’ should behave like a set, or – if they actually behave like a list – they should have ‘List’ in the name. In other words, clases names should reflect behaviour.

    On the subject of extending IChemObject; this could be more about having a base ChemObject rather than a base Object class. In other words, it is not so much a semantic issue, but a question of extending the java language in a very limited way.

    4) Biopolymers

    Excellent timing, as I have been looking at this yesterday. Biojava has a good model for biopolymers, but it could benefit greatly from using the CDK model for atoms, which is better.

    As much as I favour reducing code duplication, I can’t really endorse abandoning CDK support for biopolymers in favour of biojava. Ultimately it would be better if biojava relied on the CDK…

    Other improvements to the PDBPolymer and PDBReader include:

    – Reading PDB files into PDBPolymer objects, rather than ChemFiles.
    – Better support for ligands and waters.
    – Renaming ‘Strand’ to ‘Chain’.
    – Support for disordered atoms?

    So much to do, so little time, eh? :)

  2. Thanx for writing this up… I suggest we start hacking on a API proposal to clean up these issues… one thing I would like to bring in:

    IAtom that is a not an actual atom, and does not have an IElement or IIsotope…

  3. Oh, and now I forgot to mention that in the past people have requested to make API public so that they could more easily overwrite/extend the functionality…

    All *internal* code may very well be very useful cheminformatics algorithm, so use ‘private’ with care… in retrospect, I should have thought more carefully about the ChiIndexUtils patch…

  4. Gilleain – thanks for the comments. I think I’ll reply in more detail in a separate blog post, so that your comments are more visible.

    Egon – I agree, some methods should be publicly available so that users can extend the library. My mail with a patch for Bspt was an example of how not to do it :(

    WRT, ChiIndexUtils – I think that is clearly justified as package private. In fact, if we did not have separate classes for each class of Chi descriptor, the utils class would not be a separate class at all.

  5. […] on my previous post, Gilleain had posted an extensive comment on the issues I had raised. I thought it’d be […]

  6. […] my previous post, I had asked whether we really need AtomContainerSet and other related specialized container […]

Leave a Reply

Your email address will not be published. Required fields are marked *