Squeezing Journal Space?

An article by Steven Bachrach in the Journal of Cheminformatics has an excellent discussion of Open Access journals. He notes that one of the problems with current scientific publishing is the plethora of journals, and points out the huge volume of publications being generated. He states it succinctly:

We simply publish way too much

No one can keep up with a literature like this. It is time for the scientific community to rethink the role of publication. Should every little idea, every minor work receive the same treatment as the great discovery?

Most articles barely get cited – many never get cited; they are simply maintained within this ever-growing collective of scientific work, with the chaff growing at a rate exceeding the production of the wheat

He then proposes a two-pronged approach in which 1) we reduce the number of full-service journals, restricting them to just the top-ranked journals, and 2) the remaining journals, which publish the “remaining 80% of the scientific literature”, are replaced with institutional repositories.

I think this is a very sensible approach, but I see one major question that needs to be considered before it can take off: why do we publish so much, as opposed to, say, putting papers up on web pages?

I think it wouldn’t be far from the mark to say that a major driver of publishing is credit. From an idealistic point of view, one can say that scientific results need to be disseminated, and by publishing in journals that everyone reads, we achieve this. But the fact is, publications are academic currency – they help academics get tenure and get money. Ideally, committees and funders would be able to differentiate between the really significant publications and the “chaff”, and maybe even penalize the latter. That doesn’t seem to be the case – hence publish whatever you can, whenever you can.

But if this is the current atmosphere, what are the chances of success for Bachrach’s proposal? Clearly his first point would be supported – publications in Tier 1 journals would be a bundle of currency. But that is the case even now. The problem is that the bulk of scientific output would end up in repositories. Personally, that would be fine by me – easy access to papers and data, the ability to interact with authors, and so on. But would anything and everything by an institution’s scientists be deposited in such a repository? If so, how does one differentiate between the good and the bad? Maybe good papers would get a lot of comments and feedback? Bachrach already notes that participation in journal-supported forums is generally low. Why would participation in forums maintained by institutional repositories be any greater?

Personally, I’d love to be in a universe where I could simply write up a study, publish it on the web along with code and data, and then move on. Unfortunately, in this universe, resources are limited and I must compete. Hence some form of external validation (or at least the appearance thereof) comes into play, and as a result I have to play the publishing game.

Bachrach mentions that Open Notebook Science as a publishing model is far too radical for near-term adoption. I agree. But it seems that pushing 80% of the literature into institutional repositories also requires a fundamental (even radical?) rethinking of how academia rewards achievement.

Update: As noted by Egon, the world is a better place if one links to DOIs rather than PDFs. In that spirit, I have updated the link to the Bachrach paper to point to the DOI.

Change – Yes, I Did

I’ve been at IU for 3 years now, the last two as a visiting faculty member. However, the time has come for a change. I have accepted a position at the NIH Chemical Genomics Center, working in the informatics group, and will start there in mid-May.

I’m excited about this move, as the group is doing some pretty cool stuff with cheminformatics and high-throughput assays. The group releases its code into the public domain and you can see some of their stuff here. From a scientific point of view, this is an excellent place to do cheminformatics (as well as informatics and computation in general) because the biologists and chemists are right next door. This means access to fresh, raw data as well as the possibility of working with wet chemistry and biology to test out in silico analyses and predictions. (They also have a neat robotics setup next door.)

As part of the informatics group, I will have fingers in many pies – modeling and computation, developing software, data management and project support. I’m particularly excited about the forthcoming RNAi screening infrastructure that is being set up at the Center.

What happens to the web services?

All services are currently running, but will not be available at the rguha.ath.cx URLs after mid-May. The services have been transferred to David Wild and I will try to redirect the URLs before I leave – if you find that a service you are using is not working, get in touch with David. The source code for the SOAP services is freely available on SourceForge, so you can always deploy them into your local Tomcat container. However, I do not plan on supporting these services any longer.
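
For anyone who does want to run them locally, deployment is the standard Tomcat routine – a sketch, where the WAR file name is illustrative:

# copy the WAR built from the SourceForge sources into Tomcat's webapps directory
cp cdk-soap-services.war $CATALINA_HOME/webapps/
# Tomcat auto-deploys WAR files dropped into webapps/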

As I have described previously, I am shifting to REST-based services. Thus all the CDK-based cheminformatics services have been converted to REST form and are currently hosted at rguha.ath.cx, but after I move this URL will no longer be available. However, running these services is very easy and is described here. If anybody is interested in hosting these REST services, please get in touch with me. In terms of future support, I plan to work on these REST services in my spare time. Note that they replace a number of my mod_python-based REST services (see the GitHub repository, which I will update shortly to remove deprecated services). In general, I plan to maintain the Python REST services. Some of these services are still front-ends to SOAP services; where possible, I will remove the dependency on the SOAP service. The prediction service is something I plan to work on, as it is a very lightweight way to deploy predictive models built using R (and less hassle than going via the SOAP-based R infrastructure). Unfortunately, until I work out a server solution, the predictive model service will likely not be publicly accessible (as a service, that is).

Databases

Over the last three years I’ve been maintaining a partial mirror of PubChem and the Pub3D database. After my move I will no longer be handling these, so questions should be sent to David. A number of services that I provide make use of these databases – examples include 3D structure search, InChI/InChI key resolution, synonyms, and so on. These services will continue to run, though the URLs will likely change; as before, if you see something is down, David should be able to point you in the right direction. I will likely continue work on the 3D database (especially the similarity search aspects, which are currently flawed), but at this point I don’t know when or how new results will be posted.

I should point out that future work on the projects described here is a spare-time effort and not part of my official duties at NCGC, so progress is not going to be particularly fast.

Finally, my website has moved to http://rguha.net and the old one should automatically redirect you to the new one.

CDK Development with Git

As announced on the cdk-devel mailing list, the project has shifted to Git as its version control system. While the SVN repository is still available, it’s expected that all development from now on will be done using Git branches. To get things going, Egon has put up a very nice page on the CDK wiki.

The use of Git makes development much more enjoyable – if nothing else, the fast branching capability allows one to easily test out multiple, disparate ideas and cleanly merge them back into the main line code. A side effect is that it also allows easy code review, which is extremely valuable in a distributed project such as the CDK. In this post I thought it’d be useful to describe my workflow using Git.

Setup

I won’t say much about this as the wiki page provides sufficient detail. In brief, we make a clone of the CDK master (located on Sourceforge)

git clone git://cdk.git.sourceforge.net/gitroot/cdk

This gives us a directory called cdk. This is your local copy of the CDK master branch, and periodically you should do a git pull to make sure it’s in sync with what is on Sourceforge. Now, rather than work directly on this local copy, we make branches for each idea we want to try out.

git checkout -b myIdea1

This way you can be working on multiple projects, each independent of the others. For example, my setup looks like

[rguha@localhost cdk]$ git branch -a
  master
* pcore
  origin/HEAD
  origin/cdk-1.2.x
  origin/master

The asterisk indicates I’m working on my pcore branch. master represents my local copy of the Sourceforge master.

Making branches available

Given a Git repository, there are a number of ways you can make it available. Approaches include emailing patches, creating bundles, or setting up your own Git server. The last is worthwhile if you have a constant connection and multiple developers working on your code; such a server can be based on Apache and WebDAV and is described here. Another alternative is Gitosis, which makes life quite easy, though I haven’t tried it myself.
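
For the first two approaches, the commands are straightforward – a minimal sketch, with illustrative branch and file names:

# email-able patches: one file per commit on myIdea1 that isn't in master
git format-patch master..myIdea1

# or pack the same commits into a single file that can be shipped around
git bundle create myIdea1.bundle master..myIdea1

# the recipient can then fetch from the bundle as if it were a remote
git fetch myIdea1.bundle myIdea1:myIdea1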

In my case, I use a shortcut (this is based on a Linux server). When I want to make stuff available to the world, I simply run git-daemon and point it to my local CDK repository

sudo -u git git-daemon --base-path /home/rguha/src/java/cdk.git --export-all

Here base-path indicates the directory that contains the repository directory, and export-all indicates that all repositories under base-path should be made available; in my case, there’s just the CDK repository. I created a user called git so that its permissions are a little restricted. However, this is likely not a very secure setup – I haven’t checked whether other users can write to my repository (which I’d rather not have them do!), and since I’m not going through xinetd I miss out on those benefits as well. Also see here on the use of git-daemon.

Having said that, this setup is quite handy: whenever I want others to look at a branch of mine (say, for merging with the Sourceforge master), I can simply ask them to do

git pull git://rguha.ath.cx/cdk pcore

and they will be able to pull all the latest commits from the pcore branch. If the work in this branch gets accepted I can simply shut down the daemon, until I decide to make some new work available to the world. Since this is my own machine and I’m the only developer working on it, I don’t need a persistent Git server, so this works great.

My workflow

So, now that I can easily make branches available to the world, how do I arrange my workflow? The approach I’ve taken is the following. My local master never has work done on it directly; the only operation I perform on it is a pull. In other words, I just make sure that it’s in sync with the Sourceforge master. (Of course, for really minor edits I might just make them in my local master and make them available.) All other ideas are tested out in a branch. Before working on a branch, I switch to master, do a pull, switch to the branch I’d like to work on, and then do a rebase (see here for a nice explanation) in that branch, so that it’s in sync with the Sourceforge master.

Now, the CDK policy is that only the branch manager can write to the Sourceforge master. Thus I cannot push changes to it and must request a review of my code – which I can do by starting the daemon as described above and pointing people to the branch. Once the code is accepted and merged into the Sourceforge master, I can pull it into my local master. As an example, here are the commands that make up my workflow

$: git checkout master
Switched to branch "master"

$: git pull
remote: Counting objects: 13, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 8 (delta 6), reused 0 (delta 0)
Unpacking objects: 100% (8/8), done.
From git://cdk.git.sourceforge.net/gitroot/cdk
   f927231..a4f56f2  cdk-1.2.x  -> origin/cdk-1.2.x
   5bab4a2..2a2bfe6  master     -> origin/master
Updating 5bab4a2..2a2bfe6
Fast forward
 README                       |    2 +-
 src/META-INF/extra.datafiles |    2 --
 2 files changed, 1 insertions(+), 3 deletions(-)

$ git checkout pcore
Switched to branch "pcore"

$ git rebase master
First, rewinding head to replay your work on top of it...
Applying: provided toString methods for the pcore query classes
Applying: Added test methods for the new toString methods. Also added test method annotations
/home/rguha/src/java/cdk.git/cdk/.git/rebase-apply/patch:76: trailing whitespace.
        Assert.assertEquals("AC::Amine [[CX2]N]::aromatic [c1ccccc1]::blah [C]::[54.74 - 54.74]", repr);        
/home/rguha/src/java/cdk.git/cdk/.git/rebase-apply/patch:110: trailing whitespace.
        Assert.assertEquals("DC::Amine [[CX2]N]::aromatic [c1ccccc1]::[1.0 - 2.0]", repr);        
warning: 2 lines add whitespace errors.
Applying: Updated the toString tests
/home/rguha/src/java/cdk.git/cdk/.git/rebase-apply/patch:15: trailing whitespace.
        Assert.assertEquals(0, repr.indexOf("AC::Amine [[CX2]N]::aromatic [c1ccccc1]::blah [C]::[54.74 - 54.74]"));        
/home/rguha/src/java/cdk.git/cdk/.git/rebase-apply/patch:28: trailing whitespace.
        String repr = qbond1.toString();        
/home/rguha/src/java/cdk.git/cdk/.git/rebase-apply/patch:29: trailing whitespace.
        Assert.assertEquals(0, repr.indexOf("DC::Amine [[CX2]N]::aromatic [c1ccccc1]::[1.0 - 2.0]"));        
warning: 3 lines add whitespace errors.
Applying: Refactored to provide a query container specifically for pharmacophore queries.

# At this point, my pcore branch is synced up with the SF master
# do some work on this branch, make it available, hear that it's merged into SF master

$: git checkout master
Switched to branch "master"

$: git pull

# At this point I should have my commits from the pcore branch

The key thing is that rather than merging my branches into my local master and then pointing people to that, I instead point people to my branches. After a branch has been merged with the Sourceforge master, I pull the changes into my local master, followed by rebasing my other branches. While this does seem a little convoluted, it lets me work on multiple projects without conflating them in my local master.
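
Condensed, the cycle after one branch (say pcore) has been merged upstream looks like this, where otherIdea is an illustrative second branch:

git checkout master
git pull                  # picks up the merged pcore commits from Sourceforge
git checkout otherIdea    # switch to another work-in-progress branch
git rebase master         # replay its commits on top of the updated master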

Circular Fingerprints with the CDK and Clojure

One of the things I’m liking about Clojure is that it can be used as a quick prototyping language, a lot like Python. This is quite handy when playing with the CDK, as I can avoid a lot of the verbosity of Java code. As an example, I put together a very simplistic circular fingerprint (exemplified by molprint2d and ECFPs). In this case, the fingerprint is simply composed of atom type names and is quite similar to an implementation by Noel.

First, some imports and simple helper functions that allow me to work with native Clojure sequences rather than having to go via Java methods

(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))

(use 'clojure.contrib.duck-streams)
(use 'clojure.contrib.str-utils)

;; surely there must be a core function for this!
(defn has
  "true if val exists in coll, nil otherwise. Uses ="
  [coll val]
  (some #(= % val) coll))

;; http://markmail.org/message/56r3eflx4a6tasoe
(defn flatten
  "Flatten a list which may contain lists"
  [x]
  (let [s? #(instance? clojure.lang.Sequential %)]
    (filter
     (complement s?)
     (tree-seq s? seq x))))

;; Maybe these could be converted to macros?
(defn get-atom-list
  "Get the atoms of a molecule as a Clojure sequence"
  [mol]
  (map #(identity %) (. mol atoms)))

(defn get-connected-atoms-list
  "Get the atoms connected to a specified atom of a molecule
   as a Clojure sequence"

  [mol atom]
  (map #(identity %) (. mol (getConnectedAtomsList atom))))

(defn getmol
  "Parse a SMILES string and return the molecule object. This is
   quite inefficient since each time it is called it creates a new
   parser object. Handy for quick testing"

  [smiles]
  (. (new SmilesParser (DefaultChemObjectBuilder/getInstance))  
     (parseSmiles smiles)))

These go into my user.clj, so they are available when I start Clojure. With these in hand, we need a function that will get us the atoms in successive shells around a specified atom. In other words, atoms in the first shell are those connected directly to the specified atom, atoms in the second shell lie two bonds away, and so on. This is achieved by the following function

(defn get-shells
  "Get the atoms in successive shells surrounding a specified atom.
   The first shell equals the immediate neighbors, the second shell
   the neighbors at bond length 2 and so on. The return value is a list
   in which each element is a list of atoms. The first element of the
   top level list is the n'th shell. Note that the return value does not
   contain the starting atom (i.e., zeroth shell)"

  [mol start depth]
  (loop [shells (list (list start))
        visited (list)]
    (if (= (count shells) (+ 1 depth))
      (drop-last shells)
      (recur (conj shells
           (filter #(not (has visited %))
               (flatten (map #(get-connected-atoms-list mol %) (first shells)))))
         (set (flatten (list visited shells)))))))

Here, depth indicates how many shells we want to go out to. The return value is a list of lists, each sublist representing the atoms in a shell. The first element corresponds to the n’th shell. Note that the zeroth shell (i.e., the start atom) is not contained in the return value.
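
As a quick sanity check (assuming the helper functions above are loaded), the shell sizes come out as expected for a small molecule. For C(O)(F)CN, starting from the central carbon (the first atom in parse order), there are three immediate neighbors and one atom two bonds out – and recall that the outermost shell comes first:

user=> (def mol (getmol "C(O)(F)CN"))
#'user/mol
user=> (map count (get-shells mol (first (get-atom-list mol)) 2))
(1 3)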

Before applying this function to all the atoms, we need to consider how we’re going to represent each shell. For example, ECFPs consider various atom properties (degree, H-bonding ability, etc.). But a very simple procedure is to replace each atom in a shell with its atom type name, sort the names, and join them into a single string. The end result, for a single atom, is a list of strings corresponding to the shells around that atom.

First, we have a simple function to convert a list of atoms into a list of CDK atom type names

(defn atype [atoms]
  (map #(. % getAtomTypeName)
       atoms))

We can then use this to convert the lists of “atom shells” around a given atom to a list of atom type strings

(defn make-atom-fp
  [mol atom depth]
  (map #(str-join "-" (sort %))
       (map atype
           (get-shells mol atom depth))))

The return value is still a list of strings, one per shell around the specified atom. In practice this list would likely be reduced to a single string or number. The last thing to do is to apply the function to all the atoms in a molecule (using a depth of 2, for example)

(map #(make-atom-fp mol % 2) (get-atom-list mol))

An example of using these functions is:

user=> (def mol (getmol "C(O)(F)CN"))
#'user/mol
user=> (map #(make-atom-fp mol % 2) (get-atom-list mol))
(("N.sp3" "C.sp3-F-O.sp3") ("C.sp3-F" "C.sp3") ("C.sp3-O.sp3" "C.sp3") ("F-O.sp3" "C.sp3-N.sp3") ("C.sp3" "C.sp3"))

Alternatively, we could reduce this to a list of strings, one for each atom, by simply joining the strings for each shell of a given atom

user=> (map #(str-join "-" %) (map #(make-atom-fp mol % 2) (get-atom-list mol)))
("N.sp3-C.sp3-F-O.sp3" "C.sp3-F-C.sp3" "C.sp3-O.sp3-C.sp3" "F-O.sp3-C.sp3-N.sp3" "C.sp3-C.sp3")

This is a very simple example, and as I’m still wrapping my head around Lisp, it took some time to write. But once you get the hang of loop and recur, things start flowing!

Manipulating SMARTS Queries

Yesterday, Andreas Maunz asked a question on the openbabel-discuss list:

… a possibility to combine two distinct smarts patterns. Of course, there is the comma OR operator (s1,s2), but I was thinking of a more sophisticated combination, where equal parts of s1 and s2 are “mapped onto each other” in a way such that merely the exclusive parts of s1 and s2 are connected via OR to the common core structure, where possible?

Fundamentally, a solution requires us to be able to manipulate and transform the SMARTS query in various ways. Operations of these types have been discussed by Roger Sayle in a MUG presentation, and they are available in the OpenEye tools. As far as I know, there are no open source implementations of such SMARTS optimizations (or manipulations). Later on in Andreas’ thread, Andrew Dalke suggested that one could attack this by performing an MCSS calculation on two graphs representing the two queries, along with some appropriate normalizations.

It turns out that this type of manipulation isn’t too difficult in the CDK, thanks to Dazhi Jiao’s work on the JJTree-based SMARTS parser. Since the parser generates an abstract syntax tree, the SmartsQueryVisitor is able to walk the tree and generate an IQueryAtomContainer. This is analogous to a molecule (i.e., IAtomContainer), and hence we can apply pre-existing code for substructure searches. Thus, to find the largest common query fragment between two SMARTS queries we can do:

import org.openscience.cdk.smiles.smarts.parser.SMARTSParser;
import org.openscience.cdk.smiles.smarts.parser.ASTStart;
import org.openscience.cdk.smiles.smarts.parser.ParseException;
import org.openscience.cdk.smiles.smarts.parser.visitor.SmartsQueryVisitor;
import org.openscience.cdk.isomorphism.matchers.IQueryAtomContainer;
import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.exception.CDKException;

import java.util.List;

public class SMARTSManipulator {

    public SMARTSManipulator() {
    }

    private IQueryAtomContainer getQueryMolecule(String smarts) throws ParseException {
        SMARTSParser parser = new SMARTSParser(new java.io.StringReader(smarts));
        ASTStart ast = parser.Start();
        SmartsQueryVisitor visitor = new SmartsQueryVisitor();
        return (IQueryAtomContainer) visitor.visit(ast, null);
    }

    // helper to convert the IQueryAtomContainer to an IAtomContainer, to make the UIT
    // happy. This is OK, since the atoms and bonds in the IAtomContainer are still the query
    // atoms and query bonds
    private IAtomContainer convertToAtomContainer(IQueryAtomContainer queryContainer) {
        IAtomContainer x = queryContainer.getBuilder().newAtomContainer();
        for (IAtom atom : queryContainer.atoms()) x.addAtom(atom);
        for (IBond bond : queryContainer.bonds()) x.addBond(bond);
        return x;
    }

    public void smartsMCSS() throws ParseException, CDKException {

        IQueryAtomContainer queryTarget = getQueryMolecule("c1ccccc1C(=O)C*");
        IQueryAtomContainer queryQuery = getQueryMolecule("*C([N,S])C=O");


        boolean isSubgraph = UniversalIsomorphismTester.isSubgraph(convertToAtomContainer(queryTarget), queryQuery);
        System.out.println("isSubgraph = " + isSubgraph);
        System.out.println("");

        List<IAtomContainer> mcss = UniversalIsomorphismTester.getOverlaps(
                convertToAtomContainer(queryTarget),
                convertToAtomContainer(queryQuery));

        System.out.println("mcss.size() = " + mcss.size() + "\n");

        for (IAtomContainer aMcs : mcss) {
            System.out.println("aMcs.getAtomCount() = " + aMcs.getAtomCount());
            for (IAtom matchingAtom : aMcs.atoms())
                System.out.println("matchingAtom = " + matchingAtom);
            System.out.println("");
        }
    }

    public static void main(String[] args) throws ParseException, CDKException {
        SMARTSManipulator m = new SMARTSManipulator();
        m.smartsMCSS();
    }
}

Note that we convert the IQueryAtomContainer to an IAtomContainer, since the UniversalIsomorphismTester expects that we are performing queries against an actual (i.e., real) molecule.

As expected, the smaller query is not a subgraph of the larger query. But when we evaluate the MCSS, we do end up with a query fragment consisting of *CC=O.

Now, this solution is just a quick hack to see if it could be done – it will fail on more general queries (such as C[#6]N and CCN) since we don’t perform any normalization. Furthermore, there are more sophisticated optimizations that could be done on the AST directly, but that’s for another night.