Archive for the ‘circular’ tag
I’ve just updated the fingerprint package to v3.5.0 (should show up on CRAN shortly, or else you can get it directly from my Github repository). The main update in this version is better support for feature,count type fingerprints. An example would be ECFP or signature fingerprints. In these types of fingerprints, the output is usually a set of (integer or long) hash values or else structural fragments along with their count of occurrences.
The updated package now provides an S4 class to represent features and their counts. An example of this class is
f1 <- new("feature",
The package provides getters and setters for these objects, allow you to get or set the feature and the count.
> feature(f1) <- 'ABCD'
> count(f1) <- 12
Using this class, feature,count fingerprints are now represented as objects of class featvec. For these fingerprints, instead of bits, one obtains a list of feature objects. For fingerprints read from files that provide the hashed version of the underlying structure (or neighborhood etc), the numeric hashes are read in as features, with a default count of 1. The distance method has also been updated to evaluate similarities for feature,count fingerprints, though currently it does not use the count in the similarity calculation.
As an example, consider a set of ECFP’s available from here
> fps <- fp.read('http://pastebin.com/raw.php?i=gHjTQNKP', lf=ecfp.lf, binary=FALSE)
name = mol01
source = ecfp.lf
features = 17:1 0:1 16:1 3:1 1:1 1747237384:1 1499521844:1 -1539132615:1 1294255210:1 332760439:1 -1549163031:1 1035613116:1 1618154665:1 590925877:1 1872154524:1 -1143715940:1 203677720:1 -1272768868:1 136120670:1 136597326:1 -1460348762:1 -1262922302:1 -1201618245:1 -402549409:1 -1270820019:1 929601590:1 -1597477966:1 -1274743746:1 -1155471474:1 1258428229:1 -1838187238:1 -798628285:1 -1773728142:1 -773983804:1 -453677277:1 1674451008:1 65948508:1 991735244:1 -1412946825:1 846704869:1 -2103621484:1 -886204842:1 1725648567:1 -353343892:1 -585443181:1 -533273616:1 2031084733:1 -801248129:1 1752802620:1 -976015189:1 -992213424:1 2109043264:1 -790336137:1 630139722:1 -505031736:1 -1427697183:1 -2090462286:1 -1724769936:1
> distance(fps[], fps[])
> distance(fps[], fps[])
One of the things I’m liking about Clojure is that it can be used as a quick prototyping language, a lot like Python. This is quite handy when playing with the CDK, as I can avoid a lot of the verbosity of Java code. As an example, I put together a very simplistic circular fingerprint (exemplified by molprint2d and ECFP’s). In this case, the fingerprint is simply composed of atom type names and is quite similar to an implementation by Noel.
First, some imports and simple helper functions that allow me to work with native Clojure sequences rather than having to go via Java methods
(import '(org.openscience.cdk.smiles SmilesParser))
(import '(org.openscience.cdk DefaultChemObjectBuilder))
;; surely there must be a core function for this!
"true if val exists in coll, nil otherwise. Uses ="
(some #(= % val) coll))
"Flatten a list which may contain lists"
(let [s? #(instance? clojure.lang.Sequential %)]
(tree-seq s? seq x))))
;; Maybe these could be converted to macro's?
"Get the atoms of a molecule as a Clojure sequence"
(map #(identity %) (. mol atoms)))
"Get the atoms connected to a specified atom of a molecule
as a Clojure sequence"
(map #(identity %) (. mol (getConnectedAtomsList atom))))
"Parse a SMILES string and return the molecule object. This is
quite inefficient since each time it is called it creates a new
parser object. Handy for quick testing"
(. (new SmilesParser (DefaultChemObjectBuilder/getInstance))
These go into my user.clj, so are available when I start Clojure. With these in hand, we need a function which will get us the atoms in successive shells around a specified atom. In other words, atoms in the first shell are those connected directly to the specified atom. Atoms in the second shell lie two bonds away and so on. This is achieved by the following function
"Get the atoms in successive shells surrounding a specified atom.
The first shell equals the immediate neighbors, the second shell
the neighbors at bond length 2 and so on. The return value is a list
in which each element is a list of atoms. The first element of the
top level list is the n'th shell. Note that the return value does not
contain the starting atom (i.e., zeroth shell)"
[mol start depth]
(loop [shells (list (list start))
(if (= (count shells) (+ 1 depth))
(recur (conj shells
(filter #(not (has visited %))
(flatten (map #(get-connected-atoms-list mol %) (first shells)))))
(set (flatten (list visited shells)))))))
Here, depth indicates how many shells we want to go out to. The return value is a list of lists, each sublist representing the atoms in a shell. The first element corresponds to the n’th shell. Note that the zeroth shell (i.e., the start atom) is not contained in the return value.
Before applying this function to all the atoms, we need to consider how we’re going to represent each shell. For example ECFP’s consider various atom properties (degree, H-bonding ability etc). But a very simple procedure is to simply replace each atom in a shell with its atom type name, sort the names and join them into a single string. The end result would be, for a single atom, a list of strings corresponding to the shells around that atom.
First, we have a simple function to convert a list of atoms into a list of CDK atom type names
(defn atype [atoms]
(map #(. % getAtomTypeName)
Using this, we can then use it to convert the lists of “atom shells” around a given atom to a list of atom type strings
[mol atom depth]
(map #(str-join "-" (sort %))
(get-shells mol atom depth))))
The return value is still a list of strings representing each shell around the specified atom. In practice this list of strings would likely be reduced to a single string or number. The last thing to do is to apply the function to all the atoms in a molecule (using a depth of 2 for example)
(map #(make-atom-fp mol % 2) (get-atom-list mol))
An example of using these functions is:
user=> (def mol (getmol "C(O)(F)CN"))
user=> (map #(make-atom-fp mol % 2) (get-atom-list mol))
(("N.sp3" "C.sp3-F-O.sp3") ("C.sp3-F" "C.sp3") ("C.sp3-O.sp3" "C.sp3") ("F-O.sp3" "C.sp3-N.sp3") ("C.sp3" "C.sp3"))
Alternatively, we could reduce this to a list of strings, one for each atom, by simply joining the strings for each shell of a given atom
user=> (map #(str-join "-" %) (map #(make-atom-fp mol % 2) (get-atom-list mol)))
("N.sp3-C.sp3-F-O.sp3" "C.sp3-F-C.sp3" "C.sp3-O.sp3-C.sp3" "F-O.sp3-C.sp3-N.sp3" "C.sp3-C.sp3")