Archive for the ‘cheminformatics’ Category
With the profusion of chemical information on the web – in the form of chemical names, images of structures, specific codes (InChI etc), it’s sometimes very useful to be able to seamlessly retrieve some extra information while browsing a page that contains such entities. The usual way is to copy the InChI/SMILES/CAS/name string and paste into Pubchem, Chemspider and so on.
However, a much smoother way is now available via a Firefox extension, called NCATSFind, developed by my colleague. It’s a one click install and once installed, automatically identifies a variety of chemical id codes (CAS number, InChI, UNII) and when such entities are identified uses a variety of backend services to provide context. In addition, it has a cool feature that lets you select an image and generate a structure (using OSRA in the background).
Check out his blog post for more details.
Applications Invited for CSA Trust Grants for 2014.
The Chemical Structure Association (CSA) Trust is an internationally recognized organization established to promote the critical importance of chemical information to advances in chemical research. In support of its charter, the Trust has created a unique Grant Program and is currently inviting the submission of grant applications for 2014.
Purpose of the Grants:
The Grant Program has been created to provide funding for the career development of young researchers who have demonstrated excellence in their education, research or development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and compounds. One or more Grants will be awarded annually up to a total combined maximum of ten thousand U.S. dollars ($10,000). Grants are awarded for specific purposes, and within one year each grantee is required to submit a brief written report detailing how the grant funds were allocated. Grantees are also requested to recognize the support of the Trust in any paper or presentation that is given as a result of that support.
Who is Eligible?
Applicant(s), age 35 or younger, who have demonstrated excellence in their chemical information related research and who are developing careers that have the potential to have a positive impact on the utility of chemical information relevant to chemical structures, reactions and compounds, are invited to submit applications. While the primary focus of the Grant Program is the career development of young researchers, additional bursaries may be made available at the discretion of the Trust. All requests must follow the application procedures noted below and will be weighed against the same criteria.
Which Activities are Eligible?
Grants may be awarded to acquire the experience and education necessary to support research activities; e.g. for travel to collaborate with research groups, to attend a conference relevant to one’s area of research, to gain access to special computational facilities, or to acquire unique research techniques in support of one’s research
Applications must include the following documentation:
- A letter that details the work upon which the Grant application is to be evaluated as well as details on research recently completed by the applicant;
- The amount of Grant funds being requested and the details regarding the purpose for which the Grant will be used (e.g. cost of equipment, travel expenses if the request is for financial support of meeting attendance, etc.). The relevance of the above-stated purpose to the Trust’s objectives and the clarity of this statement are essential in the evaluation of the application);
- A brief biographical sketch, including a statement of academic qualifications;
- Two reference letters in support of the application. Additional materials may be supplied at the discretion of the applicant only if relevant to the application and if such materials provide information not already included in items 1-4. Three copies of the complete application document must be supplied for distribution to the Grants Committee.
Deadline for Applications:
Applications must be received no later than March 28, 2014. Successful applicants will be notified no later than May 2, 2014.
Address for Submission of Applications:
Three copies of the application documentation should be forwarded to: Bonnie Lawlor, CSA Trust Grant Committee Chair, 276 Upper Gulph Road, Radnor, PA 19087, USA. If you wish to enter your application by e-mail, please contact Bonnie Lawlor at email@example.com prior to submission so that she can contact you if the e-mail does not arrive.
I came across a recent paper from the Tropsha group that discusses the issue of modelability – that is, can a dataset (represented as a set of computed descriptors and an experimental endpoint) be reliably modeled. Obviously the definition of reliable is key here and the authors focus on a cross-validated classification accuracy as the measure of reliability. Furthermore they focus on binary classification. This leads to a simple definition of modelability – for each data point, identify whether it’s nearest neighbor is in the same class as the data point. Then, the ratio of number of observations whose nearest neighbor is in the same activity class to the number observations in that activity class, summed over all classes gives the MODI score. Essentially this is a statement on linear separability within a given representation.
The authors then go show a pretty good correlation between the MODI scores over a number of datasets and their classification accuracy. But this leads to the question – if one has a dataset and associated modeling tools, why compute the MODI? The authors state
we suggest that MODI is a simple characteristic that can be easily computed for any dataset at the onset of any QSAR investigation
I’m not being rigorous here, but I suspect for smaller datasets the time requirements for MODI calculations is pretty similar to building the models themselves and for very large datasets MODI calculations may take longer (due to the requirement of a distance matrix calculation – though this could be alleviated using ANN or LSH). In other words – just build the model!
Another issue is the relation between MODI and SVM classification accuracy. The key feature of SVMs is that they apply the kernel trick to transform the input dataset into a higher dimensional space that (hopefully) allows for better separability. As a result MODI calculated on the input dataset should not necessarily be related to the transformed dataset that is actually operated on by the SVM. In other words a dataset with poor MODI could be well modeled by an SVM using an appropriate kernel.
The paper, by definition, doesn’t say anything about what model would be best for a given dataset. Furthermore, it’s important to realize that every dataset can be perfectly predicted using a sufficiently complex model. This is also known as an overfit model. The MODI approach to modelability avoids this by considering a cross-validated accuracy measure.
One application of MODI that does come to mind is for feature selection - identify a descriptor subset that leads to a predictive model. This is justified by the observed correlation between the MODI scores and the observed classification rates and would avoid having to test feature subsets with the modeling algorithm itself. An alternative application (as pointed out by the authors) is to identify subsets of the data that exhibit a good MODI score, thus leading to a local QSAR model.
More generally, it would be interesting to extend the concept to regression models. Intuitively, a dataset that is continuous in a given representation should have a better modelability than one that is discontinuous. This is exactly the scenario that can be captured using the activity landscape approach. Sometime back I looked at characterizing the roughness of an activity landscape using SALI and applied it to the feature selection problem – being able to correlate such a measure to predictive accuracy of models built on those datasets could allow one to address modelability (and more specifically, what level of continuity should a landscape present to be modelable) in general.
Version 3.5.2 of the fingerprint package has been pushed to CRAN. This update includes a contribution from Abhik Seal that significantly speeds up similarity matrix calculations using the Tanimoto metric.
His patch led to a 10-fold improvement in running time. However his code involved the use of nested for loops in R. This is a well known bottleneck and most idiomatic R code replaces for loops with a member of the sapply/lapply/tapply family. In this case however, it was easier to write a small piece of C code to perform the loops, resulting in a 4- to 6-fold improvement over Abhiks observed running times (see figure summarizing Tanimoto similarity matrix calculation for 1024 bit fingerprints, with 256 bits randomly selected to be 1). As always, the latest code is available on Github.
I’ve pushed updates to the rcdklibs and rcdk packages that support cheminformatics in R using the CDK. The new versions employ the latest CDK master, which as Egon pointed out has significantly fewer bugs, and thanks to Jon, improved performance. New additions to the package include support for the LINGO and Signature fingerprinters (you’ll need the latest version of fingerprint).