So much to do, so little time

Trying to squeeze sense out of chemical data

SLAS 2017: Let There Be Light: Informatics Approaches to Exploring the Dark Genome

I’m organizing a symposium at the 2017 SLAS meeting in Washington, D.C., in the Data Analysis and Informatics track. The symposium focuses on informatics approaches that explore and shed light on the dark genome. The description is given below, and I hope you’ll consider submitting an abstract.

With efforts such as the NIH-funded Illuminating the Druggable Genome (IDG) program, there is great interest and a pressing need to understand the “dark genome” — the subset of genes that have little to no information about them in the literature or databases. This session will focus on current efforts by members of the IDG program and the community in general on developing informatics resources for data aggregation and integration, target prioritization and platform development. In addition, topics such as characterization of druggability and novel approaches to connecting heterogeneous datasets that allow us to shed light on the dark genome will be considered.

The deadline for abstract submissions is Aug 8, 2016.

Written by Rajarshi Guha

May 11th, 2016 at 3:46 pm

Posted in Uncategorized

Call For Papers: Shedding Light on the Dark Genome – Methods, Tools & Case Studies

252nd ACS National Meeting
Philadelphia, Aug 21-25, 2016
CINF Division

Dear Colleagues, we are organizing a symposium at the Fall ACS meeting in Philadelphia focusing on computational, experimental and hybrid approaches to characterizing the unstudied and understudied druggable genome. In 2014 the NIH initiated a program titled “Illuminating the Druggable Genome” (IDG) with the goal of improving our understanding of the properties and functions of proteins that are currently unannotated within the four most commonly drug-targeted protein families: GPCRs, ion channels, nuclear receptors and kinases. As part of this program a Knowledge Management Center (KMC) was formed as a collaboration between six academic centers. Its goal is to develop an integrative informatics platform to collect data, develop data-driven prioritization schemes and analytical methods, and disseminate standardized and annotated information related to the unannotated proteins in the four gene families of interest.

In this symposium, members of the various components of the IDG program will present the results of ongoing work related to experimental methods, target prioritization, data aggregation and platform development. In addition, we welcome contributions related to the identification of druggable targets, approaches to quantifying druggability and novel approaches to integrating disparate data sources with the goal of shedding light on the “dark genome”.

The deadline for abstract submissions is March 29, 2016. All abstracts should be submitted via MAPS. If you have any questions, feel free to contact Tudor or me.

Rajarshi Guha

Tudor Oprea
University of New Mexico

Written by Rajarshi Guha

February 22nd, 2016 at 4:00 pm

Posted in Uncategorized

Applications Invited for CSA Trust Grant for 2016

The Chemical Structure Association (CSA) Trust is an internationally recognized organization established to promote the critical importance of chemical information to advances in chemical research. In support of its charter, the Trust has created a unique Grant Program and is now inviting the submission of grant applications for 2016.

Purpose of the Grants:

The Grant Program has been created to provide funding for the career development of young researchers who have demonstrated excellence in their education, research or development activities that are related to the systems and methods used to store, process and retrieve information about chemical structures, reactions and compounds. One or more Grants will be awarded annually up to a total combined maximum of ten thousand U.S. dollars ($10,000). Grantees have the option of payments being made in U.S. dollars or in British Pounds equivalent to the U.S. dollar amount. Grants are awarded for specific purposes, and within one year each grantee is required to submit a brief written report detailing how the grant funds were allocated. Grantees are also requested to recognize the support of the Trust in any paper or presentation that is given as a result of that support.

Who is Eligible?

Applicant(s), age 35 or younger, who have demonstrated excellence in their chemical information related research and who are developing careers that have the potential to have a positive impact on the utility of chemical information relevant to chemical structures, reactions and compounds, are invited to submit applications. While the primary focus of the Grant Program is the career development of young researchers, additional bursaries may be made available at the discretion of the Trust. All requests must follow the application procedures noted below and will be weighed against the same criteria.

Which Activities are Eligible?

Grants may be awarded to acquire the experience and education necessary to support research activities; e.g. for travel to collaborate with research groups, to attend a conference relevant to one’s area of research (including the presentation of an already-accepted research paper), to gain access to special computational facilities, or to acquire unique research techniques in support of one’s research.

Application Requirements:

Applications must include the following documentation:

  1. A letter that details the work upon which the Grant application is to be evaluated as well as details on research recently completed by the applicant;
  2. The amount of Grant funds being requested and the details regarding the purpose for which the Grant will be used (e.g. cost of equipment, travel expenses if the request is for financial support of meeting attendance, etc.). The relevance of the above-stated purpose to the Trust’s objectives and the clarity of this statement are essential in the evaluation of the application;
  3. A brief biographical sketch, including a statement of academic qualifications;
  4. Two reference letters in support of the application.

Additional materials may be supplied at the discretion of the applicant only if relevant to the application and if such materials provide information not already included in items 1-4.

A copy of the completed application document must be supplied for distribution to the Grants Committee and can be submitted via regular mail or e-mail to the Committee Chair (see contact information below).

Deadline for Applications:

Application deadline for the 2016 Grant is March 25, 2016. Successful applicants will be notified no later than May 2, 2016.

Address for Submission of Applications:

The application documentation should be forwarded via post or e-mail to: Bonnie Lawlor, CSA Trust Grant Committee Chair, 276 Upper Gulph Road, Radnor, PA 19087, USA. If you wish to submit your application by e-mail, please contact Bonnie Lawlor prior to submission so that she can contact you if the e-mail does not arrive.

Written by Rajarshi Guha

January 26th, 2016 at 7:36 pm

Posted in Uncategorized

Learning Representations – Digits, Cats and Now Molecules

Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from YouTube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as better theoretical understanding.

It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task for which they are trained. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al. have published an extensive and excellent review of this topic, and Baldi also has a nice review on deep learning.
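To make the "learn a representation without labels, then hand it to any classifier" idea concrete, here is a minimal sketch. It is not deep learning: as a stand-in for a deep network it uses a single linear feature (the leading principal direction, found by power iteration), and the classifier is a simple nearest-centroid rule. The data, function names and the choice of both techniques are illustrative assumptions, not anything taken from the papers mentioned above.

```python
def fit_representation(X, iters=100):
    """Learn a 1-D linear code from unlabeled rows of X.

    Stand-in for a deep network: the leading principal direction,
    found by power iteration on the centered data. No labels used.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in X]
    w = [1.0] * d  # arbitrary starting direction (assumed not orthogonal to the answer)
    for _ in range(iters):
        scores = [sum(r[j] * w[j] for j in range(d)) for r in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    return mean, w


def encode(x, mean, w):
    """Map a raw feature vector to its learned 1-D code."""
    return sum((x[j] - mean[j]) * w[j] for j in range(len(w)))


def fit_centroids(codes, labels):
    """Supervised step: average code per class."""
    sums, counts = {}, {}
    for c, y in zip(codes, labels):
        sums[y] = sums.get(y, 0.0) + c
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(code, centroids):
    """Assign the class whose centroid is closest in code space."""
    return min(centroids, key=lambda y: abs(code - centroids[y]))


# Toy, made-up data: two well-separated groups of 2-D feature vectors.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]]
labels = [0, 0, 0, 1, 1, 1]

mean, w = fit_representation(X)            # unsupervised: labels not touched
codes = [encode(x, mean, w) for x in X]
centroids = fit_centroids(codes, labels)   # supervised: labels used only here

print(predict(encode([0.1, 0.2], mean, w), centroids))
print(predict(encode([5.1, 5.0], mean, w), centroids))
```

The point of the split is that `fit_representation` never sees `labels`, so the same codes could be reused for a different downstream task by refitting only the last step.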

In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor their approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al., where they develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.
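As a rough illustration of what turning an undirected molecular graph into DAGs can look like, the sketch below picks one atom as a root and orients every bond toward it using breadth-first distances. This is my own simplified reading, not the paper's exact procedure: the tie-breaking rule for bonds between atoms at equal distance from the root (which occurs in odd-membered rings), the function names, and the toy adjacency lists are all assumptions. It also assumes the graph is connected.

```python
from collections import deque


def bfs_distances(adj, root):
    """Shortest-path distance (in bonds) from root to every atom."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def orient_toward(adj, root):
    """Return directed edges (child, parent) pointing toward the root.

    Each bond is directed from the atom with the larger (distance, index)
    key to the one with the smaller key. Because that key is a strict
    total order, the result cannot contain a directed cycle.
    """
    dist = bfs_distances(adj, root)
    edges = set()
    for u in adj:
        for v in adj[u]:
            if (dist[u], u) > (dist[v], v):
                edges.add((u, v))
    return sorted(edges)


# Heavy atoms of ethanol as a toy graph: C0 - C1 - O2
ethanol = {0: [1], 1: [0, 2], 2: [1]}
# Cyclopropane: a three-membered ring, to exercise the tie-breaking rule
cyclopropane = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

# One DAG per choice of root atom, so the number of DAGs grows with
# the size of the molecule rather than being fixed in advance.
ethanol_dags = {root: orient_toward(ethanol, root) for root in ethanol}
for root, edges in ethanol_dags.items():
    print(root, edges)
print(orient_toward(cyclopropane, 0))
```

Each rooted DAG has the same atoms and bonds as the input, only with a direction imposed, which is what lets a recursive network process the atoms in a well-defined order.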

A key motivation for the new graph representation and deep learning approach was the observation that

one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction

A related motivation was that they wanted to apply deep learning methods directly to the molecular graph, which, in general, is of variable size, unlike fixed-length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:

  • The motivation for the DAG-based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that not all the physical aspects of aqueous solubility are captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations (QM descriptors) may be too time consuming for a cheminformatics setting.
  • The results are not impressive compared to the pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R² of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). But R² is usually not a good metric for non-linear regression, and even the RMSE is only 0.03 units better than the pre-existing model. However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive. This is evident in the case of the intrinsic solubility dataset, where they achieve similar results to the prior model, even though the prior model employed a manually selected set of topological descriptors.
  • It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
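On the R² versus RMSE comparison in the second point above, a small worked example shows why I would not lean on R² alone. The numbers below are made up purely for illustration (they are not from the paper): both sets of predictions miss every observation by exactly 0.5 units, so the RMSE is identical, yet R² differs considerably because it also depends on how spread out the observed values are.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the response."""
    sq = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed values with a narrow and a wide range.
narrow_obs = [1.0, 2.0, 3.0]
wide_obs = [1.0, 5.0, 9.0]

# Every prediction is off by the same 0.5 units in both cases.
narrow_pred = [y + 0.5 for y in narrow_obs]
wide_pred = [y + 0.5 for y in wide_obs]

print(rmse(narrow_obs, narrow_pred), r_squared(narrow_obs, narrow_pred))
print(rmse(wide_obs, wide_pred), r_squared(wide_obs, wide_pred))
```

With identical absolute errors, the wide-range dataset gets a much higher R², so a 0.01 difference in R² between two models says little without the RMSE alongside it.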

One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.

Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.

Written by Rajarshi Guha

July 2nd, 2013 at 2:41 am

CINF Webinar: Practical cheminformatics workflows with mobile apps

I’m pleased to announce that the ACS Division of Chemical Information will be hosting a series of free webinars on topics related to chemical information. The webinars will be open to everybody and our first speaker in this series will be Dr. Alex Clark, who’ll be talking about cheminformatics workflows and mobile applications. More details below.

Webinar: Practical cheminformatics workflows with mobile apps
Date: October 3, 2012
Time: 11 am Eastern time (US)


In recent years smartphones and tablets have attained sufficient power and sophistication to replace conventional desktop and laptop computers for many tasks. Chemistry software is late to the party, but rapidly catching up. This webinar will explore some of the cheminformatics functionality that can currently be performed using mobile apps. A number of workflow scenarios will be discussed, such as: creating and maintaining chemical data (molecules, reactions, numbers & text); searching chemical databases and utilising the results; structure-aware lab notebooks; visualisation and structure-activity analysis; property calculation using remote webservices; and a multitude of ways to share data collaboratively, and integrate modular apps within distributed and heterogeneous workflows.


Alex M. Clark graduated from the University of Auckland, New Zealand, with a Ph.D. in synthetic organometallic chemistry, then went on to work in computational chemistry. His chemistry background spans both the lab bench and development of software for a broad variety of 2D and 3D computer aided molecular design algorithms and user interfaces. He is the founder of Molecular Materials Informatics, Inc., which is dedicated to producing next-generation cheminformatics software for emerging platforms such as mobile devices and cloud computing environments.

Written by Rajarshi Guha

September 27th, 2012 at 10:00 pm

Posted in Uncategorized
