So much to do, so little time

Trying to squeeze sense out of chemical data

Learning Representations – Digits, Cats and Now Molecules

with 3 comments

Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from Youtube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as a better theoretical understanding.

It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task for which they are eventually used. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al have published an extensive and excellent review of this topic, and Baldi also has a nice review on deep learning.
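
To make this concrete, here is a minimal sketch of the “learn a representation without labels, then hand it to a classifier” pattern, assuming scikit-learn is available. An RBM stands in for the unsupervised feature-learning step and the digits data stand in for molecules; this is only an illustration of the pattern, not the setup used in any of the papers mentioned here.

```python
# Minimal "unsupervised representation -> classifier" sketch (assumes scikit-learn).
# The RBM learns features without using the labels; the classifier sits on top.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel intensities to [0, 1] for the Bernoulli RBM
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```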

In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, though neither their code nor their approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al, who develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.

A key motivation for the new graph representation and deep learning approach was the observation

one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction

A related motivation was their desire to apply deep learning methods directly to the molecular graph, which, in general, is of variable size, in contrast to fixed-length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:

  • The motivation for the DAG based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that all the physical aspects of aqueous solubility may not be captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time-consuming (e.g., QM descriptors) for a cheminformatics setting.
  • The results are not impressive compared to pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNNs. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel). Granted, R2 is not always the best metric for non-linear regression, but even the RMSE is only 0.03 units better than that of the pre-existing model (a small numerical sketch of this comparison follows this list). However, it is certainly true that the unsupervised nature of the representation-learning step is quite attractive – this is evident in the case of the intrinsic solubility dataset, where they achieve results similar to the prior model, even though the prior model employed a manually selected set of topological descriptors.
  • It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.
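
As referenced above, here is a small sketch, with invented numbers rather than the paper’s data, showing how R2 and RMSE are computed for two competing sets of predictions (assuming NumPy and scikit-learn):

```python
# Toy comparison of R^2 and RMSE for two hypothetical models; the numbers are
# invented and are NOT from the Lusci et al. paper.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([-4.2, -3.1, -2.5, -1.8, -0.9, 0.3, 1.1])   # e.g. logS values
y_model_a = y_true + np.random.default_rng(0).normal(0, 0.30, y_true.size)
y_model_b = y_true + np.random.default_rng(1).normal(0, 0.35, y_true.size)

for name, y_pred in [("model A", y_model_a), ("model B", y_model_b)]:
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    print(f"{name}: R2 = {r2_score(y_true, y_pred):.2f}, RMSE = {rmse:.2f}")
```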

One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.

Overall, I don’t think the actual predictions from this approach are particularly impressive. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner, and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could look like, analogous to what a deep network thinks a cat looks like.

Written by Rajarshi Guha

July 2nd, 2013 at 2:41 am

Publications

without comments

The ATR Inhibitor VE-821 in Combination with the Novel Topoisomerase I Inhibitor LMP-400 Selectively Kills Cancer Cells by Disabling DNA Replication Initiation and Fork Elongation
Jossé, R.; Martin, S.E.; Guha, R.; Ormanoglu, P.; Pfister, T.; Morris, J.; Doroshow, J.; Pommier, Y.
Cancer Research, 2014, submitted
Camptothecin, a specific topoisomerase I inhibitor, is a potent anticancer drug, especially against solid tumors. This agent produces well-characterized double-strand breaks upon collision of replication forks with topoisomerase I cleavage complexes. In an attempt to improve its efficacy, we conducted a synthetic lethal siRNA screen using a library that targets nearly 7000 human genes. Depletion of ATR, the main transducer of the replication stress-induced DNA damage response, exacerbated the cytotoxic response to both camptothecin and the indenoisoquinoline LMP-400, a novel class of topoisomerase inhibitors in clinical trial. Inhibition of ATR by the recently developed specific inhibitor VE-821 induced synergistic antiproliferative activity when combined with either topoisomerase inhibitor. Cytotoxicity induced by the combination with LMP-400 was greater than with camptothecin. Using single cell analysis and DNA fiber spreads, we show that VE-821 abrogated the S-phase checkpoint and restored origin firing and replication fork progression. Moreover, the combination of a topoisomerase inhibitor with VE-821 inhibited the phosphorylation of ATR and ATR-mediated Chk1 phosphorylation but strongly induced γH2AX. Single cell analysis revealed that the γH2AX pattern changed over time from well-defined foci to pan-nuclear staining. The change in γH2AX pattern can be useful as a predictive biomarker to evaluate the efficacy of therapy. The key implication of our work is the clinical rationale it provides to evaluate the combination of indenoisoquinoline topoisomerase I inhibitors with ATR inhibitors.
On the Validity versus Utility of Activity Landscapes: Are All Activity Cliffs Statistically Significant?
Guha, R.; Medina-Franco, J.L.
J. Cheminf., 2014, in press
An Overview of the Challenges in Designing, Integrating, and Delivering BARD: A Public Chemical-Biology Resource and Query Portal for Multiple Organizations, Locations, and Disciplines
de Souza, A.; Bittker, J.; Lahr, D.; Brudz, S.; Chatwin, S.; Oprea, T.I.; Waller, A.; Yang, A.; Southall, N.; Guha, R.; Schurer, S.; Vempati, U.; Southern, M.R.; Dawson, E.S.; Clemons, P.A.; Chung, T.D.Y.
J. Biomol. Screen., 2014, in press
High-throughput combinatorial screening identifies drugs that cooperate with ibrutinib to kill ABC diffuse large B cell lymphoma cells
Mathews, L.; Guha, R.; Shinn, P.; Young, R.A.; Keller, J.; Liu, D.; Goldlust, I.S.; Yasgar, A.; McKnight, C.; Boxer, M.B.; Duveau, D.; Jiang, J.K.; Michael, S.; Mierzwa, T.; Huang, W.; Walsh, M.J.; Mott, B.T.; Patel, P.R.; Leister, W.; Maloney, D.J.; LeClair, C.A.; Rai, G.; Jadhav, A.; Peyser, B.D.; Austin, C.P.; Martin, S.; Simeonov, A.; Ferrer, M.; Staudt, L.M.; Thomas, C.J.
Proc. Nat. Acad. Sci., 2013, in press

[ Abstract ]
[DOI 10.1073/pnas.1311846111 ]

The clinical development of drug combinations is typically achieved through trial-and-error or via insight gained through a detailed molecular understanding of dysregulated signaling pathways in a specific cancer type. Unbiased small molecule combination (matrix) screening represents a high-throughput means to explore hundreds and even thousands of drug-drug pairs for potential investigation and translation. Here, we describe a high-throughput screening platform capable of testing compounds in pair-wise matrix blocks for the rapid and systematic identification of synergistic, additive and antagonistic drug combinations. Experimental details are provided for this platform including the software codes for a novel compound dispensing methodology and a web-based data interface. We utilize this platform to conduct a combination screen to determine drug-drug combinations for the Bruton’s tyrosine kinase (BTK) inhibitor ibrutinib (PCI-32765) against the activated B-cell-like subtype of diffuse large B-cell lymphoma (ABC DLBCL). The results of this study highlight a striking level of synergy/additivity between ibrutinib and inhibitors of the PI3K-AKT-mTOR signaling cascade including the PI3K inhibitor BKM-120, the AKT inhibitor MK-2206 and the mTOR inhibitor everolimus. We also found that ibrutinib had strong combination responses with chemotherapeutic components of the current standards of care for DLBCL including doxorubicin, gemcitabine and docetaxel.
Inhibition of Ceramide Metabolism Sensitizes Human Leukemia Cells to Inhibition of BCL2-like Proteins
Casson, L.; Howell, L.; Mathews, L.A.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.M.; Thomas, C.; Varmus, H.; Siskind, L.J.; Beverly, L.J.
PLoS One, 2013, 8, e54525

[ Abstract ]
[ Link ]

The identification of novel combinations of effective cancer drugs is required for the successful treatment of cancer patients for a number of reasons. First, many “cancer specific” therapeutics display detrimental patient side-effects and second, there are almost no examples of single agent therapeutics that lead to cures. One strategy to decrease both the effective dose of individual drugs and the potential for therapeutic resistance is to combine drugs that regulate independent pathways that converge on cell death. BCL2-like family members are key proteins that regulate apoptosis. We conducted a screen to identify drugs that could be combined with an inhibitor of anti-apoptotic BCL2-like proteins, ABT-263, to kill human leukemia cells lines. We found that the combination of D,L-threo-1-phenyl-2-decanoylamino-3-morpholino-1-propanol (PDMP) hydrochloride, an inhibitor of glucosylceramide synthase, potently synergized with ABT-263 in the killing of multiple human leukemia cell lines. Treatment of cells with PDMP and ABT-263 led to dramatic elevation of two pro-apoptotic sphingolipids, namely ceramide and sphingosine. Furthermore, treatment of cells with the sphingosine kinase inhibitor, SKi-II, also dramatically synergized with ABT-263 to kill leukemia cells and similarly increased ceramides and sphingosine. Data suggest that synergism with ABT-263 requires accumulation of ceramides and sphingosine, as AMP-deoxynojirimycin, (an inhibitor of the glycosphingolipid pathway) did not elevate ceramides or sphingosine and importantly did not sensitize cells to ABT-263 treatment. Taken together, our data suggest that combining inhibitors of anti-apoptotic BCL2-like proteins with drugs that alter the balance of bioactive sphingolipids will be a powerful combination for the treatment of human cancers.
Profile of the GSK Published Protein Kinase Inhibitor Set Across ATP-dependent and-independent Luciferases: Implications for Reporter-gene Assays
Dranchak, P.; MacArthur, R.; Guha, R.; Zuercher, W.J.; Drewry, D.H.; Auld, D.S.; Inglese, J.
PLoS One, 2013, 8,
Genome-wide high-content RNAi screens identify regulators of Parkin upstream of mitophagy
Hasson, S.; Kane, L.; Sliter, D.; Hessa, T.; Wang, C.; Buehler, E.; Guha, R.; Martin, S.; Yamano, K.; Huang, C.H.; Heman-Ackah, S.; Youle, R.
Nature, 2013, 504, 291-295

[ Abstract ]
[DOI 10.1038/nature12748 ]

An increasing body of evidence points to mitochondrial dysfunction as a contributor to the molecular pathogenesis of neurodegenerative diseases such as Parkinson’s. Recent studies of the PD-associated genes PINK1 and Parkin suggest that they may act in a quality control pathway preventing the accumulation of dysfunctional mitochondria. Here we elucidate regulators impacting Parkin translocation to damaged mitochondria with genome-wide siRNA screens coupled to high-content microscopy. Screening yielded gene candidates involved in diverse cellular processes that were subsequently validated in confirmatory assays. This led to the characterization of TOMM7 as essential for stabilizing PINK1 on the outer mitochondrial membrane following mitochondrial damage. Additionally, we discovered that HSPA1L (an HSP70 family member) and BAG4 play mutually opposing roles in the regulation of Parkin translocation. The screens also revealed that SIAH3, found to localize to mitochondria, inhibits PINK1 accumulation after mitochondrial insult, reducing Parkin translocation. Overall, our screens provide a rich resource to understand mitochondrial quality control.
What are we “tweeting” About Obesity? Mapping Tweets with Topic Modeling and Geographic Information System
Ghosh, D.; Guha, R.
Cartography and GIS, 2013, 40, 90-102

[ Abstract ]
[DOI 10.1080/15230406.2013.776210 ]

Public health related tweets are difficult to identify in large conversational datasets like Twitter.com. Even more challenging is the visualization and analysis of the spatial patterns encoded in tweets. This study addresses the following questions: how can topic modeling be used to identify relevant public health topics such as obesity on Twitter.com? What are the common obesity related themes? What is the spatial pattern of the themes? What are the research challenges of using large conversational datasets from social networking sites? Obesity is chosen as a test theme to demonstrate the effectiveness of topic modeling using Latent Dirichlet Allocation (LDA) and spatial analysis using Geographic Information Systems (GIS). The dataset is constructed from tweets (originating from the United States) extracted from Twitter.com on obesity-related queries. Examples of such queries are `food deserts’, `fast food’, and `childhood obesity’. The tweets are also georeferenced and time stamped. Three cohesive and meaningful themes, namely `childhood obesity and schools’, `obesity prevention’, and `obesity and food habits’, are extracted from the LDA model. The GIS analysis of the extracted themes shows distinct spatial patterns between rural and urban areas, northern and southern states, and between coastal and inland states. Further, relating the themes with ancillary datasets such as the US census and locations of fast food restaurants, based upon the location of the tweets in a GIS environment, opened new avenues for spatial analysis and mapping. Therefore, the techniques used in this study provide a possible toolset for computational social scientists in general, and health researchers in particular, to better understand health problems from large conversational datasets.
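
For readers unfamiliar with topic modeling, the following is a minimal LDA sketch using scikit-learn; the toy “tweets” are invented placeholders, and this is not the pipeline used in the paper.

```python
# Minimal LDA topic-modeling sketch with scikit-learn; the toy "tweets" below
# are invented placeholders, not data from the study.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "childhood obesity rates in schools keep rising",
    "new obesity prevention program launched downtown",
    "fast food and food deserts shape eating habits",
    "school lunch reform targets childhood obesity",
]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(tweets)                       # document-term matrix
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]   # top words per topic
    print(f"topic {k}: {', '.join(top)}")
```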
Targeting IRAK1 as a Novel Therapeutic Approach for Myelodysplastic Syndrome
Rhyasen, G.W.; Bolanos, L.; Fang, J.; Rasch, C.; Jerez, A.; Varney, M.; Wunderlich, M.; Rigolino, C.; Mathews, L.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.; Thomas, C.; Beverly, L.J.; Agostino, C.; Oliva, E.N.; Cuzzola, M.; Maciejewski, J.P.; Mulloy, J.C.; Starczynowski, D.T.
Cancer Cell, 2013, 24, 90-104

[ Abstract ]
[DOI 10.1016/j.ccr.2013.05.006 ]

Myelodysplastic syndromes (MDSs) arise from a defective hematopoietic stem/progenitor cell. Consequently, there is an urgent need to develop targeted therapies capable of eliminating the MDS-initiating clones. We identified that IRAK1, an immune-modulating kinase, is overexpressed and hyperactivated in MDSs. MDS clones treated with a small molecule IRAK1 inhibitor (IRAK1/4-Inh) exhibited impaired expansion and increased apoptosis, which coincided with TRAF6/NF-κB inhibition. Suppression of IRAK1, either by RNAi or with IRAK1/4-Inh, is detrimental to MDS cells, while sparing normal CD34+ cells. Based on an integrative gene expression analysis, we combined IRAK1 and BCL2 inhibitors and found that cotreatment more effectively eliminated MDS clones. In summary, these findings implicate IRAK1 as a druggable target in MDSs.
Large-Scale Screening Identifies a Novel microRNA, miR-15a-3p, which Induces Apoptosis in Human Cancer Cell Lines
Druz, A.; Chen, Y.C.; Guha, R.; Betenbaugh, M.; Martin, S.; Shiloach, J.
RNAi Biology, 2013, 10, 1-14
MicroRNAs (miRNAs) have been found to be involved in cancer initiation, progression and metastasis and, as such, have been suggested as tools for cancer detection and therapy. In this work, a large-scale screening of the complete miRNA mimics library demonstrated that hsa-miR-15a-3p had a pro-apoptotic role in the following human cancer cells: HeLa, AsPC-1, MDA-MB-231, KB3, ME-180, HCT-116 and A549. MiR-15a-3p is a novel member of the pro-apoptotic miRNA cluster, miR-15a/16, which was found to activate caspase-3/7 and to cause viability loss in B/CMBA.Ov cells during preliminary screening. Subsequent microarray and bioinformatics analyses identified the following four anti-apoptotic genes: bcl2l1, naip5, fgfr2 and mybl2 as possible targets for the mmu-miR-15a-3p in B/CMBA.Ov cells. Follow-up studies confirmed the pro-apoptotic role of hsa-miR-15a-3p in human cells by its ability to activate caspase-3/7, to reduce cell viability and to inhibit the expression of bcl2l1 (Bcl-xL) in HeLa and AsPC-1 cells. MiR-15a-3p was also found to reduce viability in HEK293, MDA-MB-231, KB3, ME-180, HCT-116 and A549 cell lines and, therefore, may be considered for apoptosis-modulating therapies in cancers associated with high Bcl-xL expression (cervical, pancreatic, breast, lung and colorectal carcinomas). The capability of hsa-miR-15a-3p to induce apoptosis in these carcinomas may be dependent on the levels of Bcl-xL expression. The use of endogenous inhibitors of Bcl-xL and other anti-apoptotic genes, such as hsa-miR-15a-3p, may provide improved options for apoptosis-modulating therapies in cancer treatment compared with the use of artificial antisense oligonucleotides.
Cisplatin Sensitivity Mediated by WEE1 and CHK1 is Mediated by miR-155 and the miR-15 Family
Pouliot, L.M.; Chen, Y.-C.; Bai, J.; Guha, R.; Martin, S.E.; Gottesman, M.M.; Hall, M.D.
Cancer Research, 2012, 72, 5945-5955
Identification of Mammalian Protein Quality Control Factors by High-throughput Cellular Imaging
Pegoraro, G.; Voss, T.C.; Martin, S.E.; Tuzmen, P.; Guha, R.; Misteli, T.
PLoS One, 2012, 7, e31684

[ Abstract ]
[DOI 10.1371/journal.pone.0031684 ]

Protein Quality Control (PQC) pathways are essential to maintain the equilibrium between protein folding and the clearance of misfolded proteins. In order to discover novel human PQC factors, we developed a high-content, high-throughput cell-based assay to assess PQC activity. The assay is based on a fluorescently tagged, temperature sensitive PQC substrate and measures its degradation relative to a temperature insensitive internal control. In a targeted siRNA screen of 1591 genes involved in the Ubiquitin-Proteasome System (UPS) we identified 25 of the 33 genes encoding 26S proteasome subunits and discovered several novel PQC factors. An unbiased genome-wide siRNA screen revealed the protein translation machinery, and in particular the EIF3 translation initiation complex, as a novel key modulator of misfolded protein stability. These results represent a comprehensive unbiased survey of human PQC components and establish an experimental tool for the discovery of genes that are required for the degradation of misfolded proteins under conditions of proteotoxic stress.

High-Throughput Screening For Genes That Prevent Excess DNA Replication In Human Cells And For Molecules That Inhibit Them
Lee, C.; Johnson, R.L.; Wichterman-Kouznetsova, J.; Guha, R.; Ferrer, M.; Tuzmen, P.; Martin, S.; Zhu, W.; Depamphilis, M.L.
Methods, 2012, 57, 234-248

[ Abstract ]
[DOI 10.1016/j.ymeth.2012.03.031 ]

High-throughput screening (HTS) provides a rapid and comprehensive approach to identifying compounds that target specific biological processes as well as genes that are essential to those processes. Here we describe a HTS assay for small molecules that induce either DNA re-replication or endoreduplication (i.e. excess DNA replication) selectively in cells derived from human cancers. Such molecules will be useful not only to investigate cell division and differentiation, but they may provide a novel approach to cancer chemotherapy. Since induction of DNA re-replication results in apoptosis, compounds that selectively induce DNA re-replication in cancer cells without doing so in normal cells could kill cancers in vivo without preventing normal cell proliferation. Furthermore, the same HTS assay can be adapted to screen siRNA molecules to identify genes whose products restrict genome duplication to once per cell division. Some of these genes might regulate the formation of terminally differentiated polyploid cells during normal human development, whereas others will prevent DNA re-replication during each cell division. Based on previous studies, we anticipate that one or more of the latter genes will prove to be essential for proliferation of cancer cells but not for normal cells, since many cancer cells are deficient in mechanisms that maintain genome stability.
Cheminformatics, the Computer Science of Chemical Discovery Turning Open Source
Sterling, A.; Wegner, J.K.; Guha, R.; Bender, A.; Faulon, J.; Hastings, J.; O’Boyle, N.; Overington, J.P.; Vlijmen, H.V.; Willighagen, E.
Comm. ACM, 2012, 55, 65-75

[ Abstract ]
[DOI 10.1145/2366316.2366334 ]

One of the most prominent success stories in all the sciences over the last decade has been the advance of bioinformatics: the interdisciplinary collaboration between computer scientists and molecular biologists that led to the sequencing of the human genome and other accomplishments. However, few computer scientists are familiar with a related discipline: cheminformatics, the use of computers to represent the structures of small molecules and analyze their properties. Cheminformatics has wide applicability, from drug discovery to agrochemicals and materials design. While researchers in both academia and industry have made important contributions to this field for decades, new and exciting collaborative opportunities have arisen from an “opening” of data and software as an effect of changing mindsets, policy changes, and chemists volunteering time for “Open Science”. Researchers have gained access to freely available open source software packages and open databases of tens of millions of chemicals, allowing academic chemists to confront a variety of algorithmic problems whose solutions will be critical to addressing current challenges, ranging from determining the behavior of small molecules in biological pathways to finding therapies for rare and neglected diseases. In this paper, we give a broad overview of the field of cheminformatics with a focus on open questions and challenges.
Exploring Uncharted Territories — Predicting Activity Cliffs in Structure-Activity Landscapes
Guha, R.
J. Chem. Inf. Model., 2012, 52, 2181-2191

[ Abstract ]
[DOI 10.1021/ci300047k ]

The notion of activity cliffs is an intuitive approach to characterizing structural features that play a key role in modulating the biological activity of a molecule. A variety of methods have been described to quantitatively characterize activity cliffs, such as SALI and SARI. However, these methods are primarily retrospective in nature, highlighting cliffs that are already present in the dataset. The current study focuses on employing a pairwise characterization of a dataset to train a model to predict whether a new molecule will exhibit an activity cliff with one or more members of the dataset. The approach is based on predicting a value for pairs of objects rather than for the individual objects themselves (and thus allows for robust models even for small structure-activity relationship datasets). We extracted structure-activity data for several ChEMBL assays and developed random forest models to predict SALI values from pairwise combinations of molecular descriptors. The models exhibited reasonable RMSEs, though, surprisingly, performance on the more significant cliffs tended to be better than on the lesser ones. While the models do not exhibit very high levels of accuracy, our results indicate that they are able to prioritize molecules in terms of their ability to form activity cliffs, thus serving as a tool to prospectively identify activity cliffs.
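
A hedged sketch of the pairwise formulation described above, using scikit-learn on random stand-in data; the similarity measure and pair featurization here are illustrative assumptions, not necessarily those used in the paper.

```python
# Sketch of the pairwise formulation: represent each molecule pair by a
# combination of the two descriptor vectors and regress against SALI.
# Random data stand in for ChEMBL descriptors/activities.
import numpy as np
from itertools import combinations
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_mols, n_desc = 50, 8
X = rng.normal(size=(n_mols, n_desc))        # per-molecule descriptors
act = rng.normal(size=n_mols)                # per-molecule activities

pair_X, pair_y = [], []
for i, j in combinations(range(n_mols), 2):
    sim = 1.0 / (1.0 + np.linalg.norm(X[i] - X[j]))     # a crude similarity
    sali = abs(act[i] - act[j]) / (1.0 - sim + 1e-6)    # SALI for the pair
    pair_X.append(np.concatenate([np.abs(X[i] - X[j]), X[i] + X[j]]))
    pair_y.append(sali)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(np.array(pair_X), np.array(pair_y))
print("training R^2 on the pairwise data:",
      round(model.score(np.array(pair_X), np.array(pair_y)), 2))
```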
Dealing with the Data Deluge: Handling the Multitude of Chemical Biology Data Sources
Guha, R.; Nguyen, D.-T.; Southall, N.; Jadhav, A.
Curr. Protocols Chem. Biol., 2012, 4, 193-209

[ Abstract ]
[DOI 10.1002/9780470559277.ch110262 ]

Over the last 20 years, there has been an explosion in the amount and type of biological and chemical data that has been made publicly available in a variety of online databases. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information is faced with a daunting task—many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. In this overview, a variety of well-known data sources for chemical and biological information are highlighted, focusing on those most useful for chemical biology research. The issue of using data from multiple sources and the associated problems such as identifier disambiguation are highlighted. A brief discussion is then provided on Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users a simple interface to search across a federated collection of resources.
A Furoxan-Amodiaquine Hybrid as a Potential Therapeutic for Three Parasitic Diseases
Mott, B.T.; Cheng, C.C.; Guha, R.; Kommer, V.P.; Williams, D.L.; Vermeire, J.J.; Cappello, M.; Maloney, D.J.; Rai, G.; Jadhav, A.; Simeonov, A.; Inglese, J.; Posner, G.H; Thomas, C.J.
Med. Chem. Comm., 2012, 3, 1505-1511

[ Abstract ]
[DOI 10.1039/C2MD20238G ]

Parasitic diseases continue to have a devastating impact on human populations worldwide. Lack of effective treatments, the high cost of existing ones, and frequent emergence of resistance to these agents provide a strong argument for the development of novel therapies. Here we report the results of a hybrid approach designed to obtain a dual acting molecule that would demonstrate activity against a variety of parasitic targets. The antimalarial drug amodiaquine has been covalently joined with a nitric oxide-releasing furoxan to achieve multiple mechanisms of action. Using in vitro and ex vivo assays, the hybrid molecule shows activity against three parasites — Plasmodium falciparum, Schistosoma mansoni, and Ancylostoma ceylanicum.
Diversity-Oriented Synthesis Yields a Novel Lead for the Treatment of Malaria
Heidebrecht, R.W.; Mulrooney, C.; Austin, C.P.; Barker, R.H.; Beaudoin, J.A.; Cheng, K.Chih-Chien.; Comer, E.; Dandapani, S.; Dick, J.; Duvall, J.R.; Ekland, E.H.; Fidock, D.A.; Fitzgerald, M.E.; Foley, M.; Guha, R.; Hinkson, P.; Kramer, M.; Lukens, A.K.; Masi, D.; Marcaurelle, L.A.; Su, X.; Thomas, C.J.; Wewer, M.; Wiegand, R.C.; Wirth, D.; Xia, M.; Yuan, J.; Zhao, J.; Palmer, M.; Munoz, B.; Schreiber, S.
ACS Med. Chem. Lett., 2012, 3, 112-117

[ Abstract ]
[ Link ]

Here, we describe the discovery of a novel antimalarial agent using phenotypic screening of Plasmodium falciparum asexual blood-stage parasites. Screening a novel compound collection created using diversity-oriented synthesis (DOS) led to the initial hit. Structure–activity relationships guided the synthesis of compounds having improved potency and water solubility, yielding a subnanomolar inhibitor of parasite asexual blood-stage growth. Optimized compound 27 has an excellent off-target activity profile in erythrocyte lysis and HepG2 assays and is stable in human plasma. This compound is available via the molecular libraries probe production centers network (MLPCN) and is designated ML238.
Exploiting Synthetic Lethality for the Therapy of ABC Diffuse Large B Cell Lymphoma
Yang, Y.; Shaffer, A.; Emre, N. C. Tolga; Ceribelli, M.; Zhang, M.; Wright, G.; Xiao, W.; Powell, J.; Platig, J.; Kohlhammer, H.; Young, R.; Zhao, H.; Yang, Y.; Xu, W.; Buggy, J.; Balasubramanian, S.; Mathews, L.; Shinn, P.; Guha, R.; Ferrer, M.; Thomas, C.; Waldmann, T.; Staudt, L.
Cancer Cell, 2012, 21, 723-737

[ Abstract ]
[ Link ]

Exploring Structure-Activity Data Using the Landscape Paradigm
Guha, R.
WIREs Comput. Mol. Sci., 2012, 2, 829-841

[ Abstract ]
[DOI 10.1002/wcms.1087 ]

In this article, we present an overview of the origin and applications of the activity landscape view of structure–activity relationship (SAR) data as conceived by Shanmugasundaram and Maggiora. Within this landscape, different regions exemplify different aspects of SAR trends—ranging from smoothly varying trends to discontinuous trends (also termed activity cliffs). We discuss the various definitions of landscapes and cliffs that have been proposed as well as different approaches to the numerical quantification of a landscape. We then highlight some of the landscape visualization approaches that have been developed, followed by a review of the various applications of activity landscapes and cliffs to topics in medicinal chemistry and SAR analysis.
A 1536-well Quantitative High Throughput Screen to Identify Compounds Targeting Cancer Stem Cells
Mathews, L.A.; Keller, J.M.; Goodwin, B.; Guha, R.; Shinn, P.; Mull, R.; Thomas, C.; de Kluyver, R.; Sayers, T.; Ferrer, M.
J. Biomol. Screen., 2012, 17, 1231-1242

[ Abstract ]
[DOI 10.1177/1087057112458152 ]

Tumor cell subpopulations called cancer stem cells (CSCs) or tumor-initiating cells (TICs) have self-renewal potential and are thought to drive metastasis and tumor formation. Data suggest that these cells are resistant to current chemotherapy and radiation therapy treatments, leading to cancer recurrence. Therefore, finding new drugs and/or drug combinations that cause death of both the differentiated tumor cells as well as CSC populations is a critical unmet medical need. Here, we describe how cancer-derived CSCs are generated from cancer cell lines using stem cell growth media and nonadherent conditions in quantities that enable high-throughput screening (HTS). A cell growth assay in a 1536-well microplate format was developed with these CSCs and used to screen a focused collection of oncology drugs and clinical candidates to find compounds that are cytotoxic against these highly aggressive cells. A hit selection process that included potency and efficacy measurements during the primary screen allowed us to efficiently identify compounds with potent cytotoxic effects against spheroid-derived CSCs. Overall, this research demonstrates one of the first miniaturized HTS assays using CSCs. The procedures described here should enable further testing of the effect of compounds on CSCs and help determine which pathways need to be targeted to kill them.
A Survey of Quantitative Descriptions of Molecular Structure
Guha, R.; Willighagen, E.L.
Curr. Topics Med. Chem., 2012, 12, 1946-1956

[ Abstract ]
[DOI 10.2174/156802612804910278 ]

Numerical characterization of molecular structure is a first step in many computational analyses of chemical structure data. These numerical representations, termed descriptors, come in many forms, ranging from simple atom counts and invariants of the molecular graph to distributions of properties, such as charge, across a molecular surface. In this article we first present a broad categorization of descriptors and then describe applications and toolkits that can be employed to evaluate them. We highlight a number of issues surrounding molecular descriptor calculations, such as versioning and reproducibility, and describe how some toolkits have attempted to address these problems.
Genome-Wide RNAi Screen For Lysosomal Storage Disorders
Velayati, A.; Tuzmen, P.; Guha, R.; Martin, S.; Goldin, E.; Sidransky, E.
Molecular Genetics and Metabolism, 2012, 105, S63
Chemical Genomic Profiling for Antimalarial Therapies, Response Signatures, and Molecular Targets
Yuan, J.; Cheng, K.Chih-Chien.; Johnson, R.L.; Huang, R.; Pattaradilokrat, S.; Liu, A.; Guha, R.; Fidock, D.A.; Inglese, J.; Wellems, T.E.; Austin, C.P.; Su, X.
Science, 2011, 333, 724-729

[ Abstract ]
[DOI 10.1021/ci200081k ]

Malaria remains a devastating disease largely because of widespread drug resistance. New drugs and a better understanding of the mechanisms of drug action and resistance are essential for fulfilling the promise of eradicating malaria. Using high-throughput chemical screening and genome-wide association analysis, we identified 32 highly active compounds and genetic loci associated with differential chemical phenotypes (DCPs), defined as greater than or equal to fivefold differences in half-maximum inhibitory concentration (IC50) between parasite lines. Chromosomal loci associated with 49 DCPs were confirmed by linkage analysis and tests of genetically modified parasites, including three genes that were linked to 96% of the DCPs. Drugs whose responses mapped to wild-type or mutant pfcrt alleles were tested in combination in vitro and in vivo, which yielded promising new leads for antimalarial treatments.
KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries
Saubern, S.; Guha, R.; Baell, J.B.
Mol. Inf., 2011, 30, 847-850
Open Data, Open Source and Open Standards in Chemistry: The Blue Obelisk Five Years On
O’Boyle, N.; Guha, R.; Willighagen, E.; Adams, S.E.; Alvarsson, J.; Bradley, J.C.; Filippov, I.; Hanson, R.M.; Hanwell, M.D.; Hutchison, G.R.; James, C.A.; Jeliazkova, N.; Lang, A.; Langner, K.M.; Lonie, D.C.; Lowe, D.M.; Pansanel, J.; Pavlov, D.; Spjuth, O.; Steinbeck, C.; Tenderholt, A.; Theisen, K.; Murray-Rust, P.
J. Cheminf., 2011, 3,

[ Abstract ]
[DOI 10.1186/1758-2946-3-37 ]

Background
The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.

Results
This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.

Conclusions
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

Exploratory Analysis of Kinetic Solubility Measurements of a Small Molecule Library
Guha, R.; Dexheimer, T.S.; Kestranek, A.N.; Jadhav, A.; Chervenak, A.M.; Ford, M.G.; Simeonov, A.; Roth, G.P.; Thomas, C.J.
Bioorg. Med. Chem., 2011, 19, 4127-4134

[ Abstract ]
[DOI 10.1016/j.bmc.2011.05.005 ]

Kinetic solubility measurements using prototypical assay buffer conditions are presented for a ~58,000 member library of small molecules. Analyses of the data based upon physical and calculated properties of each individual molecule were performed and resulting trends were considered in the context of commonly held opinions of how physicochemical properties influence aqueous solubility. We further analyze the data using a decision tree model for solubility prediction and via a multi-dimensional assessment of physicochemical relationships to solubility in the context of specific ‘rule-breakers’ relative to common dogma. The role of solubility as a determinant of assay outcome is also considered based upon each compound’s cross-assay activity score for a collection of publicly available screening results. Further, the role of solubility as a governing factor for colloidal aggregation formation within a specified assay setting is examined and considered as a possible cause of a high cross-assay activity score. The results of this solubility profile should aid chemists during library design and optimization efforts and represent a useful training set for computational solubility prediction.
RNAi Screening Identifies TAK1 as a Potential Target for the Enhanced Efficacy of Topoisomerase Inhibitors
Martin, S.E.; Wu, Z.H.; Gehlhaus, K.Jones.; Zhang, Y.W.; Guha, R.; Miyamoto, S.; Pommier, Y.; Caplen, N.J.
Curr. Cancer Drug Targets, 2011, 11, 976-986

[ Abstract ]
[DOI 10.1002/cmdc.201100179 ]

In an effort to develop strategies that improve the efficacy of existing anticancer agents, we have conducted a siRNA-based RNAi screen to identify genes that, when targeted by siRNA, improve the activity of the topoisomerase I (Top1) poison camptothecin (CPT). Screening was conducted using a set of siRNAs corresponding to over 400 apoptosis-related genes in MDA-MB-231 breast cancer cells. During the course of these studies, we identified the silencing of MAP3K7 as a significant enhancer of CPT activity. Follow-up analysis of caspase activity and caspase-dependent phosphorylation of histone H2AX demonstrated that the silencing of MAP3K7 enhanced CPT-associated apoptosis. Silencing MAP3K7 also sensitized cells to additional compounds, including CPT clinical analogs. This activity was not restricted to MDA-MB-231 cells, as the silencing of MAP3K7 also sensitized the breast cancer cell line MDA-MB-468 and HCT-116 colon cancer cells. However, MAP3K7 silencing did not affect compound activity in the comparatively normal mammary epithelial cell line MCF10A, as well as some additional tumorigenic lines. MAP3K7 encodes the TAK1 kinase, an enzyme that is central to the regulation of many processes associated with the growth of cancer cells (e.g. NF-kB, JNK, and p38 signaling). An analysis of TAK1 signaling pathway members revealed that the silencing of TAB2 also sensitizes MDA-MB-231 and HCT-116 cells towards CPT. These findings may offer avenues towards lowering the effective doses of Top1 inhibitors in cancer cells and, in doing so, broaden their application.
Improving Usability and Accessibility of Cheminformatics Tools for Chemists Through Cyberinfrastructure and Education
Guha, R.; Wiggins, G.D.; Wild, D.J.; Baik, M.H.; Pierce, M.E.; Fox, G.C.
In Silico Biol., 2011, 11, 41-60
Discovery of New Antimalarial Chemotypes Through Chemical Methodology and Library Development
Brown, L.E.; Chih-Chien Cheng, K.; Wei, W.; Yuan, P.; Dai, P.; Trilles, R.; Ni, F.; Yuan, J.; MacArthur, R.; Guha, R.; Johnson, R.L.; Su, X.; Dominguez, M.M.; Snyder, J.K.; Beeler, A.B.; Schaus, S.E.; Inglese, J.; Porco, J.
Proc. Nat. Acad. Sci., 2011, 108, 6775-6780

[ Abstract ]
[DOI 10.1073/pnas.1017666108 ]

In an effort to expand the stereochemical and structural complexity of chemical libraries used in drug discovery, the Center for Chemical Methodology and Library Development at Boston University has established an infrastructure to translate methodologies accessing diverse chemotypes into arrayed libraries for biological evaluation. In a collaborative effort, the NIH Chemical Genomics Center determined IC50s for Plasmodium falciparum viability for each of 2,070 members of the CMLD-BU compound collection using quantitative high-throughput screening across five parasite lines of distinct geographic origin. Three compound classes displaying either differential or comprehensive antimalarial activity across the lines were identified, and the nascent structure-activity relationships (SAR) from this experiment were used to initiate optimization of these chemotypes for further development.
Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets
Guha, R.; Gilbert, K.; Fox, G.C.; Pierce, M.; Wild, D.; Yuan, H.
Curr. Comp. Aid. Drug Des., 2010, 6, 50-67
In recent years, there has been an explosion in the availability of publicly accessible chemical information, including chemical structures of small molecules, structure-derived properties and associated biological activities in a variety of assays. These data sources present us with a significant opportunity to develop and apply computational tools to extract and understand the underlying structure-activity relationships. Furthermore, by integrating chemical data sources with biological information (protein structure, gene expression and so on), we can attempt to build up a holistic view of the effects of small molecules in biological systems. Equally important is the ability for non-experts to access and utilize state-of-the-art cheminformatics methods and models. In this review we present recent developments in cheminformatics methodologies and infrastructure that provide a robust, distributed approach to mining large and complex chemical datasets. In the area of methodology development, we highlight recent work on characterizing structure-activity landscapes, QSAR model domain applicability and the use of chemical similarity in text mining. In the area of infrastructure, we discuss a distributed web services framework that allows easy deployment and uniform access to computational (statistics, cheminformatics and computational chemistry) methods, data and models. We also discuss the development of PubChem-derived databases and highlight techniques that allow us to scale the infrastructure to extremely large compound collections by use of distributed processing on Grids. Given that the above work is applicable to arbitrary types of cheminformatics problems, we also present some case studies related to virtual screening for anti-malarials and predictions of anti-cancer activity.
Use of Genetic Algorithm and Neural Network Approaches for Risk Factor Selection: A Case Study of West Nile Virus Dynamics in an Urban Environment
Ghosh, D.; Guha, R.
Computers, Environment and Urban Systems, 2010, 34, 189-203

[ Abstract ]
[DOI 10.1016/j.compenvurbsys.2010.02.007 ]

The West Nile virus (WNV) is an infectious disease spreading rapidly throughout the United States, causing illness among thousands of birds, animals, and humans. Yet, we only have a rudimentary understanding of how the mosquito-borne virus operates in complex avian–human environmental systems coupled with risk factors. The large array of multidimensional risk factors underlying WNV incidence includes environmental and built-environment factors, socioeconomic factors, and existing mosquito abatement policies. Therefore, it is essential to identify an optimal number of risk factors whose management would result in effective disease prevention and containment. Previous models built to select important risk factors assumed a priori that there is a linear relationship between these risk factors and disease incidences. However, it is difficult for linear models to incorporate the complexity of the WNV transmission network and hence identify an optimal number of risk factors objectively.
There are two objectives of this paper: first, to use a combination of genetic algorithm (GA) and computational neural network (CNN) approaches to build a model incorporating the non-linearity between incidences and hypothesized risk factors. Here GA is used for risk factor (variable) selection and CNN for model building, mainly because of their ability to capture complex relationships with higher accuracy than linear models. The second objective is to propose a method to measure the relative importance of the selected risk factors included in the model. The study is situated in the metropolitan area of Minnesota, which has experienced significant outbreaks from 2002 to the present.
Towards Interoperable and Reproducible QSAR Analyses: Exchange of Data Sets
Spjuth, O.; Willighagen, E.L.; Guha, R.; Eklund, M.; Wikberg, J.E.S.
J. Cheminf., 2010, 2,

[ Abstract ]
[DOI 10.1186/1758-2946-2-5 ]

Background
QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data.

Results
We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services.

Conclusions
Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend, and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model’s performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
PubChem as a Source of Polypharmacology
Chen, B.; Wild, D.; Guha, R.
J. Chem. Inf. Model., 2009, 49, 2044-2055

[ Abstract ]
[DOI 10.1021/ci9001876 ]

Polypharmacology provides a new way to address the issue of high attrition rates arising from lack of efficacy and toxicity. However, the development of polypharmacology is hampered by incomplete SAR data and limited resources for validating target combinations. The PubChem bioassay collection, reporting the activity of compounds in multiple assays, allows us to study polypharmacological behavior in the PubChem collection via cross-assay analysis. In this paper, we developed a network representation of the assay collection and then applied a bipartite mapping between this network and various biological networks (i.e., PPI, pathway) as well as artificial networks (i.e., drug-target network). Mapping to a drug-target network allows us to prioritize new selective compounds, while mapping to other biological networks enables us to observe interesting target pairs and their associated compounds in the context of biological systems. Our results indicate that this approach could be a useful way to investigate polypharmacology in the PubChem bioassay collection.
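
The following is a toy illustration of the bipartite assay/target mapping idea using networkx (assumed available); the assay and target identifiers are invented placeholders, not data from the paper.

```python
# Toy bipartite assay/target network and its projection onto targets.
import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph()
assays = ["AID-1", "AID-2", "AID-3"]
targets = ["EGFR", "SRC", "ABL1"]
G.add_nodes_from(assays, kind="assay")
G.add_nodes_from(targets, kind="target")

# Edges link an assay to the target(s) it interrogates (invented examples)
G.add_edges_from([("AID-1", "EGFR"), ("AID-2", "EGFR"),
                  ("AID-2", "SRC"), ("AID-3", "ABL1")])

# Project onto targets: two targets are connected if some assay covers both,
# a simple way to surface candidate target pairs for polypharmacology
target_net = bipartite.projected_graph(G, targets)
print(sorted(target_net.edges()))
```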
Chemoinformatic Analysis of Drugs, Natural Products, Molecular Libraries Small Molecule Repository and Combinatorial Libraries
Singh, N.; Guha, R.; Guilianotti, M.; Houghten, R.; Medina-Franco, J.L.
J. Chem. Inf. Model., 2009, 49, 1010-1024

[ Abstract ]
[DOI 10.1021/ci800426u ]

A multiple-criteria approach is presented that is used to perform a comparative analysis of four recently developed combinatorial libraries against drugs, the Molecular Libraries Small Molecule Repository (MLSMR) and natural products. The compound databases were assessed in terms of physicochemical properties, scaffolds, and fingerprints. The approach enables the analysis of property space coverage, degree of overlap between collections, scaffold and structural diversity, and overall structural novelty. The degree of overlap between combinatorial libraries and drugs was assessed using the R-NN curve methodology, which measures the density of chemical space around a query molecule embedded in the chemical space of a target collection. The combinatorial libraries studied in this work exhibit scaffolds that were not observed in the drug, MLSMR, and natural products databases. The fingerprint-based comparisons indicate that these combinatorial libraries are structurally different from current drugs. The R-NN curve methodology revealed that a proportion of the molecules in the combinatorial libraries are located within the property space of the drugs. However, the R-NN analysis also showed that there are a significant number of molecules in several combinatorial libraries that are located in sparse regions of the drug space.
Navigating Structure Activity Landscapes
Bajorath, J.; Peltason, L.; Wawer, M.; Guha, R.; Lajiness, M.S.; van Drie, J.H.
Drug Discov. Today, 2009, 14, 698-705

[ Abstract ]
[DOI 10.1016/j.drudis.2009.04.003 ]

The problem of how to systematically explore structure-activity relationships (SARs) is still largely unsolved in medicinal chemistry. Recently, data analysis tools have been introduced to navigate activity landscapes and assess structure-activity relationships on a large scale. Initial investigations reveal a surprising heterogeneity among SARs and shed light on the relationship between `global’ and `local’ SAR features. Moreover, insights are provided into the fundamental issue of why modeling tools work well in some cases, but not in others.
Pharmacophore Representation and Searching
Guha, R.; Van Drie, J.H.
CDK News, 2008, ASAP

[ Abstract ]
[ Link ]

In this article we describe the design and use of a set of Java classes to represent pharmacophores and use such representations in pharmacophore searching applications.
Assessing How Well a Modeling Protocol Captures a Structure-Activity Landscape
Guha, R.; Van Drie, J.H.
J. Chem. Inf. Model., 2008, 48, 1716-1728

[ Abstract ]
[DOI 10.1021/ci8001414 ]

We introduce the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol, applied to structure-activity relationships. We start from our earlier work [J. Chem. Inf. Model., 2008, 48, 646-658], where we show how to study a structure-activity relationship pairwise, based on the notion of “activity cliffs” – pairs of molecules that are structurally similar but have large differences in activity. There, we also introduced the SALI parameter, which allows one to identify cliffs easily, and which allows one to represent a structure-activity relationship as a graph. This graph orders every pair of molecules by their activity. Here, we introduce the new idea of a SALI curve, which tallies how many of these orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over two-dimensional quantitative structure-activity relationship (2D-QSAR), three-dimensional quantitative structure-activity relationship (3D-QSAR), and structure-based design models, the utility of a model seems to correspond to characteristics of these curves. In particular, the integral of these curves, denoted as SCI and being a number ranging from -1.0 to 1.0, approaches a value of 1.0 for two literature models, which are both known to be prospectively useful.
The Structure-Activity Landscape Index: Identifying and Quantifying Activity-Cliffs
Guha, R.; Van Drie, J.H.
J. Chem. Inf. Model., 2008, 48, 646-658

[ Abstract ]
[DOI 10.1021/ci7004093 ]

A new method for analyzing a structure-activity relationship is proposed. By use of a simple quantitative index, one can readily identify “structure-activity cliffs”: pairs of molecules which are most similar but have the largest change in activity. We show how this provides a graphical representation of the entire SAR, in a way that allows the salient features of the SAR to be quickly grasped. In addition, the approach allows us to view the SARs in a data set at different levels of detail. The method is tested on two data sets that highlight its ability to easily extract SAR information. Finally, we demonstrate that this method is robust using a variety of computational control experiments and discuss possible applications of this technique to QSAR model evaluation.
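
The index divides the activity difference of a pair of molecules by one minus their structural similarity, so similar pairs with large activity differences score highly. A small sketch using RDKit (assumed available) for fingerprints and Tanimoto similarity; the SMILES and activity values below are made up.

```python
# SALI for a pair of molecules: activity difference divided by (1 - similarity).
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def sali(smiles_i, smiles_j, act_i, act_j, eps=1e-6):
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in (smiles_i, smiles_j)]
    sim = DataStructs.TanimotoSimilarity(fps[0], fps[1])
    return abs(act_i - act_j) / (1.0 - sim + eps)   # eps guards identical pairs

# Phenol vs. aniline with made-up pIC50 values
print(sali("c1ccccc1O", "c1ccccc1N", 6.2, 7.9))
```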
A Flexible Web Service Infrastructure for the Development and Deployment of Predictive Models
Guha, R.
J. Chem. Inf. Model., 2008, 48, 456-464

[ Abstract ]
[DOI 10.1021/ci700188u ]

The development of predictive statistical models is a common task in the field of drug design. The process of developing such models involves two main steps: building the model and then deploying the model. Traditionally such models have been deployed using web page interfaces. This approach restricts the user to the specified web page, and using the model in other ways can be cumbersome. In this paper we present a flexible and generalizable approach to the deployment of predictive models, based on a web service infrastructure using R. The infrastructure described allows one to access the functionality of these models using a variety of approaches, ranging from web pages to workflow tools. We highlight the advantages of this infrastructure by developing and subsequently deploying random forest models for two datasets.
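
The paper’s infrastructure is built around R; purely as a language-agnostic illustration of the same “model behind a web service” pattern, here is a minimal Python/Flask sketch (Flask and scikit-learn assumed available), not the code described in the paper.

```python
# Generic "predictive model behind a web service" sketch; endpoint name and
# payload format are illustrative assumptions.
from flask import Flask, jsonify, request
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Train a stand-in model on synthetic data at startup
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # expects JSON like {"descriptors": [[0.1, 0.2, 0.3, 0.4, 0.5], ...]}
    rows = request.get_json()["descriptors"]
    return jsonify(predictions=model.predict(rows).tolist())

if __name__ == "__main__":
    app.run(port=5000)
```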
On the Interpretation and Interpretability of QSAR Models
Guha, R.
J. Comp. Aid. Molec. Des., 2008, 22, 857-871

[ Abstract ]
[DOI 10.1007/s10822-008-9240-5 ]

The goal of a quantitative structure–activity relationship (QSAR) model is to encode the relationship between molecular structure and biological activity or physical property. Based on this encoding, such models can be used for predictive purposes. Assuming the use of relevant and meaningful descriptors, and a statistically significant model, extraction of the encoded structure–activity relationships (SARs) can provide insight into what makes a molecule active or inactive. Such analyses of QSAR models are useful in a number of scenarios, such as suggesting structural modifications to enhance activity, explanation of outliers and exploratory analysis of novel SARs. In this paper we discuss the need for interpretation and provide an overview of the factors that affect the interpretability of QSAR models. We then describe interpretation protocols for different types of models, highlighting the different types of interpretations, ranging from very broad, global trends to very specific, case-by-case descriptions of the SAR, using examples from the training set. Finally, we discuss a number of case studies where workers have provided some form of interpretation of a QSAR model.
Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays
Guha, R.; Schürer, S.C.
J. Comp. Aid. Molec. Des., 2008, 22, 367-384

[ Abstract ]
[DOI 10.1007/s10822-008-9192-9 ]

Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
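
One common way to handle the extreme class imbalance mentioned above is to reweight classes inside the random forest; the following scikit-learn sketch on synthetic data illustrates that idea and is not necessarily the protocol used in the paper.

```python
# Generic sketch: class weighting for a heavily imbalanced "active vs. inactive" split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95, 0.05],
                           random_state=0)   # ~5% minority class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced_subsample",
                             random_state=0).fit(X_tr, y_tr)
print("balanced accuracy:", round(balanced_accuracy_score(y_te, clf.predict(X_te)), 2))
```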
Userscripts for the Life Sciences
Willighagen, E.L.; O’Boyle, N.; Gopalakrishnan, H.; Jiao, D.; Guha, R.; Steinbeck, C.; Wild, D.J.
BMC Bioinformatics, 2007, 8, 487

[ Abstract ]
[DOI 10.1186/1471-2105-8-487 ]

The web has seen an explosion of chemistry
and biology related resources in the last 15 years: thousands of
scientific journals, databases, wikis, blogs and resources are available with
a wide variety of types of information. There is a huge need to aggregate
and organise this information. However, the sheer number of resources makes it
unrealistic to link them all in a centralised manner. Instead,
search engines to find information in those resources flourish, and formal
languages like Resource Description Framework and Web Ontology Language
are increasingly used to allow linking of resources.
A recent development is the use of userscripts to change the appearance of
web pages, by on-the-fly modification of the web content. This opens
possibilities to aggregate information and computational results from
different web resources into the web page of one of those resources.
Chemical Data Mining of the NCI Human Tumor Cell Line Database
Wang, H.; Klinginsmith, J.; Dong, X.; Lee, A.; Guha, R.; Wu, Y.; Crippen, G.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 2063-2076

[ Abstract ]
[DOI 10.1021/ci700141x ]

The NCI Developmental Therapeutics Program Human Tumor cell line data set is a publicly available database that contains cellular assay screening data for over 40 000 compounds tested in 60 human tumor cell lines. The database also contains microarray assay gene expression data for the cell lines, and so it provides an excellent information resource particularly for testing data mining methods that bridge chemical, biological, and genomic information. In this paper we describe a formal knowledge discovery approach to characterizing and data mining this set and report the results of some of our initial experiments in mining the set from a chemoinformatics perspective.
Counting Clusters Using R-NN Curves
Guha, R.; Dutta, D.; Chen, T.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 1308-1318

[ Abstract ]
[DOI 10.1021/ci600541f ]

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for non-hierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722) which uses a nearest neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the dataset which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical datasets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
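
As an aside, the follow-up step described above is easy to sketch in R. This is a minimal illustration on simulated data, not the R-NN implementation from the paper: given a candidate k, run k-means and compute the average silhouette width as the quality check.

    library(cluster)                          # provides silhouette()

    set.seed(42)
    X <- matrix(rnorm(200 * 5), ncol = 5)     # placeholder descriptor matrix
    k <- 4                                    # value suggested by, e.g., the R-NN analysis

    km  <- kmeans(X, centers = k, nstart = 25)
    sil <- silhouette(km$cluster, dist(X))
    mean(sil[, "sil_width"])                  # average silhouette width for this k
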
A Web Service Infrastructure for Chemoinformatics
Dong, X.; Gilbert, K.; Guha, R.; Heiland, R.; Kim, J.; Pierce, M.; Fox, G.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 1303-1307

[ Abstract ]
[DOI 10.1021/ci6004349 ]

The vast increase of pertinent information available to drug discovery scientists means that there is strong demand for tools and techniques for organizing and intelligently mining this information for manageable human consumption. At Indiana University, we have developed an infrastructure of chemoinformatics web services that simplify the access to this information and the computational techniques that can be applied to it. In this paper, we describe this infrastructure, give some examples of its use, and then discuss our plans to use it as a platform for chemoinformatics application development in the future.
Ensemble Feature Selection: Consistent Descriptor Subsets for Multiple QSAR Models
Dutta, D.; Guha, R.; Chen, T.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 989-997

[ Abstract ]
[DOI 10.1021/ci600563w ]

Selecting a small subset of descriptors from a large pool to build a predictive QSAR model is an important step in the QSAR modeling process. In general subset selection is very hard to solve, even approximately, with guaranteed performance bounds. Traditional approaches employ deterministic or stochastic methods to obtain a descriptor subset that leads to an optimal model of a single type (such as linear regression or a neural network). With the development of ensemble modeling approaches, multiple models of differing types are individually developed resulting in different descriptor subsets for each model type. However it is advantageous, from the point of view of developing interpretable QSAR models, to have a single set of descriptors that can be used for different model types. In this paper, we describe an approach to the selection of a single, optimal, subset of descriptors for multiple model types. We apply this approach to three datasets, covering both regression and classification, and show that the constraint of forcing different model types to use the same set of descriptors does not lead to a significant loss in predictive ability for the individual models considered. In addition, interpretations of the individual models developed using this approach indicate that they encode similar structure-activity trends.
Chemical Informatics Functionality in R
Guha, R.
J. Stat. Soft., 2007, 18

[ Abstract ]
[ Link ]

The flexibility and scope of the R programming environment have made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as that contained in molecular descriptors.

We describe the rcdk package that provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors and evaluate fingerprints. In addition, we describe the rpubchem package that allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently the package allows access to structural information as well as some simple molecular properties from PubChem. In addition the package allows access to bio-assay data from the PubChem FTP servers.
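
For readers new to the package, the kind of workflow described above looks roughly like the sketch below; this is a hedged illustration and the exact function names and arguments may differ between rcdk versions.

    library(rcdk)

    # parse a few molecules from SMILES
    mols <- parse.smiles(c("CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"))

    # calculate a handful of molecular descriptors
    desc.names <- get.desc.names("topological")
    descs <- eval.desc(mols, desc.names[1:3])

    # evaluate a fingerprint for the first molecule
    fp <- get.fingerprint(mols[[1]], type = "standard")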

Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions.
Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.
J. Chem. Inf. Model., 2006, 46, 1836-1847

[ Abstract ]
[DOI 10.1021/ci060064e ]

Traditional quantitative structure-activity relationship (QSAR) models aim to capture global structure-activity trends present in a data set. In many situations, there may be groups of molecules which exhibit a specific set of features which relate to their activity or inactivity. Such a group of features can be said to represent a local structure-activity relationship. Traditional QSAR models may not recognize such local relationships. In this work, we investigate the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood, rather than considering the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. In the first case, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units versus 0.92 log units for the global model. However, LLR was able to characterize a specific group of anomalous molecules with much better accuracy (0.64 log units versus 0.70 log units for the global model). For the second data set, the LLR technique resulted in a decrease in RMSE from 0.36 log units to 0.31 log units for the external prediction set. In the third case, we obtained an RMSE of 2.01 log units versus 2.16 log units for the global model. In all cases, LLR led to a few observations being poorly predicted compared to the global model. We present an analysis of why this was observed and possible improvements to the local regression approach.
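
The local idea is easy to illustrate. The sketch below is not the LLR algorithm from the paper, just a toy neighbourhood regression in R: fit a small linear model on the k nearest neighbours of a query compound. It assumes the descriptor matrix has (syntactic) column names matching the names of the query vector, and that k exceeds the number of descriptors.

    # X: descriptor matrix with column names; y: activities; xq: named query vector
    local.predict <- function(X, y, xq, k = 10) {
      d   <- sqrt(rowSums(sweep(X, 2, xq)^2))           # Euclidean distance to the query
      nbr <- order(d)[1:k]                              # indices of the k nearest compounds
      dat <- data.frame(y = y[nbr], as.data.frame(X[nbr, , drop = FALSE]))
      fit <- lm(y ~ ., data = dat)                      # local model on the neighbourhood only
      predict(fit, newdata = as.data.frame(t(xq)))
    }
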
R-NN Curves: An Intuitive Approach to Outlier Detection Using a Distance Based Method
Guha, R.; Dutta, D.; Jurs, P.C; Chen, T.
J. Chem. Inf. Model., 2006, 46, 1713-1722

[ Abstract ]
[DOI 10.1021/ci060013h ]

Libraries of chemical structures are used in a variety of cheminformatics tasks such as virtual screening and QSAR modeling and are generally characterized using molecular descriptors. When working with libraries it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general and outlier detection in particular based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves thus allowing identification of outliers in a single plot.
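
A toy version of the neighbour-counting step is easy to write down in R (an illustration on random data, not the published code): for one compound, count how many other compounds lie within a series of increasing radii.

    rnn.curve <- function(X, i, radii) {
      d <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))    # distances from compound i to all rows
      sapply(radii, function(r) sum(d <= r) - 1)   # "- 1" drops the compound itself
    }

    X      <- matrix(rnorm(500 * 4), ncol = 4)     # placeholder descriptor matrix
    radii  <- seq(0.1, 5, length.out = 30)
    counts <- rnn.curve(X, 1, radii)               # the R-NN curve for the first compound
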
The Blue Obelisk–Interoperability in Chemical Informatics.
Guha, R.; Howard, M.T.; Hutchison, G.R.; Murray-Rust, P.; Rzepa, H.; Steinbeck, C.; Wegner, J.; Willighagen, E.L.
J. Chem. Inf. Model., 2006, 46, 991-998

[ Abstract ]
[DOI 10.1021/ci050400b ]

The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complementary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of chemoinformatics algorithms and implementations drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
Scalable Partitioning and Exploration of Chemical Spaces using Geometric Hashing
Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T.
J. Chem. Inf. Model., 2006, 46, 321-333

[ Abstract ]
[DOI 10.1021/ci050403o ]

Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249,071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
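
The core property (nearby points hash to the same bucket with high probability) can be illustrated with a toy random-projection scheme for Euclidean space; this is not the framework from the paper, and all data below are simulated.

    lsh.hash <- function(X, n.planes = 8, w = 2, seed = 1) {
      set.seed(seed)
      R <- matrix(rnorm(ncol(X) * n.planes), nrow = ncol(X))  # random projection directions
      b <- runif(n.planes, 0, w)                               # random offsets
      H <- floor(sweep(X %*% R, 2, b, "+") / w)                # quantized projections
      apply(H, 1, paste, collapse = "_")                       # one bucket key per compound
    }

    X    <- matrix(rnorm(1000 * 16), ncol = 16)
    keys <- lsh.hash(X)
    which(keys == keys[1])   # candidate near neighbours of compound 1 share its bucket
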
Generating, Using and Visualizing Molecular Information in R
Guha, R.
R News, 2006, 3, 28-33

[ Abstract ]
[ Link ]

Validation of the CDK Surface Area Routine
Guha, R.
CDK News, 2006, 3, 5-9
Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics
Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E.L.
Curr. Pharm. Des., 2006, 12, 2110-2120

[ Abstract ]
[DOI 10.2174/138161206777585274 ]

The Chemistry Development Kit (CDK) provides methods for common tasks in molecular informatics, including 2D and 3D rendering of chemical structures, I/O routines, SMILES parsing and generation, ring searches, isomorphism checking, structure diagram generation, etc. Implemented in Java, it is used both for server-side computational services, possibly equipped with a web interface, as well as for applications and client-side applets. This article introduces the CDK’s new QSAR capabilities and the recently introduced interface to statistical software.
Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases
Guha, R.; Stanton, D.T.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 1109-1121
Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance
Guha, R.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 800-806

[ Abstract ]
[DOI 10.1021/ci050022a ]

We present a method to measure the relative importance of the descriptors present in a QSAR model developed with a computational neural network (CNN). The approach is based on a sensitivity analysis of the descriptors. We tested the method on three published data sets for which linear and CNN models were previously built. The original work reported interpretations for the linear models, and we compare the results of the new method to the importance of descriptors in the linear models as described by a PLS technique. The results indicate that the proposed method is able to rank descriptors such that important descriptors in the CNN model correspond to the important descriptors in the linear model.
Determining the Validity of a QSAR Model–A Classification Approach
Guha, R.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 65-73

[ Abstract ]
[DOI 10.1021/ci0497511 ]

The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previously built QSAR model. In this study we focus on linear regression models only, though the technique is general and could also be applied to other types of quantitative models. Our technique is based on a classification method that divides regression residuals from a previously generated model into a good class and bad class and then builds a classifier based on this division. The trained classifier is then used to determine the class of the residual for a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and a hand built data set. The data sets selected covered both physical and biological properties and also presented the methodology with quantitative regression models of varying quality. The results indicate that this technique can determine whether a new compound will be well or poorly predicted with weighted success rates ranging from 73% to 94% for the best classifier.
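
In outline, the protocol might look like the sketch below; this is a hedged illustration rather than the paper's exact procedure, and `train` and `newcompounds` are hypothetical data frames of descriptors (with the activity in column y). Training residuals are labelled good or bad and a classifier on the descriptors then predicts that label for new compounds.

    library(randomForest)

    fit  <- lm(y ~ ., data = train)                      # previously built regression model
    ares <- abs(residuals(fit))
    lab  <- factor(ifelse(ares < quantile(ares, 0.75), "good", "bad"))

    clf <- randomForest(x = train[, setdiff(names(train), "y")], y = lab)
    predict(clf, newdata = newcompounds)                 # good/bad call for new compounds
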
Using R to Provide Statistical Functionality for QSAR Modeling in CDK
Guha, R.
CDK News, 2005, 2, 7-13

[ Abstract ]
[ Link ]

Using the CDK as a Backend to R
Guha, R.
CDK News, 2005, 2, 2-6

[ Abstract ]
[ Link ]

Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors.
Guha, R.; Jurs, P.C.
J. Chem. Inf. Comput. Sci., 2004, 44, 2179-2189

[ Abstract ]
[DOI 10.1021/ci049849f ]

A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model with a training set error of 0.22 log(IC50) units (R2 = 0.93) and a prediction set error of 0.32 log(IC50) units (R2 = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition the neural network model contains the two most important descriptors indicated by the random forest model.
The Development of QSAR Models To Predict and Interpret the Biological Activity of Artemisinin Analogues
Guha, R.; Jurs, P.C.
J. Chem. Inf. Comput. Sci., 2004, 44, 1440-1449

[ Abstract ]
[DOI 10.1021/ci0499469 ]

This work presents the development of Quantitative Structure-Activity Relationship (QSAR) models to predict the biological activity of 179 artemisinin analogues. The structures of the molecules are represented by chemical descriptors that encode topological, geometric, and electronic structure features. Both linear (multiple linear regression) and nonlinear (computational neural network) models are developed to link the structures to their reported biological activity. The best linear model was subjected to a PLS analysis to provide model interpretability. While the best linear model does not perform as well as the nonlinear model in terms of predictive ability, the application of PLS analysis allows for a sound physical interpretation of the structure-activity trend captured by the model. On the other hand, the best nonlinear model is superior in terms of pure predictive ability, having a training error of 0.47 log RA units (R2 = 0.96) and a prediction error of 0.76 log RA units (R2 = 0.88).
Generation of QSAR Sets with a Self-Organizing Map.
Guha, R.; Serra, J.R.; Jurs, P.C.
J. Mol. Graph. Model., 2004, 23, 1-14

[ Abstract ]
[DOI 10.1016/j.jmgm.2004.03.003 ]

A Kohonen self-organizing map (SOM) is used to classify a data set consisting of dihydrofolate reductase inhibitors with the help of an external set of Dragon descriptors. The resultant classification is used to generate training, cross-validation (CV) and prediction sets for QSAR modeling using the ADAPT methodology. The results are compared to those of QSAR models generated using sets created by activity binning and a sphere exclusion method. The results indicate that the SOM is able to generate QSAR sets that are representative of the composition of the overall data set in terms of similarity. The resulting QSAR models are half the size of those published and have comparable RMS errors. Furthermore, the RMS errors of the QSAR sets are consistent, indicating good predictive capabilities as well as generalizability.

Written by Rajarshi Guha

June 29th, 2013 at 1:11 pm

Predictive models – Implementation vs Specification

with one comment

Benjamin Good recently asked about the existence of public repositories of predictive molecular signatures. From his description, he’s looking for platforms that are capable of deploying predictive models. The need for something like this is certainly not restricted to genomics – the QSAR field has been in need of this for many years. A few years back I described a system to deploy R models, and more recently the OCHEM platform attempts to address this. Pipelining tools usually have a web deployment mode that also supports this idea. One problem faced by such platforms in the cheminformatics area is that the deployed model must include the means to evaluate the input features (a.k.a. descriptors). Depending on the licenses associated with descriptor software, such a bundle may not be easily deployed. A gene-based predictor obviously doesn’t suffer from this problem, so it should be easier to implement. Benjamin points out the Synapse platform, which looks quite nice, but only supports R models (not necessarily a bad thing!). A very recent candidate for deploying generic predictive models (amongst other things) is via plugins for the BARD platform.

But in my mind, the deeper issue that should be addressed is that of model specification. With a robust specification, evaluation of the model could be implemented in arbitrary languages and platforms – essentially decoupling model definition and model implementation. PMML is one approach to predictive model specification and is quite general (and a good solution for the gene predictor models that Benjamin is interested in). A field-specific example would be QSAR-ML (also see here) for QSAR models. One could then imagine repositories of model specifications, with an ecosystem of tools and services that instantiate models from these specs.
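
As a concrete (if minimal) example of the specification idea, the R pmml package can serialize a fitted model to PMML, which any PMML-aware scoring engine can then instantiate. This is a sketch for a simple linear model and assumes the package supports the model type you care about.

    library(pmml)
    library(XML)

    fit <- lm(Ozone ~ Wind + Temp, data = airquality)    # stand-in for a QSAR model
    saveXML(pmml(fit), file = "ozone_lm.pmml")           # the specification, decoupled from R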

Written by Rajarshi Guha

May 1st, 2013 at 12:29 am

Notes & thoughts from the IU semantics workshop

without comments

Over the last two days I attended a workshop titled Exploiting Big Data Semantics for Translational Medicine, held at Indiana University, organized by David Wild, Ying Ding, Katy Borner and Eric Gifford. The stated goals were to explore advances in translational medicine via data and semantic technologies, with a view towards possible fundable ideas and funding opportunities. A nicely arranged workshop that was pretty intense – minimal breaks, constant thinking – which is a good use of 2 days. As you can see from the workshop website, the attendees brought a variety of skills and outlooks to the meeting. For me this was one of the most attractive features of the workshop.

This post is a rough dump of some observations & thoughts during the workshop – I’m sure I’ve left out important comments, provided minimal attribution, and I assume there will be a more thorough report coming out from the organizers. I also point out that I am an interested bystander to this field and somewhat of a semantic web/technology (SW/T) skeptic – so some views may be naive or just wrong. I like the ideas and concepts, I can see their value, but I have not been convinced to invest significant time and effort into “semantifying” my day to day work. A major motivation for my attending this workshop was to learn what the experts are doing and see how I could incorporate some of these ideas into my own work.

The Meeting

The first day started with 5 minute introductions, which were quite useful, and great overview talks by three of the attendees. After this information dump, a major focus of the day was a discussion of opportunities and challenges. This was a very useful session, with attendees listing specific instances of challenges, opportunities, bottlenecks and so on. I was able to take some notes on the challenges, including

  • Funding – lack of it and difficulty in obtaining it (i.e., persuading funders)
  • Cultural and social issues around semantic approaches (e.g., why change what’s already working? etc)
  • Data problems such as errors being propagated through ontologies and semantic conversion processes etc (I wonder to what extent this is a result of automated conversion processes such as D2R, versus manual errors introduced during curation. I suspect a mix of both)
  • “Hilbert Problems” – a very nice term coined by Katy to represent grand challenges or open problems that could serve as seeds around which the community could nucleate. (This aspect was of particular interest as I have found it difficult to identify compelling life science use cases that justify a retooling (even partial) of current workflows.)

The second day focused on breakout sessions, based on the opportunities and challenges listed the day before. Some notes on some of these sessions:

Bridging molecular data and clinical data – this session focused on challenges and opportunities in using molecular data together with clinical data to inform clinical decision making. Three broad opportunities came out of this, viz., Advancing understanding of disease conditions, Optimizing data types/measurements for clinical decision making outcomes and Drug repurposing. Certainly very broad goals, and not particularly focused on SW/T. My impression is that SW/T can play an important role in standardization and optimization of coding standards to more easily and robustly connect molecular and clinical data sources. But one certainly needn’t invoke SW/T to address these opportunities.

Knowledge discovery – the considerations addressed by this group included the fact that semantified data (vocabularies, ontologies etc) is increasing in volume and availability, tools are available to go from raw data to semantified forms and so on. An important point was made that quality is a key consideration at multiple levels – the raw data, the semantic representation and the links between semantic entities. A challenge identified by this group was to identify use cases that SW/T can resolve and traditional technologies cannot.

RDBMS vs semantic databases – this was an interesting session that tried to address the question of when one type of database is better than the other. It seems that the consensus was that certain problems are better suited for one type over another and a hybrid solution is usually a sensible approach – but that goes without saying. A comment was made that certain classes of problems that involve identifying paths between terms (nodes) are better suited for semantic (graph) databases – this makes intuitive sense, but there was a consensus that there weren’t any realistic applications that one could point to. I like the idea – have attributes in an RDBMS, but links in a graph database, and use graph queries to identify relations and entities that are then mapped back to the RDBMS. My concern is not the path traversals themselves (Neo4J does this quite efficiently) – the problem is the explosion of possible paths between nodes and the fact that the majority of them are trivial at best or nonsensical at worst. This suggests that relevance/ranking is a concern in semantic/graph databases.
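
To make the hybrid idea concrete, here is a toy sketch in R with hypothetical data, using igraph as a stand-in for the graph database: links live in a graph, attributes live in a relational-style table, and the entities returned by a path query are joined back to the table.

    library(igraph)

    edges <- data.frame(from = c("a", "b", "c"), to = c("b", "c", "d"))
    attrs <- data.frame(id = c("a", "b", "c", "d"), activity = c(5.2, 7.1, 6.4, 8.0))

    g    <- graph_from_data_frame(edges, directed = FALSE)
    path <- shortest_paths(g, from = "a", to = "d")$vpath[[1]]   # the "graph query"
    merge(data.frame(id = as_ids(path)), attrs, by = "id")       # map entities back to attributes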

The session of most interest to me was that of grand challenges. I think we got to 5 or 6 major challenges:

  • How to represent knowledge (methods for, evaluation of)
  • How do changes in ontologies affect scientific research over time
  • How to construct an ontology from a set of ontologies (i.e. preexisting knowledge) that is better than the individual ones (and so links to how to evaluate an ontology in terms of “goodness”)
  • Error propagation from measurements to representation to analysis
  • Visualization of multi-dimensional / high dimensional data – while a general challenge, I think it’s correct that visual representations of semantified data (and their supporting infrastructure such as ontologies) could make the methods and tools much more accessible. Would’ve been nice if we had more discussion on this aspect

We finally ended with a discussion of concrete projects that attendees would be interested in collaborating on, and this was quite fruitful.

My Opinion

It turns out that a good chunk of the discussion focused on translational medicine (clinical informatics, drug repurposing etc.) and the use of different data types to enable life science research, but largely independent of SW/T. Indeed, the role of SW/T seemed rather fuzzy at times – to some extent, a useful tool, but not indispensable. My impression was that much of the SW/T that was discussed really focused on labeling of knowledge via ontologies and making links between datasets and the challenges faced during these operations (which is fine and important – but does it justify funding?).

I certainly got some conflicting views of the state of the art. Comments from Amit Sheth made it appear that SW/T is well established and the main problems are solved, based on deployed applications in the “enterprise”. But comments from many of the attendees working in life sciences suggested many problems in dealing with and working with semantic data. Sure, Google has its Knowledge Graph and other search engines are employing SW/T under the hood. But if it’s so well established, where are the products, tools and workflows that an informatics-savvy non-expert in SW/T can employ? Does this mean research funding is not really required and it’s more of a productization/monetization issue? Or is this a domain specific issue – what works for general search doesn’t necessarily work in the life sciences?

My fundamental issue is the absence of a “killer application” – an application or use case that gives a non-trivial result that could not be achieved via traditional means. (I qualify this by asking for such use cases in life sciences. Maybe bankers have already found their killer applications). Depending on the semantic technology one considers there are partial answers: ontologies are an example of such a use case, when used to enable linkages between datasets and sources across domains. To me this makes perfect sense (and is of particular interest and use in current projects such as BARD). But surely, there must be more than designing ontologies and annotating data with ontological terms? One of the things that was surprising to me was that some of the future problems that were considered for possible collaborations were not really dependent on SW/T – in other words, they could largely be addressed via pre-existing methodologies.

My (admittedly cursory) reading of the SW/T literature seems to suggest that a major promise of this field is “reasoning” over my data. And I’m waiting for non-trivial assertions made based on linked data, ontologies and so on – ones that really highlight where my SQL tables will fail. It’s not sufficient (to me) to say that what took me 50 lines of Python code takes you 2 lines of SPARQL – I have an investment made in my RDBMS, APIs and codebase and yes, it takes a bit more fiddling – but I can get my answer in 5 minutes because it’s already been set up.

Some points were made regarding challenges faced by SW/T, including the complexity of OWL, difficulty in learning SPARQL, and poor performance of queries. Personally, these are not valid challenges and I certainly do not make the claim that tricky SPARQL queries are preventing me from jumping into SW/T. I’m perfectly willing to wait 5 min for a SPARQL query to run, if the outcome is of sufficient value. The bigger issue for me is the value of the outcomes – maybe it’s just too early for truly novel, transformative results to be produced. Or maybe it’s simply one tool amongst others that can be used to tackle a certain class of problems.

Overall, it was a worthwhile two days interacting with a group of interesting people. But definitely some fuzziness in terms of what role SW/T can, should or will play in translational life science research.

Written by Rajarshi Guha

March 27th, 2013 at 7:26 pm

Competitive Predictive Modeling – How Useful is it?

with 5 comments

While at the ACS National Meeting in Philadelphia I attended a talk by David Thompson of Boehringer Ingelheim (BI), where he spoke about a recent competition BI sponsored on Kaggle – a web site that hosts data mining competitions. In this instance, BI provided a dataset that contained only object identifiers, about 1700 numerical features and a binary dependent variable. The contest was open to anybody, and whoever built the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also in David’s slides.
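
For reference, the contest metric is simple to compute; a small R function for binary log loss (with the predicted probabilities clipped to avoid log(0)) might look like this.

    log.loss <- function(actual, predicted, eps = 1e-15) {
      p <- pmin(pmax(predicted, eps), 1 - eps)            # clip probabilities away from 0 and 1
      -mean(actual * log(p) + (1 - actual) * log(1 - p))
    }

    log.loss(c(1, 0, 1), c(0.9, 0.2, 0.6))                # example call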

But I’m curious about the utility of such a competition. During the competition, all the contestants had access to were the numerical features. So the contestants had no idea of the domain from which the data came – placing the onus on pure modeling ability, with no need for domain knowledge. But in fact the dataset provided to them, as announced by David at the ACS, was the Hansen AMES mutagenicity dataset characterized using a collection of 2D descriptors (continuous topological descriptors as well as binary fingerprints).

BI included some “default” models and the winning models certainly performed better (10% better for the winning model). This is not surprising, as they did not attempt to build optimized models. But then we also see that the top 5 models differed only incrementally in their log loss values. Thus any one of the top 3 or 4 models could be regarded as a winner in terms of actual predictions.

What I’d really like to know is how well such an approach leads to better chemistry or biology. First, it’s clear that such an approach leads to the optimization of pure predictive performance and cannot provide insight into why the model makes an active or inactive call. In many scenarios this is sufficient, but more often than not, domain specific diagnostics are invaluable. Second, how does the relative increase in model performance lead to better decision making? Granted, the crowd-sourced, gamified approach is a nice way to eke out the last bits of predictive performance on a dataset – but does it really matter that one model performs 1% better than the next best model? The fact that the winning model was 10% better than the “default” BI model is not too informative. So a specific question I have is: was there a benefit, in terms of model performance and downstream decision making, in asking the crowd for a better model, compared to what BI had developed using (implicit or explicit) chemical knowledge?

My motivation is to try and understand whether the winning model was an incremental improvement or a significant jump, not just in terms of numerical performance, but in terms of the predicted chemistry/biology. People have been making noises about how data trumps knowledge (or rather hypotheses and models) and I believe that in some cases this can be true. But I also wonder to what extent this holds for chemical data mining.

But it’s equally important to understand what such a model is to be used for. In a virtual screening scenario, one could probably ignore interpretability and go for pure predictive performance. In such cases, for increasingly large libraries, it might make sense for one to have a model that is 1% better than the state of the art. (In fact, there was a very interesting talk by Nigel Duffy of Numerate, where he spoke about a closed form, analytical expression for the hit rate in a virtual screen, which indicates that for improvements in the overall performance of a VS workflow, the best investment is to increase the accuracy of the predictive model. Indeed, his results seem to indicate that even incremental improvements in model accuracy lead to a decent boost to the hit rate.)

I want to stress that I’m not claiming that BI (or any other organization involved in this type of activity) has the absolute best models and that nobody can do better. I firmly believe that however good you are at something, there’s likely to be someone better at it (after all, there are 6 billion people in the world). But I’d also like to know how incrementally better models do when put to the test of real, prospective predictions.

Written by Rajarshi Guha

August 22nd, 2012 at 9:02 pm