So much to do, so little time

Trying to squeeze sense out of chemical data


Exploring co-morbidities in medical case studies

with 2 comments

A previous post described a first look at the data available in casesdatabase.com, primarily looking at summaries of high level meta-data. In this post I start looking at the cases themselves. As I noted previously, BMC has performed some form of biomedical entity recognition on the abstracts (?) of the case studies, resulting in a set of keywords for each case study. The keywords belong to specific types such as Condition, Medication and so on. The focus of this post is to explore the occurrence of co-morbidities – which conditions occur together, to what extent, and whether such occurrences differ from random. The code to extract the co-morbidity data and generate the analyses below is available in co-morbidity.py.

Before doing any analyses we need to do some clean up of the Condition keywords. This includes normalizing terms (replacing ‘comatose’ with ‘coma’, converting all diabetes variants such as Type 1 and Type 2 to just diabetes), fixing spelling variants (replacing ‘foetal’ with ‘fetal’), removing stopwords and so on. The Python code to perform this clean up requires that we manually identify these transformations. I haven’t done this rigorously, so it’s not a totally cleansed dataset. The cleanup code looks like:

def cleanTerms(terms):
    ## map spelling and term variants to a canonical form; identity entries
    ## (e.g. 'tuberculosis') collapse any term containing the key, so
    ## 'pulmonary tuberculosis' becomes just 'tuberculosis'
    repMap = {'comatose':'coma',
              'seizures':'seizure',
              'foetal':'fetal',
              'haematomas':'haematoma',
              'disorders':'disorder',
              'tumour':'tumor',
              'abnormalities':'abnormality',
              'tachycardias':'tachycardia',
              'lymphomas':'lymphoma',
              'tuberculosis':'tuberculosis',
              'hiv':'hiv',
              'anaemia':'anemia',
              'carcinoma':'carcinoma',
              'metastases':'metastasis',
              'metastatic':'metastasis',
              '?':'-'}
    stopwords = ['state', 'syndrome', 'low grade', 'fever', 'type ii', 'mellitus',
                 'type 2', 'type 1', 'systemic', 'homogeneous', 'disease']
    l = []
    terms = [x.lower().strip() for x in terms]
    for term in terms:
        for sw in stopwords: term = term.replace(sw, '')
        for key in repMap.keys():
            if term.find(key) >= 0: term = repMap[key]
        term = term.encode("ascii", "ignore").replace('\n', '').strip()
        l.append(term)
    ## drop placeholder terms and terms emptied by stopword removal
    l = filter(lambda x: x != '' and x != '-', l)
    return(list(set(l)))

Since each case study can be associated with multiple conditions, we generate a set of unique condition pairs for each case, and collect these for all 28K cases I downloaded previously.

from itertools import combinations
import pickle

cases = pickle.load(open('cases.pickle'))
allpairs = []
for case in cases:
    ## get all conditions for this case
    conds = filter(lambda x: x['type'] == 'Condition', case['keywords'])
    conds = cleanTerms([x['text'] for x in conds])
    if len(conds) == 0: continue
    conds.sort()
    allpairs.extend(combinations(conds, 2))

It turns out that across the whole dataset, there are a total of 991,466 pairs of conditions corresponding to 576,838 unique condition pairs and 25,590 unique conditions. Now, it’s clear that some condition pairs may be causally related (some of which are trivial cases such as cough and infection), whereas others are not. In addition, it is clear that some condition pairs are related in a semantic, rather than causal, fashion – carcinoma and cancer. In the current dataset we can’t differentiate between these classes. One possibility would be to code the conditions using ICD10 and collapse terms using the hierarchy.

Number of co-morbidities vs frequency of occurrence


Having said that, we work with what we currently have – and it’s quite sparse. In fact, the 28K case studies represent just 0.16% of all possible co-morbidities. Within the set of just under 600K unique observed co-morbidities, the bulk occur just once. For the rest of the analysis we ignore these singleton co-morbidities (leaving us with 513,997 co-morbidities). It’s interesting to see the distribution of frequencies of co-morbidities. The first figure plots the number of co-morbidities that occur at least N times – 99,369 co-morbidities occur 2 or more times in the dataset, and so on.
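These cumulative counts can be computed directly from the pair list with a Counter; a minimal sketch, using a toy pair list in place of the full set of extracted pairs:

```python
from collections import Counter

## toy pair list standing in for the full list of extracted condition pairs
allpairs = [('cough', 'fever'), ('cough', 'fever'), ('coma', 'seizure'),
            ('cough', 'fever'), ('anemia', 'fatigue'), ('coma', 'seizure')]
counts = Counter(allpairs)

def n_at_least(counts, n):
    ## number of unique co-morbidity pairs occurring at least n times
    return sum(1 for c in counts.values() if c >= n)

print(n_at_least(counts, 2))  ## pairs seen 2 or more times
```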

Another way to visualize the data is to plot a pairwise heatmap of conditions. For pairs of conditions that occur in the cases dataset we can calculate the probability of occurrence (i.e., the number of times the pair occurs divided by the total number of pairs). Furthermore, using a sampling procedure we can estimate the number of times a given pair would be selected randomly from the pool of conditions. For the current analysis, I used 1e7 samples to estimate the probability of a co-morbidity occurring by chance. If this probability is greater than the observed probability, I label that co-morbidity as not different from random (i.e., insignificant). Ideally, I would evaluate a confidence interval or else evaluate the probability analytically (?).
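A sketch of such a sampling procedure, using a small hypothetical pool of condition mentions (with multiplicities) in place of the real distribution of mentions across all cases:

```python
import random
from collections import Counter

random.seed(42)

## hypothetical pool of condition mentions, with multiplicities; in the real
## analysis this would be every condition occurrence across all cases
pool = ['cough'] * 50 + ['fever'] * 30 + ['coma'] * 15 + ['anemia'] * 5

nsamples = 100000
sampled = Counter()
for _ in range(nsamples):
    a, b = random.sample(pool, 2)          ## draw a pair without replacement
    sampled[tuple(sorted((a, b)))] += 1

## estimated probability that a given pair co-occurs purely by chance
p_random = sampled[('cough', 'fever')] / float(nsamples)
```

Comparing `p_random` against the observed probability for the same pair then flags co-morbidities that are not distinguishable from random pairing.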

For the figure below, I considered the 48 co-morbidities (corresponding to 25 unique conditions) that occurred 250 or more times in the dataset. I display the lower triangle of the heatmap – grey indicates no occurrences for a given co-morbidity and white X’s identify co-morbidities that have a non-zero probability of occurrence but are not different from random. As noted above, some of these pairs are not particularly informative – for example, tumor and metastasis occur with a relatively high probability, but this is not too surprising.

Probability of occurrence for co-morbidities occurring more than 250 times


It’s pretty easy to modify co-morbidity.py to look at other sets of co-morbidities. Ideally, however, we would precompute probabilities for all co-morbidities and then support interactive visualization (maybe using D3).

It’s also interesting to look at co-morbidities that include a specific condition. For example, let’s consider tuberculosis (and all variants). There are 948 unique co-morbidities that include tuberculosis as one of the conditions. While the bulk of them occur just twice, there are a number with relatively large frequencies of occurrence – lymphadenopathy co-occurs with tuberculosis 203 times. Rather than tabulate the co-occurring conditions, we can use the frequencies to generate a word cloud, as shown below. As with the co-morbidity heatmaps, this could be easily automated to support interactive exploration. On a related note, it’d be quite interesting to compare the frequencies discussed here with data extracted from a live EHR system.
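Collecting the frequencies that feed the word cloud is a one-pass scan over the pair list; a sketch with a toy pair list in place of the full dataset:

```python
from collections import Counter

## toy pair list; in the real analysis this is the full list of condition pairs
allpairs = [('lymphadenopathy', 'tuberculosis'),
            ('lymphadenopathy', 'tuberculosis'),
            ('fever', 'tuberculosis'),
            ('cough', 'fever')]

## frequencies of conditions co-occurring with tuberculosis; these weights
## can be handed directly to a word cloud generator
partners = Counter()
for a, b in allpairs:
    if a == 'tuberculosis':
        partners[b] += 1
    elif b == 'tuberculosis':
        partners[a] += 1
```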

A visualization of conditions most frequently co-occurring with tuberculosis


So far this has been descriptive – given the size of the data, we should be able to try out some predictive models. Future posts will look at the possibilities of modeling the case studies dataset.

Written by Rajarshi Guha

October 12th, 2013 at 10:43 pm

Exploring medical case studies

with one comment

I recently came across http://www.casesdatabase.com/ from BMC, a collection of more than 29,000 peer-reviewed case studies collected from a variety of journals. I’ve been increasingly interested in the possibilities of mining clinical data (inspired by impressive work from Atul Butte, Nigam Shah and others), so this seemed like a great resource to explore.

The folks at BMC have provided a REST API, which is still in development – as a result, there’s no public documentation and it still has a few rough edges. However, thanks to help from Demitrakis Kavallierou, I was able to interact with the API and extract summary search information as well as 28,998 case studies as of Sept 23, 2013. I’ve made the code to extract case studies available as proc.py. Running this, gives you two sets of data.

  1. A JSON file for each year between 2000 and 2014, containing the summary results for all cases in that year which includes a summary view of the case, plus facets for a variety of fields (age, condition, pathogen, medication, intervention etc.)
  2. A pickle file containing the case reports, as a list of maps. The case report contains the full abstract, case report identifier and publication meta-data.

A key feature of the case report entries is that BMC has performed some form of entity recognition, so that each entry provides a list of keywords of different types: ‘Condition’, ‘Symptom’, ‘Medication’ etc. Each case may have multiple occurrences of each type of keyword and, importantly, each keyword is associated with the text fragment it was extracted from. As an example, consider case 10.1136/bcr.02.2009.1548. The entry extracts two conditions:

{u'sentence': u'She was treated by her family physician for presumptive interscapular myositis with anti-inflammatory drugs, cold packs and rest.',
 u'text': u'Myositis',
 u'type': u'Condition'}

and

{u'sentence': u'The patient denied any constitutional symptoms and had no cough.',
 u'text': u'Cough',
 u'type': u'Condition'}

I’m no expert in biomedical entity recognition, but the fact that BMC has performed it saves me from having to become one, allowing me to dig into the data. But there are the usual caveats associated with text mining – spelling variants, term variants (insulin and insulin therapy are probably equivalent) and so on.

Count of cases deposited per year


However, before digging into the cases themselves, we can use the summary data, and especially the facet information (which is, by definition, standardized), to get some quick summaries from the database. For example, we see a steady increase in case studies deposited in the literature over the last decade or so.

Interestingly, the number of unique conditions, medications or pathogens reported for these case studies is more or less constant, though there seems to be a downward trend for conditions. The second graph highlights this trend, by plotting the number of unique facet terms (for three types of facets) per year, normalized by the number of cases deposited that year.

Normalized count of unique facet terms by year


This is a rough count, since I didn’t do any clean up of the text – misspellings of the same term (say, acetaminophen and acetaminaphen) will be counted as two separate medication facets.
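The normalization step itself is straightforward; a sketch, assuming (hypothetically) that each year’s summary JSON has already been reduced to a list of facet terms plus a case count:

```python
## hypothetical per-year summaries distilled from the JSON facet data
summaries = {
    2010: {'ncases': 100, 'Condition': ['cough', 'fever', 'coma']},
    2011: {'ncases': 200, 'Condition': ['cough', 'fever', 'coma', 'anemia']},
}

## unique Condition facet terms per year, normalized by cases deposited
norm = dict((year, len(set(d['Condition'])) / float(d['ncases']))
            for year, d in summaries.items())
```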

Another interesting task would be to enrich the dataset with additional annotations - ICD9/ICD10 for conditions, ATC for drugs – which would allow a higher level categorization and linking of case studies. In addition, one could use the CSLS service to convert medication names to chemical structures and employ structural similarity to group case studies.

The database also records some geographical information for each case. Specifically, it lists the countries that the authors are from. While interesting to an extent, it would have been nice if the country of occurrence or country of treatment were specifically extracted from the text. Currently, one might infer that the treatment occurred in the same country as the author is from, but this is likely only true when all authors are from the same country. Certainly, multinational collaborations will hide the true number of cases occurring in a given country (especially so for tropical diseases).

But we can take a look at how the number of cases reported for specific conditions, varies with geography and time. The figure below shows the cases whose conditions included the term tuberculosis

Tuberculosis cases by country and year


The code to extract the data from the pickle file is in condition_country.py. Assuming you have cases.pickle in your current path, usage is

$ python condition_country.py condition_name

and will output the data into a CSV file, which you can then process using your favorite tools.

In following blog posts, I’ll start looking at the actual case studies themselves. Interesting things to look at include exploring the propensity of co-morbidities, and analysing the co-occurrence of conditions and medications or conditions and pathogens to see whether the set of treatments associated with a given condition (or pathogen) has changed over time. Both of these naturally lead to looking at the data with an eye towards repurposing events.

Written by Rajarshi Guha

October 10th, 2013 at 7:20 pm

Life and death in a screening campaign

without comments

So, how do I enjoy my first day of furlough? Go out for a nice ride. And then read up on some statistics. More specifically, I was browsing The R Book and came across survival models. Such models are used to characterize the time to an event, where the event could be the death of a patient or the failure of a part, and so on. In these types of models the dependent variable is the number of time units that pass until the event in question occurs. Usually the goal is to model the time to death (or failure) as a function of some properties of the individuals.

It occurred to me that molecules in a drug development pipeline also face a metaphorical life and death. More specifically, a drug development pipeline consists of a series of assays – primary, primary confirmation, secondary (orthogonal), ADME panel, animal model and so on. Each assay can be thought of as representing a time point in the screening campaign at which a compound could be discarded (“death”) or selected (“survived”) for further screening. While there are obvious reasons why some compounds get selected from an assay and others do not (beyond just showing activity), it would be useful if we could quantify how molecular properties affect the number and types of compounds making it to the end of the screening campaign. Do certain scaffolds have a higher propensity of “surviving” till the in vivo assay? How do molecular weight, lipophilicity etc. affect a compound’s “survival”? One could go up one level of abstraction and do a meta-analysis of screening campaigns where related assays would be grouped (so assays of type X all represent time point Y), allowing us to ask whether specific assays are more or less indicative of a compound’s survival in a campaign. Survival models allow us to address these questions.

How can we translate the screening pipeline to the domain of survival analysis? Since each assay represents a time point, we can assign a “survival time” to each compound equal to the number of assays it is tested in. Having defined the Y-variable, we must then select the independent variables. Feature selection is a never-ending topic so there’s lots of room to play. It is clear however, that descriptors derived from the assays (say ADMET related descriptors) will not be truly independent if those assays are part of the sequence.
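The Y-variable definition above can be sketched in a few lines. Here the compound names, the assay sequence and the censoring convention are all illustrative: a compound that reaches the end of the pipeline has not experienced the “death” event and would be treated as right-censored.

```python
## hypothetical pipeline and record of which assays each compound was tested in
pipeline = ['primary', 'confirmation', 'secondary', 'admet', 'invivo']
tested = {
    'cmpd1': ['primary'],
    'cmpd2': ['primary', 'confirmation'],
    'cmpd3': ['primary', 'confirmation', 'secondary', 'admet', 'invivo'],
}

## survival time = number of assays a compound was tested in; a compound that
## reaches the final assay is right-censored (event = 0), otherwise it "died"
## somewhere along the pipeline (event = 1)
surv = {}
for cmpd, assays in tested.items():
    t = len(assays)
    surv[cmpd] = (t, 0 if t == len(pipeline) else 1)
```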

Having defined the X and Y variables, how do we go about modeling this type of data? First, we must decide what type of survivorship curve characterizes our data. Such a curve characterizes the proportion of individuals alive at a certain time point. There are three types of survivorship curves: I, II and III corresponding to scenarios where individuals have a higher risk of death at later times, a constant risk of death and individuals have a higher risk of death at earlier times, respectively.
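As a toy illustration, an empirical survivorship curve is just the proportion of compounds still “alive” at each stage; the attrition counts below are hypothetical:

```python
## hypothetical counts of compounds surviving to each stage of a campaign
counts = [400000, 2000, 50, 10]

## empirical survivorship curve S(t): proportion still "alive" at each stage;
## the steep early drop is the signature of a Type III curve
S = [float(n) / counts[0] for n in counts]
```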

For the case of a screening campaign, a Type III survivorship curve seems most appropriate. There are other details but, in general, they follow from the type of survivorship curve selected for modeling. I will note that the hazard function is an important choice to be made when using parametric models. There are a variety of functions to choose from, but they either require that you know the error distribution or else that you are willing to use trial and error. The alternative is to use a non-parametric approach. The most common approach for this class of models is the Cox proportional hazards model. I won’t go into the details of either approach, save to note that using a Cox model does not allow us to make predictions beyond the last time point, whereas a parametric model would. For the case at hand, we are not really concerned with going beyond the last time point (i.e., the last assay) but are more interested in knowing what factors might affect the survival of compounds through the assay sequence. So, a Cox model should be sufficient. The survival package provides the necessary methods in R.

OK – it sounds cute, but has some obvious limitations

  1. The use of a survival model assumes a linear time line. In many screening campaigns, the individual assays may not follow each other in a linear fashion. So either they must be collapsed into a linear sequence or else some assays should be discarded.
  2. A number of the steps represent ‘subjective selection’. In other words, each time a subset of molecules are selected, there is a degree of subjectivity involved – maybe certain scaffolds are more tractable for med chem than others or some notion of interesting combined with a hunch that it will work out. Essentially chemists will employ heuristics to guide the selection process – and these heuristics may not be fully quantifiable. Thus the choice of independent variables may not capture the nuances of these heuristics. But one could argue that it is possible the model captures the underlying heuristics via proxy variables (i.e., the descriptors) and that examination of those variables might provide some insight into the heuristics being employed.
  3. Data size will be an issue. As noted, this type of scenario requires the use of a Type III survivorship curve (i.e., most death occurs at earlier times and the death rate decreases with increasing time). However, the decrease in death rate is extremely steep – out of 400,000 compounds screened in a primary assay, maybe 2,000 will be cherry picked for confirmation and about 50 molecules may be tested in secondary, orthogonal assays. If we go out further to ADMET and in vivo assays, we may have fewer than 10 compounds to work with. At this stage I don’t know what effect such a steeply decreasing survivorship curve would have on the model.

The next step is to put together a dataset to see what we can pull out of a survival analysis of a screening campaign.

Written by Rajarshi Guha

October 2nd, 2013 at 10:22 pm

Learning Representations – Digits, Cats and Now Molecules

with 3 comments

Deep learning has been getting some press in the last few months, especially with the Google paper on recognizing cats (amongst other things) from Youtube videos. The concepts underlying this machine learning approach have been around for many years, though recent work by Hinton and others has led to fast implementations of the algorithms as well as a better theoretical understanding.

It took me a while to realize that deep learning is really about learning an optimal, abstract representation in an unsupervised fashion (in the general case), given a set of input features. The learned representation can then be used as input to any classifier. A key aspect of such learned representations is that they are, in general, agnostic with respect to the final task for which they are used. In the Google “cat” project this meant that the final representation developed the concept of cats as well as faces. As pointed out by a colleague, Bengio et al have published an extensive and excellent review of this topic and Baldi also has a nice review on deep learning.

In any case, it didn’t take too long for this technique to be applied to chemical data. The recent Merck-Kaggle challenge was won by a group using deep learning, but neither their code nor approach was publicly described. A more useful discussion of deep learning in cheminformatics was recently published by Lusci et al where they develop a DAG representation of structures that is then fed to a recursive neural network (RNN). They then use the resultant representation and network model to predict aqueous solubility.

A key motivation for the new graph representation and deep learning approach was the observation

one cannot be certain that the current molecular descriptors capture all the relevant properties required for solubility prediction

A related motivation was that they desired to apply deep learning methods directly to the molecular graph, which in general, is of variable size compared to fixed length representations (fingerprints or descriptor sets). It’s an interesting approach and you can read the paper for more details, but a few things caught my eye:

  • The motivation for the DAG based structure description didn’t seem very robust. Shouldn’t a learned representation be discoverable from a set of real-valued molecular descriptors (or even fingerprints)? While it is possible that all the physical aspects of aqueous solubility may not be captured in the current repertoire of molecular descriptors, I’d think that most aspects are. Certainly some characterizations may be too time consuming (QM descriptors) for a cheminformatics setting.
  • The results are not impressive compared to pre-existing models for the datasets they used. This is all the more surprising given that the method is actually an ensemble of RNN’s. For example, in Table 2 the best RNN model has an R2 of 0.92 versus 0.91 for the pre-existing model (a 2D kernel), though R2 is not an ideal metric for non-linear regression. Even the RMSE is only 0.03 units better than the pre-existing model. However, it is certainly true that the unsupervised nature of the representation learning step is quite attractive – this is evident in the case of the intrinsic solubility dataset, where they achieve similar results to the prior model. But the prior model employed a manually selected set of topological descriptors.
  • It would’ve been very interesting to look at the transferability of the learned representation by using it to predict another physical property unrelated (at least directly) to solubility.

One characteristic of deep learning methods is that they work better when provided a lot of training data. With the exception of the Huuskonen dataset (4000 molecules), none of the datasets used were very large. If training set size is really an issue, the Burnham solubility dataset with 57K observations would have been a good benchmark.

Overall, I don’t think the actual predictions are too impressive using this approach. But the more important aspect of the paper is the ability to learn an internal representation in an unsupervised manner and the promise of transferability of such a representation. In a way, it’d be interesting to see what an abstract representation of a molecule could be like, analogous to what a deep network thinks a cat looks like.

Written by Rajarshi Guha

July 2nd, 2013 at 2:41 am

Publications

without comments

67. The ATR Inhibitor VE-821 in Combination with Topoisomerase I Inhibitors Kills Cancer Cells by Disabling DNA Replication Initiation and Fork Elongation
Josse, R.; S.E., M.; Guha, R.; Ormanoglu, P.; Pfister, T.; Morris, J.; Doroshow, J.; Pommier, Y.
Cancer Research, 2014, submitted
Camptothecin, a specific topoisomerase I inhibitor, is a potent anticancer drug, especially against solid tumors. This agent produces well-characterized double-strand breaks upon collision of replication forks with topoisomerase I cleavage complexes. In an attempt to improve its efficacy, we conducted a synthetic lethal siRNA screening using a library that targets nearly 7000 human genes. Depletion of ATR, the main transducer of replication stress-induced DNA damage response exacerbated cytotoxic response to both camptothecin and the indenoisoquinoline LMP-400, a novel class of topoisomerase inhibitors in clinical trial. Inhibition of ATR by the recently developed specific inhibitor VE-821 induced synergistic antiproliferative activity when combined with either topoisomerase inhibitor. Cytotoxicity induced by the combination with LMP-400 was greater than with camptothecin. Using single cell analysis and DNA fiber spread, we show that VE-821 abrogated the S-phase checkpoint, restored origin firing and replication fork progression. Moreover, the combination of a topoisomerase inhibitor with VE-821 inhibited the phosphorylation of ATR and ATR-mediated Chk1 phosphorylation but strongly induced γH2AX. Single cell analysis revealed that γH2AX pattern changed over time from well-defined focus to a pan-nuclear staining. The change in γH2AX pattern can be useful as a predictive biomarker to evaluate the efficacy of therapy. The key implication of our work is the clinical rationale it provides to evaluate the combination of indenoisoquinoline topoisomerase I inhibitors with ATR inhibitors.
66. Blockade of Oncogenic IκB Kinase Activity in ABC DLBCL by Small Molecule BET Protein Inhibitors
Ceribelli, M.; Priscilla, K.; Shaffer, A.L.; Wright, G.; Yang, Y.; Mathews-Griner, L.A.; Guha, R.; Shinn, P.; Keller, J.M.; Liu, D.; Patel, P.R.; Ferrer, M.; Joshi, S.; Nerle, S.; Sandy, P.; Normant, E.; Thomas, C.J.; Staudt, L.M.
Proc. Natl. Acad. Sci., 2014, in press

[ Abstract ]
[DOI 10.1073/pnas.1411701111 ]

The activated B-cell–like (ABC) subtype of diffuse large B-cell lymphoma (DLBCL) is an aggressive cancer that can only be cured in roughly 40% of cases. These malignant cells rely on the NF-κB signaling pathway for survival. Here, we report that genetic or pharmacologic interference with bromodomain and extraterminal domain (BET) chromatin proteins reduces NF-κB activity and ABC DLBCL viability. Unexpectedly, the mechanism involves inhibition of IκB kinase, the key cytoplasmic enzyme that activates the NF-κB pathway. The NF-κB pathway in ABC DLBCL is activated by B-cell receptor signaling, which can be blocked by the BTK kinase inhibitor ibrutinib. BET inhibitors synergized with ibrutinib to decrease growth of ABC DLBCL tumors in mouse models. BET inhibitors should be evaluated in ABC DLBCL clinical trials.
65. Genome Editing-Enabled HTS Assays Expand Drug Target Pathways for Charcot-Marie-Tooth Disease
Inglese, J.; Dranchak, P.; Moran, P.; Jang, S.-W.; Cost, G.J.; Srinivasan, R.; Guha, R.; Martinez, N.; MacArthur, R.; Urnov, F.D.; Svaren, J.
ACS Chem. Biol., 2014, in press
Copy number variation resulting in excess PMP22 protein causes the peripheral neuropathy, Charcot-Marie-Tooth Disease, type 1A. To broadly interrogate chemically sensitive transcriptional pathways controlling PMP22 protein level, we used the targeting precision of TALEN-mediated genome editing to embed reporters within the genetic locus harboring the Peripheral Myelin Protein 22 (Pmp22) gene. Using a Schwann cell line with constitutively high endogenous levels of Pmp22 we obtained monoallelic insertion of secreted bioluminescent reporters with sufficient signal to enable a 1536-well assay. Our findings from the quantitative high-throughput screening (qHTS) of several thousand drugs and clinically investigated compounds using this assay design both overlapped and expanded results from a previous assay using a randomly inserted reporter gene controlled by a single regulatory element of the Pmp22 gene. A key difference was the identification of a kinase-controlled inhibitory pathway of Pmp22 transcription revealed by the activity of the Protein kinase C (PKC)-modulator bryostatin.
64. A High-Throughput Assay for Small Molecule Destabilizers of the KRAS Oncoprotein
Carver, J.; Dexheimer, T.S.; Hsu, D.; Weng, M.T.; Guha, R.; Jadhav, A.; Simeonov, A.; Luo, J.
PLoS One, 2014, 9, e103836
63. On the Validity versus Utility of Activity Landscapes: Are All Activity Cliffs Statistically Significant?
Guha, R.; Medina-Franco, J.L.
J. Cheminf., 2014, 6,

[ Abstract ]
[DOI 10.1186/1758-2946-6-11 ]

Most work on the topic of activity landscapes has focused on their quantitative description and visual representation, with the aim of aiding navigation of SAR. Recent developments have addressed applications such as quantifying the proportion of activity cliffs, investigating the predictive abilities of activity landscape methods and so on. However, all these publications have worked under the assumption that the activity landscape models are “real” (i.e., statistically significant).
RESULTS:
The current study addresses for the first time, in a quantitative manner, the significance of a landscape or individual cliffs in the landscape. In particular, we question whether the activity landscape derived from observed (experimental) activity data is different from a randomly generated landscape. To address this we used the SALI measure with six different data sets tested against one or more molecular targets. We also assessed the significance of the landscapes for single and multiple representations.
CONCLUSIONS:
We find that non-random landscapes are data set and molecular representation dependent. For the data sets and representations used in this work, our results suggest that not all representations lead to non-random landscapes. This indicates that not all molecular representations should be used to a) interpret the SAR and b) be combined to generate consensus models. Our results suggest that significance testing of activity landscape models and in particular, activity cliffs, is key, prior to the use of such models.
62. An Overview of the Challenges in Designing, Integrating, and Delivering BARD: A Public Chemical-Biology Resource and Query Portal for Multiple Organizations, Locations, and Disciplines
de Souza, A.; Bittker, J.; Lahr, D.; Brudz, S.; Chatwin, S.; Oprea, T.I.; Waller, A.; Yang, A.; Southall, N.; Guha, R.; Schurer, S.; Vempati, U.; Southern, M.R.; Dawson, E.S.; Clemons, P.A.; Chung, T.D.Y.
J. Biomol. Screen., 2014, 19, 614-627

[ Abstract ]
[DOI 10.1177/1087057113517139 ]

Recent industry-academic partnerships involve collaboration among disciplines, locations, and organizations using publicly funded “open-access” and proprietary commercial data sources. These require the effective integration of chemical and biological information from diverse data sources, which presents key informatics, personnel, and organizational challenges. The BioAssay Research Database (BARD) was conceived to address these challenges and serve as a community-wide resource and intuitive web portal for public-sector chemical-biology data. Its initial focus is to enable scientists to more effectively use the National Institutes of Health Roadmap Molecular Libraries Program (MLP) data generated from the 3-year pilot and 6-year production phases of the Molecular Libraries Probe Production Centers Network (MLPCN), which is currently in its final year. BARD evolves the current data standards through structured assay and result annotations that leverage BioAssay Ontology and other industry-standard ontologies, and a core hierarchy of assay definition terms and data standards defined specifically for small-molecule assay data. We initially focused on migrating the highest-value MLP data into BARD and bringing it up to this new standard. We review the technical and organizational challenges overcome by the interdisciplinary BARD team, veterans of public- and private-sector data-integration projects, who are collaborating to describe (functional specifications), design (technical specifications), and implement this next-generation software solution.
61. High-Throughput Combinatorial Screening Identifies Drugs that Cooperate with Ibrutinib to kill ABC Diffuse Large B Cell Lymphoma Cells
Mathews, L.; Guha, R.; Shinn, P.; Young, R.A.; Keller, J.; Liu, D.; Goldlust, I.S.; Yasgar, A.; McKnight, C.; Boxer, M.B.; Duveau, D.; Jiang, J.K.; Michael, S.; Mierzwa, T.; Huang, W.; Walsh, M.J.; Mott, B.T.; Patel, P.R.; Leister, W.; Maloney, D.J.; LeClair, C.A.; Rai, G.; Jadhav, A.; Peyser, B.D.; Austin, C.P.; Martin, S.; Simeonov, A.; Ferrer, M.; Staudt, L.M.; Thomas, C.J.
Proc. Nat. Acad. Sci., 2014, 111, 2349-2354

[ Abstract ]
[DOI 10.1073/pnas.1311846111 ]

The clinical development of drug combinations is typically achieved through trial-and-error or via insight gained through a detailed molecular understanding of dysregulated signaling pathways in a specific cancer type. Unbiased small molecule combination (matrix) screening represents a high-throughput means to explore hundreds and even thousands of drug-drug pairs for potential investigation and translation. Here, we describe a high-throughput screening platform capable of testing compounds in pair-wise matrix blocks for the rapid and systematic identification of synergistic, additive and antagonistic drug combinations. Experimental details are provided for this platform including the software codes for a novel compound dispensing methodology and a web-based data interface. We utilize this platform to conduct a combination screen to determine drug-drug combinations for the Bruton’s tyrosine kinase (BTK) inhibitor ibrutinib (PCI-32765) against the activated B-cell-like subtype of diffuse large B-cell lymphoma (ABC DLBCL). The results of this study highlight a striking level of synergy/additivity between ibrutinib and inhibitors of the PI3K-AKT-mTOR signaling cascade including the PI3K inhibitor BKM-120, the AKT inhibitor MK-2206 and the mTOR inhibitor everolimus. We also found that ibrutinib had strong combination responses with chemotherapeutic components of the current standards of care for DLBCL including doxorubicin, gemcitabine and docetaxel.
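One common way to score such pairwise matrix blocks for synergy, additivity, or antagonism is excess over the Bliss independence model: the observed combination effect minus the expectation fa + fb - fa*fb from the single agents. This is a minimal illustrative sketch, not necessarily the scoring used in the paper, and all numbers are invented:

```python
# Illustrative sketch: scoring a dose-matrix combination screen with the
# Bliss independence model. Inputs are fractional inhibitions (0-1) for
# each single agent and for each dose pair; positive excess suggests
# synergy, negative suggests antagonism. Toy numbers, not data from the paper.

def bliss_excess(fa, fb, fab):
    """Observed combination effect minus the Bliss expectation fa + fb - fa*fb."""
    expected = fa + fb - fa * fb
    return fab - expected

# Toy 2x2 block: rows = doses of drug A, columns = doses of drug B.
fa = [0.2, 0.5]            # single-agent inhibition of A at two doses
fb = [0.1, 0.4]            # single-agent inhibition of B at two doses
fab = [[0.35, 0.60],       # observed inhibition of each A/B dose pair
       [0.60, 0.85]]

excess = [[round(bliss_excess(fa[i], fb[j], fab[i][j]), 3)
           for j in range(len(fb))]
          for i in range(len(fa))]
print(excess)
```

Scanning the excess matrix across the full dose grid is what lets a screen of this kind distinguish dose-dependent synergy from a single anomalous well.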
60. Inhibition of Ceramide Metabolism Sensitizes Human Leukemia Cells to Inhibition of BCL2-like Proteins
Casson, L.; Howell, L.; Mathews, L.A.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.M.; Thomas, C.; Varmus, H.; Siskind, L.J.; Beverly, L.J.
PLoS One, 2013, 8, e54525

[ Abstract ]
[ Link ]

The identification of novel combinations of effective cancer drugs is required for the successful treatment of cancer patients for a number of reasons. First, many “cancer specific” therapeutics display detrimental patient side-effects, and second, there are almost no examples of single-agent therapeutics that lead to cures. One strategy to decrease both the effective dose of individual drugs and the potential for therapeutic resistance is to combine drugs that regulate independent pathways that converge on cell death. BCL2-like family members are key proteins that regulate apoptosis. We conducted a screen to identify drugs that could be combined with an inhibitor of anti-apoptotic BCL2-like proteins, ABT-263, to kill human leukemia cell lines. We found that D,L-threo-1-phenyl-2-decanoylamino-3-morpholino-1-propanol (PDMP) hydrochloride, an inhibitor of glucosylceramide synthase, potently synergized with ABT-263 in the killing of multiple human leukemia cell lines. Treatment of cells with PDMP and ABT-263 led to dramatic elevation of two pro-apoptotic sphingolipids, namely ceramide and sphingosine. Furthermore, treatment of cells with the sphingosine kinase inhibitor SKi-II also dramatically synergized with ABT-263 to kill leukemia cells and similarly increased ceramides and sphingosine. Our data suggest that synergism with ABT-263 requires accumulation of ceramides and sphingosine, as AMP-deoxynojirimycin (an inhibitor of the glycosphingolipid pathway) did not elevate ceramides or sphingosine and, importantly, did not sensitize cells to ABT-263 treatment. Taken together, our data suggest that combining inhibitors of anti-apoptotic BCL2-like proteins with drugs that alter the balance of bioactive sphingolipids will be a powerful combination for the treatment of human cancers.
59. Profile of the GSK Published Protein Kinase Inhibitor Set Across ATP-dependent and -independent Luciferases: Implications for Reporter-gene Assays
Dranchak, P.; MacArthur, R.; Guha, R.; Zuercher, W.J.; Drewry, D.H.; Auld, D.S.; Inglese, J.
PLoS One, 2013, 8,
58. Genome-wide high-content RNAi screens identify regulators of Parkin upstream of mitophagy
Hasson, S.; Kane, L.; Sliter, D.; Hessa, T.; Wang, C.; Buehler, E.; Guha, R.; Martin, S.; Yamano, K.; Huang, C.H.; Heman-Ackah, S.; Youle, R.
Nature, 2013, 504, 291-295

[ Abstract ]
[DOI 10.1038/nature12748 ]

An increasing body of evidence points to mitochondrial dysfunction as a contributor to the molecular pathogenesis of neurodegenerative diseases such as Parkinson’s. Recent studies of the PD-associated genes PINK1 and Parkin suggest that they may act in a quality control pathway preventing the accumulation of dysfunctional mitochondria. Here we elucidate regulators impacting Parkin translocation to damaged mitochondria with genome-wide siRNA screens coupled to high-content microscopy. Screening yielded gene candidates involved in diverse cellular processes that were subsequently validated in confirmatory assays. This led to the characterization of TOMM7 as essential for stabilizing PINK1 on the outer mitochondrial membrane following mitochondrial damage. Additionally, we discovered that HSPA1L (an HSP70 family member) and BAG4 play mutually opposing roles in the regulation of Parkin translocation. The screens also revealed that SIAH3, found to localize to mitochondria, inhibits PINK1 accumulation after mitochondrial insult, reducing Parkin translocation. Overall, our screens provide a rich resource to understand mitochondrial quality control.
57. What are we “tweeting” About Obesity? Mapping Tweets with Topic Modeling and Geographic Information System
Ghosh, D.; Guha, R.
Cartography and GIS, 2013, 4, 90-102

[ Abstract ]
[DOI 10.1080/15230406.2013.776210 ]

Public-health-related tweets are difficult to identify in large conversational datasets like Twitter.com. Even more challenging is the visualization and analysis of the spatial patterns encoded in tweets. This study has the following objectives: how can topic modeling be used to identify relevant public health topics such as obesity on Twitter.com? What are the common obesity-related themes? What is the spatial pattern of the themes? What are the research challenges of using large conversational datasets from social networking sites? Obesity is chosen as a test theme to demonstrate the effectiveness of topic modeling using Latent Dirichlet Allocation (LDA) and spatial analysis using Geographic Information Systems (GIS). The dataset is constructed from tweets (originating from the United States) extracted from Twitter.com on obesity-related queries, such as ‘food deserts’, ‘fast food’, and ‘childhood obesity’. The tweets are also georeferenced and time stamped. Three cohesive and meaningful themes, ‘childhood obesity and schools’, ‘obesity prevention’, and ‘obesity and food habits’, are extracted from the LDA model. The GIS analysis of the extracted themes shows distinct spatial patterns between rural and urban areas, northern and southern states, and between coasts and inland states. Further, relating the themes to ancillary datasets, such as the US census and locations of fast food restaurants, based upon the location of the tweets in a GIS environment opened new avenues for spatial analysis and mapping. The techniques used in this study therefore provide a possible toolset for computational social scientists in general, and health researchers in particular, to better understand health problems from large conversational datasets.
56. Targeting IRAK1 as a Novel Therapeutic Approach for Myelodysplastic Syndrome
Rhyasen, G.W.; Bolanos, L.; Fang, J.; Rasch, C.; Jerez, A.; Varney, M.; Wunderlich, M.; Rigolino, C.; Mathews, L.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.; Thomas, C.; Beverly, L.J.; Agostino, C.; Oliva, E.N.; Cuzzola, M.; Maciejewski, J.P.; Mulloy, J.C.; Starczynowski, D.T.
Cancer Cell, 2013, 24, 90-104

[ Abstract ]
[DOI 10.1016/j.ccr.2013.05.006 ]

Myelodysplastic syndromes (MDSs) arise from a defective hematopoietic stem/progenitor cell. Consequently, there is an urgent need to develop targeted therapies capable of eliminating the MDS-initiating clones. We identified that IRAK1, an immune-modulating kinase, is overexpressed and hyperactivated in MDSs. MDS clones treated with a small-molecule IRAK1 inhibitor (IRAK1/4-Inh) exhibited impaired expansion and increased apoptosis, which coincided with TRAF6/NF-κB inhibition. Suppression of IRAK1, either by RNAi or with IRAK1/4-Inh, is detrimental to MDS cells, while sparing normal CD34+ cells. Based on an integrative gene expression analysis, we combined IRAK1 and BCL2 inhibitors and found that cotreatment more effectively eliminated MDS clones. In summary, these findings implicate IRAK1 as a druggable target in MDSs.
55. Large-Scale Screening Identifies a Novel microRNA, miR-15a-3p, which Induces Apoptosis in Human Cancer Cell Lines
Druz, A.; Chen, Y.C.; Guha, R.; Betenbaugh, M.; Martin, S.; Shiloach, J.
RNA Biology, 2013, 10, 1-14
MicroRNAs (miRNAs) have been found to be involved in cancer initiation, progression and metastasis and, as such, have been suggested as tools for cancer detection and therapy. In this work, a large-scale screening of the complete miRNA mimics library demonstrated that hsa-miR-15a-3p had a pro-apoptotic role in the following human cancer cells: HeLa, AsPC-1, MDA-MB-231, KB3, ME180, HCT-116 and A549. MiR-15a-3p is a novel member of the pro-apoptotic miRNA cluster, miR-15a/16, which was found to activate caspase-3/7 and to cause viability loss in B/CMBA.Ov cells during preliminary screening. Subsequent microarray and bioinformatics analyses identified the following four anti-apoptotic genes: bcl2l1, naip5, fgfr2 and mybl2 as possible targets for the mmu-miR-15a-3p in B/CMBA.Ov cells. Follow-up studies confirmed the pro-apoptotic role of hsa-miR-15a-3p in human cells by its ability to activate caspase-3/7, to reduce cell viability and to inhibit the expression of bcl2l1 (Bcl-xL) in HeLa and AsPC-1 cells. MiR-15a-3p was also found to reduce viability in HEK293, MDA-MB-231, KB3, ME180, HCT-116 and A549 cell lines and, therefore, may be considered for apoptosis-modulating therapies in cancers associated with high Bcl-xL expression (cervical, pancreatic, breast, lung and colorectal carcinomas). The capability of hsa-miR-15a-3p to induce apoptosis in these carcinomas may be dependent on the levels of Bcl-xL expression. The use of endogenous inhibitors of Bcl-xL and other anti-apoptotic genes such as hsa-miR-15a-3p may provide improved options for apoptosis-modulating therapies in cancer treatment compared with the use of artificial antisense oligonucleotides.
54. Cisplatin Sensitivity Mediated by WEE1 and CHK1 is Mediated by miR-155 and the miR-15 Family
Pouliot, L.M.; Chen, Y.-C.; Bai, J.; Guha, R.; Martin, S.E.; Gottesman, M.M.; Hall, M.D.
Cancer Res., 2012, 72, 5945-5955
53. Identification of Mammalian Protein Quality Control Factors by High-throughput Cellular Imaging
Pegoraro, G.; Voss, T.C.; Martin, S.E.; Tuzmen, P.; Guha, R.; Mistelli, T.
PLoS One, 2012, 7, e31684

[ Abstract ]
[DOI 10.1371/journal.pone.0031684 ]

Protein Quality Control (PQC) pathways are essential to maintain the equilibrium between protein folding and the clearance of misfolded proteins. In order to discover novel human PQC factors, we developed a high-content, high-throughput cell-based assay to assess PQC activity. The assay is based on a fluorescently tagged, temperature-sensitive PQC substrate and measures its degradation relative to a temperature-insensitive internal control. In a targeted siRNA screen of 1591 genes involved in the Ubiquitin-Proteasome System (UPS), we identified 25 of the 33 genes encoding 26S proteasome subunits and discovered several novel PQC factors. An unbiased genome-wide siRNA screen revealed the protein translation machinery, and in particular the EIF3 translation initiation complex, as a novel key modulator of misfolded protein stability. These results represent a comprehensive unbiased survey of human PQC components and establish an experimental tool for the discovery of genes that are required for the degradation of misfolded proteins under conditions of proteotoxic stress.

52. High-Throughput Screening For Genes That Prevent Excess DNA Replication In Human Cells And For Molecules That Inhibit Them
Lee, C.; Johnson, R.L.; Wichterman-Kouznetsova, J.; Guha, R.; Ferrer, M.; Tuzmen, P.; Martin, S.; Zhu, W.; Depamphilis, M.L.
Methods, 2012, 57, 234-248

[ Abstract ]
[DOI 10.1016/j.ymeth.2012.03.031 ]

High-throughput screening (HTS) provides a rapid and comprehensive approach to identifying compounds that target specific biological processes as well as genes that are essential to those processes. Here we describe a HTS assay for small molecules that induce either DNA re-replication or endoreduplication (i.e. excess DNA replication) selectively in cells derived from human cancers. Such molecules will be useful not only to investigate cell division and differentiation, but they may provide a novel approach to cancer chemotherapy. Since induction of DNA re-replication results in apoptosis, compounds that selectively induce DNA re-replication in cancer cells without doing so in normal cells could kill cancers in vivo without preventing normal cell proliferation. Furthermore, the same HTS assay can be adapted to screen siRNA molecules to identify genes whose products restrict genome duplication to once per cell division. Some of these genes might regulate the formation of terminally differentiated polyploid cells during normal human development, whereas others will prevent DNA re-replication during each cell division. Based on previous studies, we anticipate that one or more of the latter genes will prove to be essential for proliferation of cancer cells but not for normal cells, since many cancer cells are deficient in mechanisms that maintain genome stability.
51. Cheminformatics, the Computer Science of Chemical Discovery Turning Open Source
Sterling, A.; Wegner, J.K.; Guha, R.; Bender, A.; Faulon, J.; Hastings, J.; O’Boyle, N.; Overington, J.P.; Vlijmen, H.V.; Willighagen, E.
Comm. ACM, 2012, 55, 65-75

[ Abstract ]
[DOI 10.1145/2366316.2366334 ]

One of the most prominent success stories in all the sciences over the last decade has been the advance of bioinformatics: the interdisciplinary collaboration between computer scientists and molecular biologists that led to the sequencing of the human genome and other accomplishments. However, few computer scientists are familiar with a related discipline: cheminformatics, the use of computers to represent the structures of small molecules and analyze their properties. Cheminformatics has wide applicability, from drug discovery to agrochemicals and materials design. While researchers in both academia and industry have made important contributions to this field for decades, new and exciting collaborative opportunities have arisen from an “opening” of data and software as an effect of changing mindsets, policy changes, and chemists volunteering time for “Open Science”. Researchers have gained access to freely available open source software packages and open databases of tens of millions of chemicals, allowing academic chemists to confront a variety of algorithmic problems whose solutions will be critical to address current challenges, ranging from determining the behavior of small molecules in biological pathways to finding therapies for rare and neglected diseases. In this paper, we give a broad overview of the field of cheminformatics with a focus on open questions and challenges.
50. Exploring Uncharted Territories — Predicting Activity Cliffs in Structure-Activity Landscapes
Guha, R.
J. Chem. Inf. Model., 2012, 52, 2181-2191

[ Abstract ]
[DOI 10.1021/ci300047k ]

The notion of activity cliffs is an intuitive approach to characterizing structural features that play a key role in modulating the biological activity of a molecule. A variety of methods have been described to quantitatively characterize activity cliffs, such as SALI and SARI. However, these methods are primarily retrospective in nature, highlighting cliffs that are already present in the dataset. The current study focuses on employing a pairwise characterization of a dataset to train a model to predict whether a new molecule will exhibit an activity cliff with one or more members of the dataset. The approach is based on predicting a value for pairs of objects rather than the individual objects themselves (and thus allows for robust models even for small structure-activity relationship datasets). We extracted structure-activity data for several ChEMBL assays and developed random forest models to predict SALI values from pairwise combinations of molecular descriptors. The models exhibited reasonable RMSEs, though, surprisingly, performance on the more significant cliffs tended to be better than on the lesser ones. While the models do not exhibit very high levels of accuracy, our results indicate that they are able to prioritize molecules in terms of their likelihood of forming activity cliffs, thus serving as a tool to prospectively identify activity cliffs.
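The SALI values these models predict are, following Guha and Van Drie, the activity difference for a pair of molecules divided by one minus their structural similarity, so that large differences between very similar structures blow up the index. A minimal sketch, using toy bit-sets in place of real chemical fingerprints:

```python
# Minimal sketch of the Structure-Activity Landscape Index (SALI) for a
# pair of molecules: SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)), where A is
# an activity (e.g. pIC50) and sim is a structural similarity such as the
# Tanimoto coefficient on binary fingerprints. The 'fingerprints' here are
# toy sets of 'on' bits, not real chemical fingerprints.

def tanimoto(fp1, fp2):
    """Tanimoto similarity between two sets of 'on' bits."""
    inter = len(fp1 & fp2)
    union = len(fp1 | fp2)
    return inter / union if union else 0.0

def sali(act1, act2, fp1, fp2):
    """SALI value; large values flag activity cliffs (similar structures,
    very different activities). Identical structures give infinity."""
    sim = tanimoto(fp1, fp2)
    if sim >= 1.0:
        return float('inf')
    return abs(act1 - act2) / (1.0 - sim)

# Two hypothetical analogs: nearly identical fingerprints but a 3 log-unit
# activity difference, i.e. a pronounced cliff.
fp_a, fp_b = {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}
print(sali(7.5, 4.5, fp_a, fp_b))
```

In the paper's setting each such pair, described by a pairwise combination of descriptors, becomes one training instance for the random forest.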
49. Dealing with the Data Deluge: Handling the Multitude of Chemical Biology Data Sources
Guha, R.; Nguyen, D.-T.; Southall, N.; Jadhav, A.
Curr. Protocols Chem. Biol., 2012, 4, 193-209

[ Abstract ]
[DOI 10.1002/9780470559277.ch110262 ]

Over the last 20 years, there has been an explosion in the amount and type of biological and chemical data that has been made publicly available in a variety of online databases. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information is faced with a daunting task—many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. In this overview, a variety of well-known data sources for chemical and biological information are highlighted, focusing on those most useful for chemical biology research. The issue of using data from multiple sources and the associated problems such as identifier disambiguation are highlighted. A brief discussion is then provided on Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users a simple interface to search across a federated collection of resources.
48. A Furoxan-Amodiaquine Hybrid as a Potential Therapeutic for Three Parasitic Diseases
Mott, B.T.; Cheng, C.C.; Guha, R.; Kommer, V.P.; Williams, D.L.; Vermeire, J.J.; Cappello, M.; Maloney, D.J.; Rai, G.; Jadhav, A.; Simeonov, A.; Inglese, J.; Posner, G.H; Thomas, C.J.
Med. Chem. Comm., 2012, 3, 1505-1511

[ Abstract ]
[DOI 10.1039/C2MD20238G ]

Parasitic diseases continue to have a devastating impact on human populations worldwide. Lack of effective treatments, the high cost of existing ones, and frequent emergence of resistance to these agents provide a strong argument for the development of novel therapies. Here we report the results of a hybrid approach designed to obtain a dual acting molecule that would demonstrate activity against a variety of parasitic targets. The antimalarial drug amodiaquine has been covalently joined with a nitric oxide-releasing furoxan to achieve multiple mechanisms of action. Using in vitro and ex vivo assays, the hybrid molecule shows activity against three parasites — Plasmodium falciparum, Schistosoma mansoni, and Ancylostoma ceylanicum.
47. Diversity-Oriented Synthesis Yields a Novel Lead for the Treatment of Malaria
Heidebrecht, R.W.; Mulrooney, C.; Austin, C.P.; Barker, R.H.; Beaudoin, J.A.; Cheng, K.Chih-Chien.; Comer, E.; Dandapani, S.; Dick, J.; Duvall, J.R.; Ekland, E.H.; Fidock, D.A.; Fitzgerald, M.E.; Foley, M.; Guha, R.; Hinkson, P.; Kramer, M.; Lukens, A.K.; Masi, D.; Marcaurelle, L.A.; Su, X.; Thomas, C.J.; Wewer, M.; Wiegand, R.C.; Wirth, D.; Xia, M.; Yuan, J.; Zhao, J.; Palmer, M.; Munoz, B.; Schreiber, S.
ACS Med. Chem. Lett., 2012, 3, 112-117

[ Abstract ]
[ Link ]

Here, we describe the discovery of a novel antimalarial agent using phenotypic screening of Plasmodium falciparum asexual blood-stage parasites. Screening a novel compound collection created using diversity-oriented synthesis (DOS) led to the initial hit. Structure–activity relationships guided the synthesis of compounds having improved potency and water solubility, yielding a subnanomolar inhibitor of parasite asexual blood-stage growth. Optimized compound 27 has an excellent off-target activity profile in erythrocyte lysis and HepG2 assays and is stable in human plasma. This compound is available via the molecular libraries probe production centers network (MLPCN) and is designated ML238.
46. Exploiting Synthetic Lethality for the Therapy of ABC Diffuse Large B Cell Lymphoma
Yang, Y.; Shaffer, A.; Emre, N.C.Tolga; Ceribelli, M.; Zhang, M.; Wright, G.; Xiao, W.; Powell, J.; Platig, J.; Kohlhammer, H.; Young, R.; Zhao, H.; Yang, Y.; Xu, W.; Buggy, J.; Balasubramanian, S.; Mathews, L.; Shinn, P.; Guha, R.; Ferrer, M.; Thomas, C.; Waldmann, T.; Staudt, L.
Cancer Cell, 2012, 21, 723-737

[ Abstract ]
[ Link ]

45. Exploring Structure-Activity Data Using the Landscape Paradigm
Guha, R.
WIREs Comput. Mol. Sci., 2012, 2, 829-841

[ Abstract ]
[DOI 10.1002/wcms.1087 ]

In this article, we present an overview of the origin and applications of the activity landscape view of structure–activity relationship (SAR) data as conceived by Shanmugasundaram and Maggiora. Within this landscape, different regions exemplify different aspects of SAR trends—ranging from smoothly varying trends to discontinuous trends (also termed activity cliffs). We discuss the various definitions of landscapes and cliffs that have been proposed as well as different approaches to the numerical quantification of a landscape. We then highlight some of the landscape visualization approaches that have been developed, followed by a review of the various applications of activity landscapes and cliffs to topics in medicinal chemistry and SAR analysis.
44. A 1536-well Quantitative High Throughput Screen to Identify Compounds Targeting Cancer Stem Cells
Mathews, L.A.; Keller, J.M.; Goodwin, B.; Guha, R.; Shinn, P.; Mull, R.; Thomas, C.; de Kluyver, R.; Sayers, T.; Ferrer, M.
J. Biomol. Screen., 2012, 17, 1231-1242

[ Abstract ]
[DOI 10.1177/1087057112458152 ]

Tumor cell subpopulations called cancer stem cells (CSCs) or tumor-initiating cells (TICs) have self-renewal potential and are thought to drive metastasis and tumor formation. Data suggest that these cells are resistant to current chemotherapy and radiation therapy treatments, leading to cancer recurrence. Therefore, finding new drugs and/or drug combinations that cause death of both the differentiated tumor cells as well as CSC populations is a critical unmet medical need. Here, we describe how cancer-derived CSCs are generated from cancer cell lines using stem cell growth media and nonadherent conditions in quantities that enable high-throughput screening (HTS). A cell growth assay in a 1536-well microplate format was developed with these CSCs and used to screen a focused collection of oncology drugs and clinical candidates to find compounds that are cytotoxic against these highly aggressive cells. A hit selection process that included potency and efficacy measurements during the primary screen allowed us to efficiently identify compounds with potent cytotoxic effects against spheroid-derived CSCs. Overall, this research demonstrates one of the first miniaturized HTS assays using CSCs. The procedures described here should enable further testing of the effect of compounds on CSCs and help determine which pathways need to be targeted to kill them.
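The hit-selection process described above, which considers both potency and efficacy in the primary screen, can be sketched as a simple two-threshold filter. The cutoffs and compound records below are invented for illustration, not the assay's actual values:

```python
# Hedged sketch of two-criterion hit selection: keep compounds whose fitted
# AC50 (potency) is at or below a cutoff AND whose maximal response
# (efficacy) meets a cutoff. All thresholds and records are invented.

def select_hits(results, ac50_cutoff_um=1.0, min_efficacy_pct=70.0):
    """Return compound ids passing both potency and efficacy criteria."""
    return [r['id'] for r in results
            if r['ac50_um'] <= ac50_cutoff_um
            and r['efficacy_pct'] >= min_efficacy_pct]

screen = [
    {'id': 'cmpd-1', 'ac50_um': 0.05, 'efficacy_pct': 95.0},  # potent and efficacious
    {'id': 'cmpd-2', 'ac50_um': 0.20, 'efficacy_pct': 40.0},  # potent, weak efficacy
    {'id': 'cmpd-3', 'ac50_um': 8.00, 'efficacy_pct': 90.0},  # efficacious, weak potency
]
print(select_hits(screen))
```

Requiring both criteria is what filters out compounds that merely shift the curve without actually killing the spheroid-derived cells.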
43. A Survey of Quantitative Descriptions of Molecular Structure
Guha, R.; Willighagen, E.L.
Curr. Topics Med. Chem., 2012, 12, 1946-1956

[ Abstract ]
[DOI 10.2174/156802612804910278 ]

Numerical characterization of molecular structure is a first step in many computational analyses of chemical structure data. These numerical representations, termed descriptors, come in many forms, ranging from simple atom counts and invariants of the molecular graph to distributions of properties, such as charge, across a molecular surface. In this article we first present a broad categorization of descriptors and then describe applications and toolkits that can be employed to evaluate them. We highlight a number of issues surrounding molecular descriptor calculations, such as versioning and reproducibility, and describe how some toolkits have attempted to address these problems.
42. Genome-Wide RNAi Screen For Lysosomal Storage Disorders
Velayati, A.; Tuzmen, P.; Guha, R.; Martin, S.; Goldin, E.; Sidransky, E.
Molecular Genetics and Metabolism, 2012, 105, S63
41. Chemical Genomic Profiling for Antimalarial Therapies, Response Signatures, and Molecular Targets
Yuan, J.; Cheng, K.Chih-Chien.; Johnson, R.L.; Huang, R.; Pattaradilokrat, S.; Liu, A.; Guha, R.; Fidock, D.A.; Inglese, J.; Wellems, T.E.; Austin, C.P.; Su, X.
Science, 2011, 333, 724-729

[ Abstract ]
[DOI 10.1126/science.1205216 ]

Malaria remains a devastating disease largely because of widespread drug resistance. New drugs and a better understanding of the mechanisms of drug action and resistance are essential for fulfilling the promise of eradicating malaria. Using high-throughput chemical screening and genome-wide association analysis, we identified 32 highly active compounds and genetic loci associated with differential chemical phenotypes (DCPs), defined as greater than or equal to fivefold differences in half-maximum inhibitory concentration (IC(50)) between parasite lines. Chromosomal loci associated with 49 DCPs were confirmed by linkage analysis and tests of genetically modified parasites, including three genes that were linked to 96% of the DCPs. Drugs whose responses mapped to wild-type or mutant pfcrt alleles were tested in combination in vitro and in vivo, which yielded promising new leads for antimalarial treatments.
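The DCP definition above (at least a fivefold IC(50) difference between parasite lines) reduces to a max/min ratio test over each compound's per-line potencies. A sketch with invented IC(50) values (the line names Dd2, HB3 and 7G8 are real P. falciparum lines, used here only for illustration):

```python
# Sketch of flagging "differential chemical phenotypes" (DCPs) as defined
# above: a compound is a DCP if its IC50s across parasite lines span at
# least a fivefold range (max/min ratio >= 5). IC50 values are invented.

def is_dcp(ic50_by_line, fold_cutoff=5.0):
    """True if the spread of IC50s (in uM) across lines meets the cutoff."""
    values = list(ic50_by_line.values())
    return max(values) / min(values) >= fold_cutoff

compound_a = {'Dd2': 0.10, 'HB3': 0.90, '7G8': 0.15}   # 9-fold spread -> DCP
compound_b = {'Dd2': 0.40, 'HB3': 0.60, '7G8': 0.50}   # 1.5-fold spread -> not
print(is_dcp(compound_a), is_dcp(compound_b))
```

Each DCP flagged this way becomes a differential phenotype that can then be mapped against the lines' genotypes in the association analysis.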
40. KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries
Saubern, S.; Guha, R.; Baell, J.B.
Mol. Inf., 2011, 30, 847-850
39. Open Data, Open Source and Open Standards in Chemistry: The Blue Obelisk Five Years On
O’Boyle, N.; Guha, R.; Willighagen, E.; Adams, S.E.; Alvarsson, J.; Bradley, J.C.; Filippov, I.; Hanson, R.M.; Hanwell, M.D.; Hutchison, G.R.; James, C.A.; Jeliazkova, N.; Lang, A.; Langner, K.M.; Lonie, D.C.; Lowe, D.M.; Pansanel, J.; Pavlov, D.; Spjuth, O.; Steinbeck, C.; Tenderholt, A.; Theisen, K.; Murray-Rust, P.
J. Cheminf., 2011, 3,

[ Abstract ]
[DOI 10.1186/1758-2946-3-37 ]

Background
The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards.

Results
This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry.

Conclusions
We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

38. Exploratory Analysis of Kinetic Solubility Measurements of a Small Molecule Library
Guha, R.; Dexheimer, T.S.; Kestranek, A.N.; Jadhav, A.; Chervenak, A.M.; Ford, M.G.; Simeonov, A.; Roth, G.P.; Thomas, C.J.
Bioorg. Med. Chem., 2011, 19, 4127-4134

[ Abstract ]
[DOI 10.1016/j.bmc.2011.05.005 ]

Kinetic solubility measurements using prototypical assay buffer conditions are presented for a ~58,000 member library of small molecules. Analyses of the data based upon physical and calculated properties of each individual molecule were performed and resulting trends were considered in the context of commonly held opinions of how physicochemical properties influence aqueous solubility. We further analyze the data using a decision tree model for solubility prediction and via a multi-dimensional assessment of physicochemical relationships to solubility in the context of specific ‘rule-breakers’ relative to common dogma. The role of solubility as a determinant of assay outcome is also considered based upon each compound’s cross-assay activity score for a collection of publicly available screening results. Further, the role of solubility as a governing factor for colloidal aggregation formation within a specified assay setting is examined and considered as a possible cause of a high cross-assay activity score. The results of this solubility profile should aid chemists during library design and optimization efforts and represent a useful training set for computational solubility prediction.
37. RNAi Screening Identifies TAK1 as a Potential Target for the Enhanced Efficacy of Topoisomerase Inhibitors
Martin, S.E.; Wu, Z.H.; Gehlhaus, K.Jones.; Zhang, Y.W.; Guha, R.; Miyamoto, S.; Pommier, Y.; Caplen, N.J.
Curr. Cancer Drug Targets, 2011, 11, 976-986

[ Abstract ]
[DOI 10.1002/cmdc.201100179 ]

In an effort to develop strategies that improve the efficacy of existing anticancer agents, we have conducted a siRNA-based RNAi screen to identify genes that, when targeted by siRNA, improve the activity of the topoisomerase I (Top1) poison camptothecin (CPT). Screening was conducted using a set of siRNAs corresponding to over 400 apoptosis-related genes in MDA-MB-231 breast cancer cells. During the course of these studies, we identified the silencing of MAP3K7 as a significant enhancer of CPT activity. Follow-up analysis of caspase activity and caspase-dependent phosphorylation of histone H2AX demonstrated that the silencing of MAP3K7 enhanced CPT-associated apoptosis. Silencing MAP3K7 also sensitized cells to additional compounds, including CPT clinical analogs. This activity was not restricted to MDA-MB-231 cells, as the silencing of MAP3K7 also sensitized the breast cancer cell line MDA-MB-468 and HCT-116 colon cancer cells. However, MAP3K7 silencing did not affect compound activity in the comparatively normal mammary epithelial cell line MCF10A, as well as some additional tumorigenic lines. MAP3K7 encodes the TAK1 kinase, an enzyme that is central to the regulation of many processes associated with the growth of cancer cells (e.g. NF-κB, JNK, and p38 signaling). An analysis of TAK1 signaling pathway members revealed that the silencing of TAB2 also sensitizes MDA-MB-231 and HCT-116 cells towards CPT. These findings may offer avenues towards lowering the effective doses of Top1 inhibitors in cancer cells and, in doing so, broaden their application.
36. Improving Usability and Accessibility of Cheminformatics Tools for Chemists Through Cyberinfrastructure and Education
Guha, R.; Wiggins, G.D.; Wild, D.J.; Baik, M.H.; Pierce, M.E.; Fox, G.C.
In Silico Biol., 2011, 11, 41-60
35. Discovery of New Antimalarial Chemotypes Through Chemical Methodology and Library Development
Brown, L.E.; Chih-Chien Cheng, K.; Wei, W.; Yuan, P.; Dai, P.; Trilles, R.; Ni, F.; Yuan, J.; MacArthur, R.; Guha, R.; Johnson, R.L.; Su, X.; Dominguez, M.M.; Snyder, J.K.; Beeler, A.B.; Schaus, S.E.; Inglese, J.; Porco, J.
Proc. Nat. Acad. Sci., 2011, 108, 6775-6780

[ Abstract ]
[DOI 10.1073/pnas.1017666108 ]

In an effort to expand the stereochemical and structural complexity of chemical libraries used in drug discovery, the Center for Chemical Methodology and Library Development at Boston University has established an infrastructure to translate methodologies accessing diverse chemotypes into arrayed libraries for biological evaluation. In a collaborative effort, the NIH Chemical Genomics Center determined IC(50) values for Plasmodium falciparum viability for each of 2,070 members of the CMLD-BU compound collection using quantitative high-throughput screening across five parasite lines of distinct geographic origin. Three compound classes displaying either differential or comprehensive antimalarial activity across the lines were identified, and the nascent structure-activity relationships (SAR) from this experiment were used to initiate optimization of these chemotypes for further development.
34. Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets
Guha, R.; Gilbert, K.; Fox, G.C.; Pierce, M.; Wild, D.; Yuan, H.
Curr. Comp. Aid. Drug Des., 2010, 6, 50-67
In recent years, there has been an explosion in the availability of publicly accessible chemical information, including chemical structures of small molecules, structure-derived properties and associated biological activities in a variety of assays. These data sources present us with a significant opportunity to develop and apply computational tools to extract and understand the underlying structure-activity relationships. Furthermore, by integrating chemical data sources with biological information (protein structure, gene expression and so on), we can attempt to build up a holistic view of the effects of small molecules in biological systems. Equally important is the ability for non-experts to access and utilize state-of-the-art cheminformatics methods and models. In this review we present recent developments in cheminformatics methodologies and infrastructure that provide a robust, distributed approach to mining large and complex chemical datasets. In the area of methodology development, we highlight recent work on characterizing structure-activity landscapes, QSAR model domain applicability and the use of chemical similarity in text mining. In the area of infrastructure, we discuss a distributed web services framework that allows easy deployment and uniform access to computational (statistics, cheminformatics and computational chemistry) methods, data and models. We also discuss the development of PubChem-derived databases and highlight techniques that allow us to scale the infrastructure to extremely large compound collections by use of distributed processing on Grids. Given that the above work is applicable to arbitrary types of cheminformatics problems, we also present some case studies related to virtual screening for anti-malarials and predictions of anticancer activity.
33. Use of Genetic Algorithm and Neural Network Approaches for Risk Factor Selection: A Case Study of West Nile Virus Dynamics in an Urban Environment
Ghosh, D.; Guha, R.
Computers, Environment and Urban Systems, 2010, 34, 189-203

[ Abstract ]
[DOI 10.1016/j.compenvurbsys.2010.02.007 ]

The West Nile virus (WNV) is an infectious disease spreading rapidly throughout the United States, causing illness among thousands of birds, animals, and humans. Yet we have only a rudimentary understanding of how the mosquito-borne virus operates in complex avian–human environmental systems coupled with risk factors. The large array of multidimensional risk factors underlying WNV incidence spans environmental and built-environment conditions, socioeconomic characteristics, and existing mosquito abatement policies. It is therefore essential to identify an optimal number of risk factors whose management would result in effective disease prevention and containment. Previous models built to select important risk factors assumed a priori that there is a linear relationship between these risk factors and disease incidence. However, it is difficult for linear models to incorporate the complexity of the WNV transmission network and hence identify an optimal number of risk factors objectively.
This paper has two objectives: first, to use a combination of genetic algorithm (GA) and computational neural network (CNN) approaches to build a model incorporating the non-linearity between incidences and hypothesized risk factors. Here GA is used for risk factor (variable) selection and CNN for model building, mainly because of their ability to capture complex relationships with higher accuracy than linear models. The second objective is to propose a method to measure the relative importance of the selected risk factors included in the model. The study is situated in the metropolitan area of Minnesota, which has experienced significant outbreaks from 2002 to the present.
31. Towards interoperable and reproducible QSAR analyses: Exchange of datasets.
Spjuth, O.; Willighagen, E.L.; Guha, R.; Eklund, M.; Wikberg, J.
Journal of Cheminformatics, 2010, 2,

[ Abstract ]
[DOI 10.1186/1758-2946-2-5 ]

Background: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
30. PubChem as a Source of Polypharmacology
Chen, B.; Wild, D.; Guha, R.
J. Chem. Inf. Model., 2009, 49, 2044-2055

[ Abstract ]
[DOI 10.1021/ci9001876 ]

Polypharmacology provides a new way to address the issue of high attrition rates arising from lack of efficacy and toxicity. However, the development of polypharmacology is hampered by incomplete SAR data and limited resources for validating target combinations. The PubChem bioassay collection, reporting the activity of compounds in multiple assays, allows us to study polypharmacological behavior in the PubChem collection via cross-assay analysis. In this paper, we developed a network representation of the assay collection and then applied a bipartite mapping between this network and various biological networks (i.e., PPI, pathway) as well as artificial networks (i.e., drug-target network). Mapping to a drug-target network allows us to prioritize new selective compounds, while mapping to other biological networks enables us to observe interesting target pairs and their associated compounds in the context of biological systems. Our results indicate this approach could be a useful way to investigate polypharmacology in the PubChem bioassay collection.
29. Chemoinformatic Analysis of Drugs, Natural Products, Molecular Libraries Small Molecule Repository and Combinatorial Libraries
Singh, N.; Guha, R.; Guilianotti, M.; Houghten, R.; Medina-Franco, J.L.
J. Chem. Inf. Model., 2009, 49, 1010-1024

[ Abstract ]
[DOI 10.1021/ci800426u ]

A multiple-criteria approach is presented and used to perform a comparative analysis of four recently developed combinatorial libraries against drugs, the Molecular Libraries Small Molecule Repository (MLSMR) and natural products. The compound databases were assessed in terms of physicochemical properties, scaffolds, and fingerprints. The approach enables the analysis of property space coverage, degree of overlap between collections, scaffold and structural diversity, and overall structural novelty. The degree of overlap between combinatorial libraries and drugs was assessed using the R-NN curve methodology, which measures the density of chemical space around a query molecule embedded in the chemical space of a target collection. The combinatorial libraries studied in this work exhibit scaffolds that were not observed in the drug, MLSMR, and natural products databases. The fingerprint-based comparisons indicate that these combinatorial libraries are structurally different from current drugs. The R-NN curve methodology revealed that a proportion of molecules in the combinatorial libraries is located within the property space of the drugs. However, the R-NN analysis also showed that there are a significant number of molecules in several combinatorial libraries that are located in sparse regions of the drug space.
28. Navigating Structure Activity Landscapes
Bajorath, J.; Peltason, L.; Wawer, M.; Guha, R.; Lajiness, M.S.; van Drie, J.H.
Drug Discov. Today, 2009, 14, 698-705

[ Abstract ]
[DOI 10.1016/j.drudis.2009.04.003 ]

The problem of how to systematically explore structure-activity relationships (SARs) is still largely unsolved in medicinal chemistry. Recently, data analysis tools have been introduced to navigate activity landscapes and assess structure-activity relationships on a large scale. Initial investigations reveal a surprising heterogeneity among SARs and shed light on the relationship between 'global' and 'local' SAR features. Moreover, insights are provided into the fundamental issue of why modeling tools work well in some cases, but not in others.
27. Pharmacophore Representation and Searching
Guha, R.; Van Drie, J.H.
CDK News, 2008, ASAP

[ Abstract ]
[ Link ]

In this article we describe the design and use of a set of Java classes to represent pharmacophores and use such representations in pharmacophore searching applications.
26. Assessing How Well a Modeling Protocol Captures a Structure-Activity Landscape
Guha, R.; Van Drie, J.H.
J. Chem. Inf. Model., 2008, 48, 1716-1728

[ Abstract ]
[DOI 10.1021/ci8001414 ]

We introduce the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol, applied to structure-activity relationships. We start from our earlier work [J. Chem. Inf. Model., 2008, 48, 646-658], where we show how to study a structure-activity relationship pairwise, based on the notion of “activity cliffs” – pairs of molecules that are structurally similar but have large differences in activity. There, we also introduced the SALI parameter, which allows one to identify cliffs easily, and which allows one to represent a structure-activity relationship as a graph. This graph orders every pair of molecules by their activity. Here, we introduce the new idea of a SALI curve, which tallies how many of these orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over two-dimensional quantitative structure-activity relationship (2D-QSAR), three-dimensional quantitative structure-activity relationship (3D-QSAR), and structure-based design models, the utility of a model seems to correspond to characteristics of these curves. In particular, the integral of these curves, denoted SCI, a number ranging from -1.0 to 1.0, approaches a value of 1.0 for two literature models, both known to be prospectively useful.
25. The Structure-Activity Landscape Index: Identifying and Quantifying Activity-Cliffs
Guha, R.; Van Drie, J.H.
J. Chem. Inf. Model., 2008, 48, 646-658

[ Abstract ]
[DOI 10.1021/ci7004093 ]

A new method for analyzing a structure-activity relationship is proposed. By use of a simple quantitative index, one can readily identify “structure-activity cliffs”: pairs of molecules which are most similar but have the largest change in activity. We show how this provides a graphical representation of the entire SAR, in a way that allows the salient features of the SAR to be quickly grasped. In addition, the approach allows us to view the SARs in a data set at different levels of detail. The method is tested on two data sets that highlight its ability to easily extract SAR information. Finally, we demonstrate that this method is robust using a variety of computational control experiments and discuss possible applications of this technique to QSAR model evaluation.
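The index described above has a simple closed form: for a pair of molecules i and j with activities A_i and A_j, SALI(i, j) = |A_i − A_j| / (1 − sim(i, j)), so structurally similar pairs with large activity differences score highest. A minimal Python sketch, assuming fingerprints represented as sets of on-bits and Tanimoto similarity (the function names and toy data are illustrative, not from the paper):

```python
from itertools import combinations

def tanimoto(fp1, fp2):
    """Tanimoto similarity between fingerprints given as sets of on-bits."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0

def sali(fps, activities, eps=1e-6):
    """SALI for every pair: |Ai - Aj| / (1 - sim(i, j)).
    eps avoids division by zero for structurally identical pairs."""
    return {(i, j): abs(activities[i] - activities[j])
                    / (1.0 - tanimoto(fps[i], fps[j]) + eps)
            for i, j in combinations(range(len(fps)), 2)}

# Toy data: molecules 0 and 1 have similar fingerprints but very different
# activities, so pair (0, 1) should surface as the steepest activity cliff.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
acts = [8.2, 4.1, 4.3]
scores = sali(fps, acts)
cliff = max(scores, key=scores.get)
```

Sorting the pairs by score gives exactly the cliff-first ordering the paper uses to build its graphical SAR representation.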
24. A Flexible Web Service Infrastructure for the Development and Deployment of Predictive Models
Guha, R.
J. Chem. Inf. Model., 2008, 48, 456-464

[ Abstract ]
[DOI 10.1021/ci700188u ]

The development of predictive statistical models is a common task in the field of drug design. The process of developing such models involves two main steps: building the model and then deploying it. Traditionally such models have been deployed via web page interfaces, which restrict the user to the specified page; using the model in other ways can be cumbersome. In this paper we present a flexible and generalizable approach to the deployment of predictive models, based on a web service infrastructure using R. The infrastructure described allows one to access the functionality of these models using a variety of approaches, ranging from web pages to workflow tools. We highlight the advantages of this infrastructure by developing and subsequently deploying random forest models for two datasets.
23. On the Interpretation and Interpretability of QSAR Models
Guha, R.
J. Comp. Aid. Molec. Des., 2008, 22, 857-871

[ Abstract ]
[DOI 10.1007/s10822-008-9240-5 ]

The goal of a quantitative structure–activity relationship (QSAR) model is to encode the relationship between molecular structure and biological activity or physical property. Based on this encoding, such models can be used for predictive purposes. Assuming the use of relevant and meaningful descriptors, and a statistically significant model, extraction of the encoded structure–activity relationships (SARs) can provide insight into what makes a molecule active or inactive. Such analyses of QSAR models are useful in a number of scenarios, such as suggesting structural modifications to enhance activity, explanation of outliers and exploratory analysis of novel SARs. In this paper we discuss the need for interpretation and give an overview of the factors that affect the interpretability of QSAR models. We then describe interpretation protocols for different types of models, highlighting the different types of interpretations, ranging from very broad, global trends to very specific, case-by-case descriptions of the SAR, using examples from the training set. Finally, we discuss a number of case studies where workers have provided some form of interpretation of a QSAR model.
22. Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays
Guha, R.; Schürer, S.C.
J. Comp. Aid. Molec. Des., 2008, 22, 367-384

[ Abstract ]
[DOI 10.1007/s10822-008-9192-9 ]

Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.
21. Userscripts for the Life Sciences
Willighagen, E.L.; O’Boyle, N.; Gopalakrishnan, H.; Jiao, D.; Guha, R.; Steinbeck, C.; Wild, D.J.
BMC Bioinformatics, 2007, 8, 487

[ Abstract ]
[DOI 10.1186/1471-2105-8-487 ]

The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.
20. Chemical Data Mining of the NCI Human Tumor Cell Line Database
Wang, H.; Klinginsmith, J.; Dong, X.; Lee, A.; Guha, R.; Wu, Y.; Crippen, G.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 2063-2076

[ Abstract ]
[DOI 10.1021/ci700141x ]

The NCI Developmental Therapeutics Program Human Tumor cell line data set is a publicly available database that contains cellular assay screening data for over 40 000 compounds tested in 60 human tumor cell lines. The database also contains microarray assay gene expression data for the cell lines, and so it provides an excellent information resource particularly for testing data mining methods that bridge chemical, biological, and genomic information. In this paper we describe a formal knowledge discovery approach to characterizing and data mining this set and report the results of some of our initial experiments in mining the set from a chemoinformatics perspective.
19. Counting Clusters Using R-NN Curves
Guha, R.; Dutta, D.; Chen, T.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 1308-1318

[ Abstract ]
[DOI 10.1021/ci600541f ]

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for non-hierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the dataset which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k, as well as with similar values, to check that the correct number of clusters was obtained. In addition we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical datasets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.
18. A Web Service Infrastructure for Chemoinformatics
Dong, X.; Gilbert, K.; Guha, R.; Heiland, R.; Kim, J.; Pierce, M.; Fox, G.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 1303-1307

[ Abstract ]
[DOI 10.1021/ci6004349 ]

The vast increase of pertinent information available to drug discovery scientists means that there is strong demand for tools and techniques for organizing and intelligently mining this information for manageable human consumption. At Indiana University, we have developed an infrastructure of chemoinformatics web services that simplify the access to this information and the computational techniques that can be applied to it. In this paper, we describe this infrastructure, give some examples of its use, and then discuss our plans to use it as a platform for chemoinformatics application development in the future.
17. Ensemble Feature Selection: Consistent Descriptor Subsets for Multiple QSAR Models
Dutta, D.; Guha, R.; Chen, T.; Wild, D.J.
J. Chem. Inf. Model., 2007, 47, 989-997

[ Abstract ]
[DOI 10.1021/ci600563w ]

Selecting a small subset of descriptors from a large pool to build a predictive QSAR model is an important step in the QSAR modeling process. In general subset selection is very hard to solve, even approximately, with guaranteed performance bounds. Traditional approaches employ deterministic or stochastic methods to obtain a descriptor subset that leads to an optimal model of a single type (such as linear regression or a neural network). With the development of ensemble modeling approaches, multiple models of differing types are individually developed resulting in different descriptor subsets for each model type. However it is advantageous, from the point of view of developing interpretable QSAR models, to have a single set of descriptors that can be used for different model types. In this paper, we describe an approach to the selection of a single, optimal, subset of descriptors for multiple model types. We apply this approach to three datasets, covering both regression and classification, and show that the constraint of forcing different model types to use the same set of descriptors does not lead to a significant loss in predictive ability for the individual models considered. In addition, interpretations of the individual models developed using this approach indicate that they encode similar structure-activity trends.
16. Chemical Informatics Functionality in R
Guha, R.
J. Stat. Soft., 2007, 18,

[ Abstract ]
[ Link ]

The flexibility and scope of the R programming environment has made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as contained in molecular descriptors.

We describe the rcdk package that provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors and evaluate fingerprints. In addition, we describe the rpubchem package, which allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently the package allows access to structural information as well as some simple molecular properties from PubChem. The package also allows access to bio-assay data from the PubChem FTP servers.

15. Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions.
Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.
J. Chem. Inf. Model., 2006, 46, 1836-1847

[ Abstract ]
[DOI 10.1021/ci060064e ]

Traditional quantitative structure-activity relationship (QSAR) models aim to capture global structure-activity trends present in a data set. In many situations, there may be groups of molecules which exhibit a specific set of features which relate to their activity or inactivity. Such a group of features can be said to represent a local structure-activity relationship. Traditional QSAR models may not recognize such local relationships. In this work, we investigate the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood, rather than considering the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. In the first case, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units versus 0.92 log units for the global model. However, LLR was able to characterize a specific group of anomalous molecules with much better accuracy (0.64 log units versus 0.70 log units for the global model). For the second data set, the LLR technique resulted in a decrease in RMSE from 0.36 log units to 0.31 log units for the external prediction set. In the third case, we obtained an RMSE of 2.01 log units versus 2.16 log units for the global model. In all cases, LLR led to a few observations being poorly predicted compared to the global model. We present an analysis of why this was observed and possible improvements to the local regression approach.
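The core idea of local lazy regression can be sketched in a few lines: given a query molecule, select its k nearest neighbors in descriptor space and fit a model on those neighbors alone, never building a global model. Below is a toy one-descriptor version using a simple local linear fit; the helper names and data are illustrative, and the published method considers richer local models and neighborhood selection:

```python
def knn_indices(xs, query, k):
    """Indices of the k training points nearest to the query (1D descriptor)."""
    return sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]

def local_lazy_predict(train_x, train_y, query, k=5):
    """Fit a simple linear model on the query's k nearest neighbours only,
    then evaluate it at the query point; no global model is ever built."""
    idx = knn_indices(train_x, query, k)
    xs = [train_x[i] for i in idx]
    ys = [train_y[i] for i in idx]
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate neighbourhood: fall back to the local mean
        return my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my + slope * (query - mx)

# Toy data: y = 2x near the query; the distant point (10, -5) is never
# consulted because it falls outside the local neighbourhood.
X = [0.0, 1.0, 2.0, 3.0, 4.0, 10.0]
Y = [0.0, 2.0, 4.0, 6.0, 8.0, -5.0]
pred = local_lazy_predict(X, Y, query=2.5, k=4)
```

Because the fit is deferred until a query arrives, anomalous regions of the dataset (like the stray point above) influence only the queries that actually fall near them.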
14. R-NN Curves: An Intuitive Approach to Outlier Detection Using a Distance Based Method
Guha, R.; Dutta, D.; Jurs, P.C; Chen, T.
J. Chem. Inf. Model., 2006, 46, 1713-1722

[ Abstract ]
[DOI 10.1021/ci060013h ]

Libraries of chemical structures are used in a variety of cheminformatics tasks such as virtual screening and QSAR modeling and are generally characterized using molecular descriptors. When working with libraries it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general and outlier detection in particular based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves thus allowing identification of outliers in a single plot.
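The R-NN construction itself is straightforward: for a given compound, count how many other compounds lie within each of a series of increasing radii. Compounds in dense regions accumulate neighbors at small radii, while outliers produce flat curves that only rise near the maximum radius. A small self-contained sketch (function and variable names are illustrative):

```python
def rnn_curve(points, idx, n_radii=10):
    """Neighbour counts for points[idx] within a series of increasing radii,
    the radii being fixed fractions of the maximum pairwise distance."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    dmax = max(dist(p, q) for p in points for q in points)
    ds = [dist(points[idx], p) for i, p in enumerate(points) if i != idx]
    radii = [dmax * ((r + 1) / n_radii) for r in range(n_radii)]
    return [sum(1 for d in ds if d <= r) for r in radii]

# A tight cluster plus one distant point: the cluster member picks up
# neighbours at the smallest radius, while the outlier's curve stays flat
# and only rises at the maximum radius.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
member = rnn_curve(pts, 0)
outlier = rnn_curve(pts, 4)
```

Plotting such curves for every compound makes sparse-region members stand out visually, which is the basis of the outlier detection described in the abstract.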
13. The Blue Obelisk–Interoperability in Chemical Informatics.
Guha, R.; Howard, M.T.; Hutchison, G.R.; Murray-Rust, P.; Rzepa, H.; Steinbeck, C.; Wegner, J.; Willighagen, E.L.
J. Chem. Inf. Model., 2006, 46, 991-998

[ Abstract ]
[DOI 10.1021/ci050400b ]

The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complementary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of algorithms and implementations in chemoinformatics algorithms drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.
12. Scalable Partitioning and Exploration of Chemical Spaces using Geometric Hashing
Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T.
J. Chem. Inf. Model., 2006, 46, 321-333

[ Abstract ]
[DOI 10.1021/ci050403o ]

Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4,337 to 249,071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
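As a rough illustration of the locality-sensitive hashing idea, the sketch below uses the random-hyperplane hash family for cosine similarity: nearby vectors tend to fall on the same side of each random hyperplane and therefore share a bucket, so a bucket lookup yields approximate nearest-neighbor candidates in sublinear time. This is one common LSH family, not necessarily the exact scheme used in the paper, and all names here are illustrative:

```python
import random
from collections import defaultdict

def signature(vec, planes):
    """Bit signature: which side of each random hyperplane the vector lies on."""
    return tuple(1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def build_lsh(points, n_planes=8, seed=7):
    """Hash every point into a bucket keyed by its hyperplane signature."""
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_planes)]
    buckets = defaultdict(list)
    for i, p in enumerate(points):
        buckets[signature(p, planes)].append(i)
    return planes, buckets

def candidates(query, planes, buckets):
    """Approximate nearest-neighbour candidates: points in the query's bucket."""
    return buckets.get(signature(query, planes), [])

# Two vectors pointing in nearly the same direction and one pointing the
# opposite way: the opposite vector flips every signature bit, so it can
# never land in the query's bucket.
pts = [(1.0, 0.1), (0.98, 0.12), (-1.0, -0.1)]
planes, buckets = build_lsh(pts)
cands = candidates((1.0, 0.1), planes, buckets)
```

In practice several independent hash tables are combined to trade memory for recall, which is what makes the approach parallelize so naturally.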
11. Generating, Using and Visualizing Molecular Information in R
Guha, R.
R News, 2006, 3, 28-33

[ Abstract ]
[ Link ]

10. Validation of the CDK Surface Area Routine
Guha, R.
CDK News, 2006, 3, 5-9
9. Recent Developments of the Chemistry Development Kit (CDK) – An Open-Source Java Library for Chemo- and Bioinformatics
Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E.L.
Curr. Pharm. Des., 2006, 12, 2110-2120

[ Abstract ]
[DOI 10.2174/138161206777585274 ]

The Chemistry Development Kit (CDK) provides methods for common tasks in molecular informatics, including 2D and 3D rendering of chemical structures, I/O routines, SMILES parsing and generation, ring searches, isomorphism checking, structure diagram generation, etc. Implemented in Java, it is used both for server-side computational services, possibly equipped with a web interface, as well as for applications and client-side applets. This article introduces the CDK’s new QSAR capabilities and the recently introduced interface to statistical software.
8. Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases
Guha, R.; Stanton, D.T.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 1109-1121

7. Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance
Guha, R.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 800-806

[ Abstract ]
[DOI 10.1021/ci050022a ]

We present a method to measure the relative importance of the descriptors present in a QSAR model developed with a computational neural network (CNN). The approach is based on a sensitivity analysis of the descriptors. We tested the method on three published data sets for which linear and CNN models were previously built. The original work reported interpretations for the linear models, and we compare the results of the new method to the importance of descriptors in the linear models as described by a PLS technique. The results indicate that the proposed method is able to rank descriptors such that important descriptors in the CNN model correspond to the important descriptors in the linear model.
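The sensitivity analysis described in the abstract can be illustrated independently of any particular network: perturb one descriptor at a time and measure how much the model output moves. The function below is a generic sketch, not the paper's code; `model` stands for any callable mapping a descriptor vector to a prediction, and the perturbation size `delta` is an assumed parameter.

```python
def rank_descriptors(model, X, delta=0.05):
    # mean absolute change in the model output when descriptor j is
    # nudged by delta with all other descriptors held fixed; larger
    # changes mean the model is more sensitive to that descriptor
    base = [model(row) for row in X]
    n_desc = len(X[0])
    sensitivity = []
    for j in range(n_desc):
        total = 0.0
        for row, y0 in zip(X, base):
            bumped = list(row)
            bumped[j] += delta
            total += abs(model(bumped) - y0)
        sensitivity.append(total / len(X))
    # descriptor indices, most important first
    return sorted(range(n_desc), key=lambda j: -sensitivity[j])
```

For a CNN the same loop applies with the trained network as `model`; the ranking can then be compared against the PLS importances from the linear model, which is the comparison the paper reports.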
6. Determining the Validity of a QSAR Model–A Classification Approach
Guha, R.; Jurs, P.C.
J. Chem. Inf. Model., 2005, 45, 65-73

[ Abstract ]
[DOI 10.1021/ci0497511 ]

The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previously built QSAR model. In this study we focus on linear regression models only, though the technique is general and could also be applied to other types of quantitative models. Our technique is based on a classification method that divides regression residuals from a previously generated model into a good class and bad class and then builds a classifier based on this division. The trained classifier is then used to determine the class of the residual for a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and a hand built data set. The data sets selected covered both physical and biological properties and also presented the methodology with quantitative regression models of varying quality. The results indicate that this technique can determine whether a new compound will be well or poorly predicted with weighted success rates ranging from 73% to 94% for the best classifier.
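The two-stage idea — binarize the residuals of an existing regression model, then learn to predict that label from the descriptors alone — can be sketched as follows. The nearest-centroid classifier here is a deliberately simple stand-in for the linear and nonlinear classifiers the paper actually examines, and the cutoff is an assumed parameter.

```python
from statistics import mean

def label_residuals(residuals, cutoff):
    # residuals from a previously fitted QSAR model: small ones mark
    # compounds the model predicts well, large ones mark poor fits
    return ['good' if abs(r) < cutoff else 'bad' for r in residuals]

def train_centroid_classifier(X, labels):
    # nearest-centroid classifier: a new compound is assigned the
    # class whose mean descriptor vector it lies closest to
    centroids = {lab: [mean(col) for col in
                       zip(*(x for x, l in zip(X, labels) if l == lab))]
                 for lab in set(labels)}
    def predict(x):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2
                                       for a, b in zip(x, centroids[lab])))
    return predict
```

A new compound predicted into the "bad" class is flagged as likely to be poorly predicted by the original regression model, which is how the technique answers the validity question.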
5. Using R to Provide Statistical Functionality for QSAR Modeling in CDK
Guha, R.
CDK News, 2005, 2, 7-13

[ Abstract ]
[ Link ]

4. Using the CDK as a Backend to R
Guha, R.
CDK News, 2005, 2, 2-6

[ Abstract ]
[ Link ]

3. Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors.
Guha, R.; Jurs, P.C.
J. Chem. Inf. Comput. Sci., 2004, 44, 2179-2189

[ Abstract ]
[DOI 10.1021/ci049849f ]

A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model with a training set error of 0.22 log(IC50) units (R2 = 0.93) and a prediction set error of 0.32 log(IC50) units (R2 = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition, the neural network model contains the two most important descriptors indicated by the random forest model.
2. The Development of QSAR Models To Predict and Interpret the Biological Activity of Artemisinin Analogues
Guha, R.; Jurs, P.C.
J. Chem. Inf. Comput. Sci., 2004, 44, 1440-1449

[ Abstract ]
[DOI 10.1021/ci0499469 ]

This work presents the development of Quantitative Structure-Activity Relationship (QSAR) models to predict the biological activity of 179 artemisinin analogues. The structures of the molecules are represented by chemical descriptors that encode topological, geometric, and electronic structure features. Both linear (multiple linear regression) and nonlinear (computational neural network) models are developed to link the structures to their reported biological activity. The best linear model was subjected to a PLS analysis to provide model interpretability. While the best linear model does not perform as well as the nonlinear model in terms of predictive ability, the application of PLS analysis allows for a sound physical interpretation of the structure-activity trend captured by the model. On the other hand, the best nonlinear model is superior in terms of pure predictive ability, having a training error of 0.47 log RA units (R2 = 0.96) and a prediction error of 0.76 log RA units (R2 = 0.88).
1. Generation of QSAR Sets with a Self-Organizing Map.
Guha, R.; Serra, J.R.; Jurs, P.C.
J. Mol. Graph. Model., 2004, 23, 1-14

[ Abstract ]
[DOI 10.1016/j.jmgm.2004.03.003 ]

A Kohonen self-organizing map (SOM) is used to classify a data set consisting of dihydrofolate reductase inhibitors with the help of an external set of Dragon descriptors. The resultant classification is used to generate training, cross-validation (CV) and prediction sets for QSAR modeling using the ADAPT methodology. The results are compared to those of QSAR models generated using sets created by activity binning and a sphere exclusion method. The results indicate that the SOM is able to generate QSAR sets that are representative of the composition of the overall data set in terms of similarity. The resulting QSAR models are half the size of those published and have comparable RMS errors. Furthermore, the RMS errors of the QSAR sets are consistent, indicating good predictive capabilities as well as generalizability.
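The mechanics of using a SOM to partition a data set can be sketched with a minimal 1-D map; this is a toy illustration under my own assumptions, not the ADAPT/Dragon workflow of the paper.

```python
import random

def train_som(data, n_units, epochs=50, lr=0.5):
    # minimal 1-D self-organizing map: each unit holds a weight vector;
    # the unit closest to a sample (the winner) and its immediate
    # neighbors on the line are pulled toward that sample
    rng = random.Random(0)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]

    def winner(x):
        return min(range(n_units),
                   key=lambda u: sum((w - a) ** 2
                                     for w, a in zip(units[u], x)))

    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)   # decaying learning rate
        for x in data:
            win = winner(x)
            for u in range(n_units):
                # full pull for the winner, half for adjacent units
                infl = 1.0 if u == win else 0.5 if abs(u - win) == 1 else 0.0
                if infl:
                    units[u] = [w + rate * infl * (a - w)
                                for w, a in zip(units[u], x)]
    return winner
```

Once every compound is mapped to a unit, training, cross-validation, and prediction sets can be drawn from each unit in proportion to its occupancy, which is the stratification idea that makes the resulting sets representative of the whole data set.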

Written by Rajarshi Guha

June 29th, 2013 at 1:11 pm
