## Archive for October, 2009

## A New Toolkit on the Block

A few days ago I was pointed to a new cheminformatics toolkit called Indigo, written in C++ with source code available under the GPLv3 license. Rich has previously commented on this. While it’s an initial release, it has a number of interesting components, such as an Oracle cartridge, 2D depiction, scaffold detection and R-group generation. Unfortunately I wasn’t able to run the OS X binaries – I’ll give it a try on Linux. The future plans for the library also seem quite interesting, including wrappers for other languages, chemical OCR and so on. It’s also nice to see that they have a page describing the algorithms they implemented – some of them well known, such as the Kabsch alignment method, and others (fingerprints, depictions) that appear to be unique to the toolkit. Unfortunately, most of the descriptions are in Russian, so we’ll have to wait for English translations.

## Drug Discovery Trends In and From the Literature

I came across a recent paper by Agarwal and Searls which describes a detailed bibliometric analysis of the scientific literature to identify and characterize specific research topics that appear to be drivers for drug discovery – i.e., research areas/topics in the life sciences that exhibit significant activity and thus might be fruitful for drug discovery efforts. The analysis is based on PubMed abstracts, citation data and patents, and addresses research topics ranging from very broad WHO disease categories (such as respiratory infections and skin diseases) to more specific topics via MeSH headings, down to the level of individual genes and pathways. The details of the methodology are described in a separate paper, but this one provides some very interesting analyses and visualizations of drug discovery trends.

The authors start out by defining the idea of “push” and “pull”:

Unmet medical need and commercial potential, which are considered the traditional drivers of drug discovery, may be seen as providing ‘pull’ for pharmaceutical discovery efforts. Yet, if a disease area offers no ‘push’ in the form of new scientific opportunities, no amount of pull will lead to new drugs — at least not mechanistically novel ones …

The authors then describe how they characterized “push” and “pull”. The key premise of the paper is that publication rates are an overall measure of activity and, when categorized by topic (disease area, target, pathway, etc.), represent the activity in that area. Of course, there are many factors that determine why certain therapeutic or target areas receive more interest than others, and the authors clearly state that even their concepts of “push” and “pull” likely overlap and thus are not independent.

For now I’ll just highlight some of their analysis from the “pull” category. For example, using therapeutic areas from a WHO list (that characterized disease burden) and PubMed searches to count the number of papers in a given area, they generated the plot below. The focus on global disease burden and developed world disease burden was based on the assumption that the former measures general medical need and the latter measures commercial interest.

Another interesting summary was the rate of change of publication in a given area (this time defined by MeSH heading). This analysis used a 3 year window over a 30 year period and highlighted some interesting short term trends – such as the spurt in publications for viral research around the early 80’s which likely corresponded to discovery of retroviral involvement in human disease and then later by the AIDS epidemic. The data is summarized in the figure below, where red corresponds to a spurt in activity and blue to a stagnation in publication activity.

They also report analyses that focus on high impact papers, based on a few of the top tier journals (as identified by impact factor) – while their justification is reasonable, it does have downsides, which the authors highlight. Their last analysis focuses on specific diseases (based on MeSH disease subcategories) and individual protein targets, comparing the rate of publications in the short term (2 year windows) versus the medium term (5 year windows). The idea is that this allows one to identify areas (or targets, etc.) that exhibit consistent, accelerating or decelerating growth in interest. The resultant plots are quite interesting – though I wonder about the noise involved when going to something as specific as individual targets (identified by looking for gene symbols and synonyms).

While bibliometric analysis is a high level approach, the authors make a compelling case that the technique can identify and summarize the drug discovery research landscape, and they do a great job of rationalizing their results. Of course, the analyses are not prospective (i.e., they cannot tell us which area we should pour money into next). I think one of the key features of the work is that it quantitatively characterizes research output and is able to link this to various other efforts (specific funding programs, policy, etc.) – with the caveat that there are many confounding factors and causal effects.

## A Custom Palette for Heatmaps

Heatmaps are a common way to visualize matrices and R provides a variety of methods to generate these diagrams. One of the key features of a heatmap is the color scheme employed. By default the image method uses heat.colors, which ranges from red (lowest values) to white (highest values). Other palettes include rainbow and topo.colors. However, I needed to replicate a color scheme used in a previous publication – dark red for low values and white for high values, passing through shades of pink. While RColorBrewer is an extremely useful package for generating sensible and informative color schemes, it didn’t help me out here – the red palette is limited to 9 colors, whereas I needed a larger range. There’s also the colorschemes method from the ClassDiscovery package, but I haven’t checked that out.

So I ended up making my own color scheme, realizing that shades of pink are a constant red coupled with increasing amounts of blue and green (in the RGB scheme). Thus, generating a matrix of R, G and B values with a constant red component and increasing blue and green components, and then converting the rows to hex colors (via rgb), gives me the required colors:

```r
rgbs <- cbind(255, seq(32, 255, length=256), seq(32, 255, length=256)) / 255
cols <- rgb(rgbs[,1], rgbs[,2], rgbs[,3])
```

and an example of the color scheme

```r
image(matrix(1:256, nrow=64, byrow=TRUE), col=cols)
```
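As an aside, a similar palette can be obtained by interpolation with colorRampPalette from the grDevices package. This is just a sketch – the endpoint colors below are my own illustrative choices, not those of the original publication:

```r
## interpolate from dark red through pink to white;
## the specific endpoint colors are illustrative assumptions
pal <- colorRampPalette(c('darkred', 'pink', 'white'))
cols <- pal(256)
image(matrix(1:256, nrow=64, byrow=TRUE), col=cols)
```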

## Correlating Continuous and Categorical Variables

At work, a colleague gave an interesting presentation on characterizing associations between continuous and categorical variables. I expect that I will be facing this issue in some upcoming work so was doing a little reading and made some notes for myself.

Given a continuous variable Y and a categorical variable G, is the distribution of Y independent of the levels of G? One can address this using parametric or non-parametric methods. Due to the superior power of parametric methods, we start by considering them; only if we completely fail to satisfy their distributional assumptions do we move to the non-parametric methods.

A simple approach is to convert G to an indicator variable matrix, such that each level of G becomes a binary vector, and then use this matrix and Y to fit a linear model. One can then perform an ANOVA on the fit to check whether Y is independent of the groups. This approach assumes that Y is normally distributed, and that variances are equal amongst the groups.
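As a small illustration (using a toy factor, not the dataset below), the indicator matrix that R constructs from such a formula can be inspected via model.matrix, with the first factor level serving as the baseline:

```r
## toy example: each non-baseline level of the factor
## becomes a binary column in the design matrix
g <- factor(c('A', 'C', 'G', 'T', 'A'))
model.matrix(~ g)
```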

A simple example in R is as follows. We first create three datasets: one in which there is no difference in mean or variance across the groups, one in which the means differ and one in which the variances differ.

```r
G <- factor(sample(c('A', 'T', 'C', 'G'), 100, replace=TRUE))

## no dependence on the groups
Y1 <- rnorm(100, mean=10, sd=3)

## shift the mean for the 'G' group
Y2 <- Y1
Y2[which(G == 'G')] <- rnorm(sum(G == 'G'), mean=4, sd=3)

## increase the variance for the 'G' group
Y3 <- Y1
Y3[which(G == 'G')] <- rnorm(sum(G == 'G'), mean=10, sd=6)
```

A summary of the Y’s grouped by G is given below.

We now run the ANOVA, using the first dataset.

```r
fit <- lm(Y1 ~ G, data=data.frame(Y1, G))
anova(fit)
```

Running this gives us

```
Analysis of Variance Table

Response: Y1
          Df Sum Sq Mean Sq F value Pr(>F)
G          3   0.74    0.25  0.0304 0.9928
Residuals 96 777.19    8.10
```

If we consider the 0.05 significance level, we would not reject the null hypothesis that Y is independent of G. If we now rerun the ANOVA using the second dataset

```r
fit <- lm(Y2 ~ G, data=data.frame(Y2, G))
anova(fit)
```

we get

```
Analysis of Variance Table

Response: Y2
          Df  Sum Sq Mean Sq F value    Pr(>F)
G          3 1131.65  377.22  49.321 < 2.2e-16 ***
Residuals 96  734.22    7.65
```

which suggests that we reject the null hypothesis and accept the alternative that the Y’s are not independent of the groups (which is true by design).

Furthermore, the ANOVA really only indicates whether there is a dependence between the groups and the group means – it doesn’t tell us anything about lack of independence in terms of scale parameters. This is not surprising, since the linear model is defined to have constant variance. Thus, if we want to investigate whether the variance depends on the groups, we should look at the distribution of residuals or other approaches to identifying heteroscedasticity (such as Levene’s test).
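As a sketch of such a check (assuming G and Y3 as constructed above), the Brown-Forsythe variant of Levene’s test is simply an ANOVA on the absolute deviations from the group medians:

```r
## assumes G and Y3 from the earlier example
## Brown-Forsythe version of Levene's test: ANOVA on the
## absolute deviations from each group's median
absdev <- abs(Y3 - ave(Y3, G, FUN=median))
anova(lm(absdev ~ G))
```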

In case we cannot satisfy the assumptions required for ANOVA (even after investigating possible transformations of the data), we can move to non-parametric methods, which are easy to run but have lower power. For this problem, the Kruskal-Wallis test and the Fligner-Killeen test come in handy. These methods are in the base **stats** package but also come with the coin package, which is handy for running a variety of independence tests.

The Kruskal-Wallis test makes no assumptions about normality, but does assume that the distributions amongst the levels of G are of the same shape and have the same variance (so that any differences are due to differences in the means). The Fligner-Killeen test also avoids the normality assumption, but in this case the null hypothesis is that the variances in the groups are the same. Thus, ideally, we’d run the Kruskal-Wallis test first and, if we could not reject the null hypothesis, we’d then investigate whether there is a difference in the variances across the groups with the Fligner-Killeen test.

Some examples illustrate the analysis. First, we consider the dataset that is independent of the groups (Y1):

```
> kruskal_test(Y1 ~ G, data=data.frame(Y1, G))

        Asymptotic Kruskal-Wallis Test

data:  Y1 by G (A, C, G, T)
chi-squared = 0.0628, df = 3, p-value = 0.9959
```

which indicates that we fail to reject the null hypothesis that the distribution of Y1 is independent of the levels of G. We then consider the dataset where the means are dependent on the groups

```
> kruskal_test(Y2 ~ G, data=data.frame(Y2, G))

        Asymptotic Kruskal-Wallis Test

data:  Y2 by G (A, C, G, T)
chi-squared = 57.3197, df = 3, p-value = 2.196e-12
```

which indicates that we can reject the null hypothesis. However, note that for Y3, where the means are independent of the groups but the variances differ, the Kruskal-Wallis test is not useful:

```
> kruskal_test(Y3 ~ G, data=data.frame(Y3, G))

        Asymptotic Kruskal-Wallis Test

data:  Y3 by G (A, C, G, T)
chi-squared = 1.5399, df = 3, p-value = 0.673
```

On the other hand, if we run the Fligner-Killeen test on these three datasets, we correctly identify Y3 as having its variance dependent on G

```
> fligner_test(Y1 ~ G, data=data.frame(Y1, G))

        Asymptotic Fligner-Killeen Test

data:  Y1 by G (A, C, G, T)
chi-squared = 2.9914, df = 3, p-value = 0.3929

> fligner_test(Y2 ~ G, data=data.frame(Y2, G))

        Asymptotic Fligner-Killeen Test

data:  Y2 by G (A, C, G, T)
chi-squared = 2.8202, df = 3, p-value = 0.4202

> fligner_test(Y3 ~ G, data=data.frame(Y3, G))

        Asymptotic Fligner-Killeen Test

data:  Y3 by G (A, C, G, T)
chi-squared = 15.3493, df = 3, p-value = 0.001541
```
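For completeness, base R provides the same tests in the **stats** package (kruskal.test and fligner.test); on the datasets above they should lead to essentially the same conclusions as the coin versions:

```r
## base-R equivalents of coin's kruskal_test and fligner_test
kruskal.test(Y2 ~ G)  # should reject: the means depend on G
fligner.test(Y3 ~ G)  # should reject: the variances depend on G
```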

## Oracle Notes

Some handy settings when running a query from the command line via sqlplus:

```sql
set echo off
set heading on
set linesize 1024
set pagesize 0
set tab on
set trims on
set wrap off

-- might want to set column formats here
-- e.g.: column foo format A10

spool stats  -- dump results to stats.lst

-- SQL query here

spool off
exit
```