Datasets for Virtual Screening Benchmarks

Virtual screening (VS) is a common task in the drug discovery process: a computational method to identify promising compounds from a collection of hundreds to millions of possible compounds. What “promising” means depends on the context. It might be compounds that are likely to exhibit certain pharmacological effects, compounds that are expected to be non-toxic, or combinations of these and other properties. Many methods are available for virtual screening, including similarity searching, docking and predictive models.

So, given the plethora of methods, which one do we use? Many factors affect the choice of a VS method, including availability, price and computational cost. But in the end, deciding whether one method is better than another depends on benchmarks. VS benchmarks have two components: the metric employed to decide whether one method is better than another, and the data used for benchmarking. This post focuses on the latter.

The goal of benchmarking is to check whether a method actually identifies the relevant molecules in the collection. Most approaches to benchmarking consider a class of molecules (say thrombin inhibitors or COX inhibitors). The procedure takes a few of the actives, denoted the query molecules, and places the rest of the actives in the target set. Next, a large collection of compounds that are known to be inactive against the target of interest is added to the target set. These compounds are considered decoys or background molecules. The benchmark is then run, and the goal is to see which method identifies the largest number of true actives in the target set given the query actives. However, it’s not just the ability to identify the true actives that is important. The key is to identify them efficiently: that is, to find the true actives and not the decoy molecules (which would be regarded as false positives), and to find them by looking at as small a portion of the target collection as possible.
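The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: molecules are represented as toy feature sets standing in for real fingerprints, the VS method is a simple maximum-Tanimoto similarity search against the queries, and the function names (`build_benchmark`, `rank_target`) are hypothetical rather than taken from any library.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_benchmark(actives, decoys, n_queries, seed=0):
    """Hold out a few actives as queries; hide the rest among the decoys."""
    rng = random.Random(seed)
    shuffled = list(actives)
    rng.shuffle(shuffled)
    queries = shuffled[:n_queries]
    hidden_actives = shuffled[n_queries:]
    # Each target entry is (fingerprint, is_active).
    target = [(fp, True) for fp in hidden_actives]
    target += [(fp, False) for fp in decoys]
    rng.shuffle(target)
    return queries, target


def rank_target(queries, target):
    """Rank the target set by maximum similarity to any query, best first.

    Returns only the activity labels, in ranked order.
    """
    scored = [
        (max(tanimoto(fp, q) for q in queries), is_active)
        for fp, is_active in target
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [is_active for _, is_active in scored]


# Toy data (assumption): actives share a common core {1, 2, 3},
# decoys share no features with the actives at all.
actives = [frozenset({1, 2, 3, 10 + i}) for i in range(5)]
decoys = [frozenset({100 + i, 200 + i, 300 + i}) for i in range(20)]

queries, target = build_benchmark(actives, decoys, n_queries=2)
ranking = rank_target(queries, target)
print(ranking[:5])
```

In this artificial setup the hidden actives separate cleanly from the decoys, so they appear at the top of the ranked list. Real benchmarks are interesting precisely because that separation is far from perfect.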

Given the results of a method, a common way to quantitatively compare VS methods is to generate enrichment curves and enrichment factors. It is well known that enrichment factors suffer from some drawbacks, and alternatives have been discussed by Truchon et al. and Clark and Webster. I won’t go into those details here.
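For readers who have not met these quantities, here is a minimal sketch using the usual textbook definitions, which the post itself does not spell out. The enrichment factor at a given fraction of the ranked list is the hit rate in that top fraction divided by the hit rate in the whole collection; the enrichment curve plots the fraction of actives recovered against the fraction of the list screened. The function names and the example ranking are made up for illustration.

```python
def enrichment_factor(ranking, fraction):
    """Enrichment factor in the top `fraction` of a ranked list.

    `ranking` is a list of booleans, best-scored compound first,
    where True marks a true active.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(ranking)
    n_actives = sum(ranking)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty ranking with at least one active")
    # Number of compounds inspected; always look at one compound at least.
    n_selected = max(1, round(fraction * n_total))
    hits = sum(ranking[:n_selected])
    return (hits / n_selected) / (n_actives / n_total)


def enrichment_curve(ranking):
    """Points (fraction screened, fraction of actives found) down the list."""
    n_total = len(ranking)
    n_actives = sum(ranking)
    curve = []
    found = 0
    for i, is_active in enumerate(ranking, start=1):
        found += is_active
        curve.append((i / n_total, found / n_actives))
    return curve


# Made-up ranking of 10 compounds containing 3 actives.
example = [True, True, False, True, False,
           False, False, False, False, False]

ef_20 = enrichment_factor(example, 0.2)  # top 2 compounds
ef_50 = enrichment_factor(example, 0.5)  # top 5 compounds
curve = enrichment_curve(example)
print(ef_20, ef_50)
```

An enrichment factor close to 1 means the method does no better than picking compounds at random, and the largest value it can reach depends on the ratio of actives to decoys, which is one of the drawbacks alluded to above.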

But what data is used for these benchmarks? Many VS benchmarks have employed the MDDR, a commercial database of compounds divided into several activity classes (based on biological targets). There are several problems with this dataset. First, it contains many close analogs. This leads to analog bias and unfairly favors 2D methods. Second, the database does not contain true inactives. It reports endpoints, and while one can categorize compounds based on endpoints, there is no explicit notion of inactive compounds. Third, it is a proprietary dataset, so unless one has a license one cannot reproduce published work that uses the MDDR. A slightly more open source is WOMBAT, but it is still not a completely open public database.

In terms of completely public datasets, one of the best known is DUD. This dataset covers 40 targets and is constructed so that the decoys resemble the known actives physically but not topologically (thus avoiding the unfair advantage given to 2D methods). A new addition to the collection of VS benchmark datasets comes from Rohrer and Baumann. Their work analyses the topology of chemical spaces and its effect on VS benchmarks. Besides a very interesting analysis, on which I will blog later, they also published 17 VS benchmark datasets that are completely public. The data is derived from PubChem and covers various activity classes such as HSP90 inhibitors and Cathepsin G inhibitors. While the activity classes do not cover the common drug target classes used in the MDDR, they still represent a welcome addition to the VS benchmark data collection. The datasets can be obtained from their website. Each dataset has 30 actives and 15,000 decoys, and all datasets have a relatively high diversity of scaffolds; they have been designed to try to avoid the analog bias problem. Regarding the assignment of compounds as actives or decoys, they note:

The assignment of actives and decoys is based on pairs of primary and confirmatory assays against the same target. Compounds that were found active in the confirmation screen were considered potential actives. These compounds were further filtered for aggregators, frequent hitters and compounds interfering with the optical detection methods of the assays.

Hopefully more studies will employ these datasets, making their results easier to reproduce.


So much to do, so little time

Trying to squeeze sense out of chemical data
