Datasets for Virtual Screening Benchmarks

Virtual screening (VS) is a common task in the drug discovery process: a computational method to identify promising compounds from a collection of hundreds to millions of possible compounds. What “promising” means depends on the context. It might mean compounds that will likely exhibit certain pharmacological effects, compounds that are expected to be non-toxic, or combinations of these and other properties. Many methods are available for virtual screening, including similarity searching, docking and predictive models.

So, given the plethora of methods, which one do we use? Many factors affect the choice of VS method, including availability, price, computational cost and so on. But in the end, deciding whether one method is better than another depends on benchmarks. VS benchmarks have two key features: the metric employed to decide whether one method is better than another, and the data used for benchmarking. This post focuses on the latter.


Which Bits are Important for Similarity Searches?

The recent paper by Wang and Bajorath describes an interesting approach to identifying the important bits in a fingerprint with respect to a dataset.

Their discussion focuses on structural key fingerprints (such as MACCS and the BCI fingerprints), and the problem they are trying to address is that certain structural features may be more important for similarity searching than others. This is also related to the fact that molecular complexity (i.e., the number of structural features) can lead to bias in similarity calculations [1]. Given a dataset, an easy way to identify the important bits is the so-called consensus approach [2, 3]: basically, find out which bit positions are set to 1 for all (or a specified fraction) of the molecules in the dataset. While useful, this can be misleading if the target dataset has many molecules with a large number of structural features (so that many bits in the fingerprint will be set to 1).
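To make the consensus idea concrete, here is a minimal sketch in Python. It is not the method from the paper, just the simple "which bits are set in at least a given fraction of the dataset" calculation described above. The function name `consensus_bits`, the `min_fraction` parameter and the toy 6-bit fingerprints are all made up for illustration.

```python
from typing import List, Sequence


def consensus_bits(fingerprints: Sequence[Sequence[int]],
                   min_fraction: float = 1.0) -> List[int]:
    """Return the bit positions set to 1 in at least `min_fraction`
    of the fingerprints (1.0 means the bit must be set in all of them)."""
    if not fingerprints:
        return []
    n_bits = len(fingerprints[0])
    counts = [0] * n_bits
    for fp in fingerprints:
        if len(fp) != n_bits:
            raise ValueError("all fingerprints must have the same length")
        for i, bit in enumerate(fp):
            if bit:
                counts[i] += 1
    n = len(fingerprints)
    return [i for i, c in enumerate(counts) if c / n >= min_fraction]


# Toy 6-bit fingerprints (made-up data, not real structural keys)
toy = [
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
]

strict = consensus_bits(toy)         # bits set in every molecule
relaxed = consensus_bits(toy, 0.6)   # bits set in at least 60% of molecules
print(strict, relaxed)
```

Note how lowering the threshold pulls in more bits. A dataset dominated by feature-rich molecules would push many more positions over the threshold, which is exactly the bias mentioned above.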


AJAX’ified Pub3D

Pub3D is a 3D version of PubChem, in which we have generated a single conformer for 99% of PubChem using the smi23d suite of programs. The structures are then stored in a PostgreSQL database along with their distance moment shape descriptors, as described by Ballester and Richards. This allows us to perform shape similarity queries against a user-supplied 3D structure. By partitioning the database (thanks to the CGL folks at IU) and using a spatial index, performance is quite snappy. (I had briefly mentioned this in a presentation at the ACS meeting last spring.)
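For readers unfamiliar with distance moment descriptors, the following is a rough Python sketch of the general idea, not the code behind Pub3D. It assumes the usual four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one) and summarizes the atom distances to each point by three moments, giving 12 numbers per structure. The exact choice and normalization of the moments varies between implementations, so the one used here (mean, variance, third central moment) is an assumption, as are the function names and the toy coordinates.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def _dist(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _moments(values: Sequence[float]) -> List[float]:
    # Assumed moment definitions: mean, variance, third central moment
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, var, third]


def shape_descriptors(coords: Sequence[Point]) -> List[float]:
    """12 distance moment descriptors for one conformer."""
    n = len(coords)
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(3))
    closest = min(coords, key=lambda c: _dist(c, centroid))
    farthest = max(coords, key=lambda c: _dist(c, centroid))
    far_from_far = max(coords, key=lambda c: _dist(c, farthest))
    desc: List[float] = []
    for ref in (centroid, closest, farthest, far_from_far):
        desc.extend(_moments([_dist(c, ref) for c in coords]))
    return desc


def shape_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Score in (0, 1]; 1 means identical descriptors."""
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 / (1.0 + mean_abs_diff)


# Toy coordinates (made up): a compact shape, a shifted copy, a linear chain
blob = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.0, 0.0),
        (0.0, 0.0, 2.5), (1.0, 1.0, 1.0)]
shifted = [(x + 5.0, y - 2.0, z + 3.0) for x, y, z in blob]
chain = [(float(i), 0.0, 0.0) for i in range(5)]

d_blob = shape_descriptors(blob)
print(shape_similarity(d_blob, shape_descriptors(shifted)))  # same shape
print(shape_similarity(d_blob, shape_descriptors(chain)))    # different shape
```

Because each structure is reduced to a short fixed-length vector, a similarity query becomes a nearest-neighbor search in a low-dimensional space, which is the kind of query a spatial index handles well.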

The database had been down for some time, so today I got it back up and running and AJAX’ified the interface to make it look a little nicer. jQuery rocks! (OK, the color scheme sucks.)

There are obvious drawbacks to the current database: a single-conformer shape search is not very rigorous, especially since the stored structures are not necessarily the minimum energy conformers. However, we have started generating multiple conformers, so hopefully we’ll address this issue in time. The bigger issue is how this approach to shape similarity compares to other well-known approaches such as ROCS. Clearly, a shape descriptor approach is lower resolution than a volumetric approach such as ROCS, so in that sense the results are ‘rougher’. However, visual inspection of some searches seems to indicate that it isn’t too bad. The paper describing these shape descriptors didn’t do a rigorous comparison, so that’s on our TODO list.

OK, the fun part (a.k.a. coding) is done for now – got to get back to the paper.

Locality of References in a Paper

The other day I was reading a paper and, as is my habit, I kept flipping to the bibliography to see which papers were being cited. Since this was an ACS journal, the references were listed in the order in which they occurred in the text. When the authors were discussing a point in the paper, they’d usually include a number of references. Given the ordering of the references, this implies that related references are grouped together in the bibliography.

This set me thinking – given a set of references and their citations within a paper, we can capture relationships between the references in various ways. Most obviously, one might analyze the cited papers (either in whole, or in part, such as just the abstract or title) and draw conclusions.

However, the fact that the authors of the paper considered references X, Y and Z to be related to a specific point already provides us with some information. Thus, in a bibliography where references are ordered by first occurrence, can we use the “locality” of the references in the list to draw any conclusions? One could employ some form of sliding window and look at groups of references. The key thing here would be to have a way to characterize a reference, so it’d probably require access to the title (or better, the abstract or full text) of the paper being cited. I will admit that I’m not sure what sort of conclusions one might draw from such an analysis, but it was interesting to observe “local behavior” in a list of references.
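As a rough illustration of the sliding window idea, here is a small Python sketch. It characterizes each reference by the longer words in its title and reports, for each window of consecutive references, the words that occur in more than one title. The function name, the parameters and the titles themselves are all invented for the example; a real analysis would need a much better way to characterize a reference.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def shared_terms_by_window(titles: Sequence[str], size: int = 3,
                           min_count: int = 2) -> List[Tuple[int, List[str]]]:
    """Slide a window of `size` consecutive references over the list and
    return, per window start index, the words (longer than 3 letters)
    that appear in at least `min_count` titles within that window."""
    results = []
    for start in range(len(titles) - size + 1):
        counts: Counter = Counter()
        for title in titles[start:start + size]:
            # Count each word once per title
            counts.update({w for w in title.lower().split() if len(w) > 3})
        shared = sorted(w for w, c in counts.items() if c >= min_count)
        results.append((start, shared))
    return results


# Made-up reference titles, listed in order of first citation
titles = [
    "Docking benchmarks for kinase targets",
    "Consensus scoring in docking studies",
    "Docking pose prediction accuracy",
    "Fingerprint similarity search methods",
    "Bit selection for fingerprint similarity",
]

windows = shared_terms_by_window(titles)
for start, shared in windows:
    print(start, shared)
```

In this toy list the shared terms shift from docking to fingerprint similarity as the window moves down the bibliography, which is the sort of “local behavior” described above.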

I haven’t followed work in bibliometrics, but I’m sure someone has already thought of this and looked into it. If anybody has heard of work like this, I’d appreciate any pointers.

(Of course, this is all moot if we can’t easily access the paper itself.)

Moving to SlideShare

Finally got round to putting a number of my slides onto SlideShare. While I was skeptical initially, I’ve found it quite handy for quickly browsing through a presentation without having to download PDFs or PPTs and start up the viewers. It also means I don’t have to maintain a webpage listing all the presentations I’ve made, though that page will probably still be there for the near future. In addition, a new SlideShare widget lets me put up a single interface to all my presentations. It looks like a nice way to let users browse all the presentations (or a subset of them, if I do some grouping).