Which Bits are Important for Similarity Searches?

The recent paper by Wang and Bajorath describes an interesting approach to identifying the important bits in a fingerprint, with respect to a dataset.

Their discussion focuses on structural key type fingerprints (such as MACCS and the BCI fingerprints). The problem they are trying to address is that certain structural features may be more important for similarity searching than others. This is also related to the fact that molecular complexity (i.e., the number of structural features) can lead to bias in similarity calculations [1]. Given a dataset, an easy way to identify the important bits is the so-called consensus approach [2, 3]: basically, find out which bit positions are set to 1 for all (or a specified fraction) of the dataset. While useful, this can be misleading if the target dataset has many molecules with a large number of structural features (so that many bits in the fingerprint will be set to 1).
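To make the consensus idea concrete, here is a minimal Python sketch. It is my own illustration, not code from any of the cited papers: each fingerprint is assumed to be a set of on-bit positions, and the name consensus_bits and the fraction parameter are hypothetical.

```python
from collections import Counter


def consensus_bits(fingerprints, fraction=1.0):
    """Return the bit positions set to 1 in at least `fraction` of the fingerprints.

    Each fingerprint is assumed to be a set of on-bit positions
    (an illustrative representation, not tied to any toolkit).
    """
    # Count how many fingerprints have each bit switched on.
    counts = Counter(bit for fp in fingerprints for bit in fp)
    needed = fraction * len(fingerprints)
    return sorted(bit for bit, count in counts.items() if count >= needed)


# Toy dataset of four fingerprints.
dataset = [{0, 1, 2}, {0, 2, 3}, {0, 2, 5}, {0, 4}]
strict = consensus_bits(dataset)          # bits present in every fingerprint
relaxed = consensus_bits(dataset, 0.75)   # bits present in at least 75% of them
```

Note that nothing here asks whether a consensus bit actually helps retrieval; it only records how often the bit is set, which is exactly why dense fingerprints can inflate the list.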

The method that Wang and Bajorath describe is termed bit silencing and is quite simple. Given a query fingerprint of n bits, they perform the similarity search against a collection of target molecules and retrieve, say, N₀ hits. They then set the first bit in the query fingerprint to 0 and redo the search, this time getting, say, N₁ hits. This is repeated n times, each time setting a different bit position to 0. The idea is that if a bit is truly important for the similarity calculation, then setting it to zero will lower the number of hits retrieved. On the other hand, if the bit (i.e., structural feature) is hindering retrieval of similar compounds, then setting it to zero will increase the number of hits. So in a sense it’s similar to the variable importance measure of random forests. Given the N_i values for each bit position, they then derive a set of weights, one per bit position, which are used in a modified version of the Tanimoto score.
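The silencing loop itself is easy to sketch in Python. This is my own rough illustration under stated assumptions, not the authors' implementation: fingerprints are sets of on-bit positions, a "hit" is any target whose Tanimoto similarity to the query meets a threshold (0.7 is an arbitrary choice), and the function names are hypothetical. It stops at the change in hit count, N_i - N₀, and does not attempt the paper's weight derivation or its modified Tanimoto score.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def count_hits(query, targets, threshold):
    """Number of targets whose similarity to the query meets the threshold."""
    return sum(1 for t in targets if tanimoto(query, t) >= threshold)


def bit_silencing(query, targets, threshold=0.7):
    """Return (N0, {bit: N_i - N0}) for each on-bit of the query.

    Bits that are already 0 in the query are skipped, since silencing
    them cannot change the result.
    """
    n0 = count_hits(query, targets, threshold)
    deltas = {}
    for bit in sorted(query):
        silenced = query - {bit}  # set this one bit position to 0
        deltas[bit] = count_hits(silenced, targets, threshold) - n0
    return n0, deltas


# Toy example: a 4-bit query against five targets.
query = {0, 1, 2, 3}
targets = [{0, 1, 2}, {0, 1, 2, 4}, {1, 2, 3}, {5, 6}, {0, 1, 2, 5}]
n0, deltas = bit_silencing(query, targets)
```

In this toy case a negative delta marks a bit whose removal loses hits (the bit matters), while a positive delta marks a bit that was holding retrieval back. Each silenced bit requires a full pass over the targets, which is where the extra factor of n in the cost comes from.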

They show some interesting results: the weighted Tanimoto score derived from bit silencing seems to result in better retrieval rates. However, one of the most striking observations was that the MACCS key bits highlighted as important for a dataset by the consensus approach do not seem to be very important according to the bit silencing method. In other words, just because a feature happens to be present in everything, it does not necessarily contribute to the ability of the fingerprint (and metric) to retrieve similar compounds.

One of the problems with the method is that it converts a process that is O(m) (where m is the number of molecules) to O(mn). Of course, for a given dataset, n would be fixed, so it’s not that bad. Another issue is deriving the weights for each position. To do this, one must have a training set of target fingerprints on which the bit silencing method is applied. In this sense, it seems to lack generality. Another thing that bothers me a little: how large a training set does one need to get reliable importance measures (or weights)? Their examples used training sets ranging from 80 to 600 compounds. Is there any way to decide whether an importance measure is significant (in a statistical sense)? Clearly, if one is dealing with 20 target fingerprints, then the importance of a bit may not be as reliable as the result obtained from looking at 200 compounds.

Overall, an interesting and novel approach to identifying bits that are important for the retrieval rate in similarity search methods.
