Hi

Very interesting topic. I have faced these challenges while working with fingerprints, and here are a few observations from my end. By the way, I agree that mathematically the best bet is ~13%.

1) The hashed FP (CDK) is good enough to separate patterns that are not common, but on a large dataset (in my case 10000+ mols) the performance drops drastically. The top 1% of hits were good, but the rest started to lose specificity (esp. when the Tanimoto score was around 0.77).

2) At first I thought it was an artifact of the Tanimoto score, but I wasn’t convinced, especially in cases where we had rings (closed vs. open). I ended up writing a new FP based on the PubChem patterns as coded in the CDK, and added a few more patterns to extend it from 881 to 1024 bits. Well! It works like magic, and I could find many more serialised hits than before. I think the extensions of the fingerprint which I made based on the patterns in my db also helped.

At the end of the day, I believe that all these searches are heuristic. A hashed FP is faster to generate but prone to bit clashes, whereas SMARTS-based FPs are slower to generate (as you spend time in MCS matching patterns) but are more sensitive and specific: you can trace the patterns (you get what you see), because the relationship between patterns and bits is known and static.
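To make the bit-clash trade-off a bit more concrete, here is a minimal Python sketch. It is not CDK code: the function names (`tanimoto`, `hashed_fp`, `pattern_fp`), the toy feature strings and the five-entry pattern dictionary are all made up for illustration, and real pattern matching would need a SMARTS engine rather than plain string lookup.

```python
import zlib

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def hashed_fp(features, n_bits):
    """Toy hashed fingerprint: every feature string is hashed to one bit.

    Two different features can land on the same bit (a bit clash), so the
    bit -> feature relationship is not recoverable.
    """
    return {zlib.crc32(f.encode()) % n_bits for f in features}

def pattern_fp(features, dictionary):
    """Toy dictionary fingerprint: bit i is on iff dictionary[i] is present.

    The bit -> pattern relationship is fixed and known, so no clashes occur.
    """
    return {i for i, pat in enumerate(dictionary) if pat in features}

# Hypothetical feature sets standing in for two molecules.
mol_a = {"C-C", "C=O", "c1ccccc1"}
mol_b = {"C-C", "C-N", "c1ccccc1"}
dictionary = ["C-C", "C=O", "C-N", "c1ccccc1", "C-O"]

print("pattern FP Tanimoto:", tanimoto(pattern_fp(mol_a, dictionary),
                                       pattern_fp(mol_b, dictionary)))
print("hashed FP Tanimoto: ", tanimoto(hashed_fp(mol_a, 8),
                                       hashed_fp(mol_b, 8)))
```

With the dictionary version every on bit can be traced back to a named pattern; with the hashed version the score depends on where the features happen to land, which is the loss of specificity I was seeing.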

Just a thought…..

Anyway, I’ve enjoyed this peek under the hood of fingerprinting in CDK. I assume other tools do it a similar way?

The fingerprinter code uses 64 *different* random number sequences and pulls one value from each individual sequence.
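As I read that description, each hash seeds its own generator and only the first value of that sequence is used. A rough Python analogue follows. The CDK itself is Java, so this will not reproduce its bits; `bit_for_hash`, `bits_for_hashes`, the made-up hash values and the 1024-bit size are assumptions for illustration only.

```python
import random

FP_SIZE = 1024  # assumed fingerprint length

def bit_for_hash(path_hash, size=FP_SIZE):
    """Seed a fresh generator with the hash and take only its first value."""
    return random.Random(path_hash).randrange(size)

def bits_for_hashes(path_hashes, size=FP_SIZE):
    """Map each distinct hash to a bit position; clashes shrink the result."""
    return {bit_for_hash(h, size) for h in set(path_hashes)}

hashes = range(1000, 1064)  # 64 made-up "path hashes"
bits = bits_for_hashes(hashes)
print(len(bits), "distinct bits from", len(hashes), "hashes")
```

If fewer than 64 distinct bits come out, at least two of the 64 separate sequences started on the same value.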

If we use a perfect (ideally behaved) random number generator ‘R1’, then the chance of avoiding collisions in the set of 64 unique hashes is 13%.

If we define a new random number generator ‘R2’ with the algorithm “use R1 but burn the first number”, then it too must be perfect, and we’d still have a 13% chance of avoiding collisions.

Generalize to Rn, and I expect 13% of these ideal RNGs will luckily avoid collisions.
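A quick way to sanity-check that figure is to simulate it. The sketch below uses Python's `random` module as a stand-in for the ideal generator; the function name `no_collision_rate`, the trial count and the seed are arbitrary choices of mine.

```python
import random

def no_collision_rate(n_values=64, n_bits=1024, trials=20000, seed=0):
    """Estimate how often n_values uniform draws from n_bits slots are all distinct."""
    rng = random.Random(seed)
    clean = 0
    for _ in range(trials):
        drawn = {rng.randrange(n_bits) for _ in range(n_values)}
        if len(drawn) == n_values:  # no two draws hit the same bit
            clean += 1
    return clean / trials

print(no_collision_rate())
```

The estimate should land near 0.13, with some scatter from run to run if the seed is changed.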

Thus I’d have expected the probability of two separate RNG sequences intersecting to be low – but this is not a rigorous conclusion. Hopefully the math will prove me wrong.

In [1]: p = 1

In [2]: for i in range(64):
   ...:     p *= (1024 - i) / 1024.0
   ...:

In [3]: p
Out[3]: 0.13388743455337332