In my previous post I looked at how many collisions in bit positions were observed when generating hashed fingerprints (using the CDK 1024-bit hashed fingerprint and the Java hashCode method). I summarized the results in the form of “bit collision plots” where I plotted the number of times a bit was set to 1 versus the bit position (for a given molecule). As expected, for a series of molecules we observe a number of collisions in multiple bits. What was a little surprising was that even for a symmetric molecule like triphenylphosphine (i.e., a relatively small number of topologically unique paths), we observed collisions in two bits. So I decided to look into this case in a little more detail.
As I noted, collisions could occur if a) different paths get hashed to the same int or b) two different hashes lead to the same random number. Modifying the Fingerprinter code, I was able to generate the list of paths calculated for triphenylphosphine, the hash code for each path and the bit position that was generated for that hash code. The data (hash value, path, bit position) is given below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | -662168118 P-C:C-H 3 -1466409134 H-C:C-P-C:C:C 58 1279033458 C:C:C:C:C-P-C:C:C 78 1434821739 H-C:C:C:C:C:C 80 -429128489 C:C:C:C-P-C 85 -1779215129 C:C:C:C:C 95 -245205916 H-C:C:C-P-C:C:C:C 97 -475263438 C:C:C:C:C:C-P-C:C 111 -1466409532 H-C:C-P-C:C-H 114 1434821341 H-C:C:C:C:C-H 142 -428753296 C:C:C:C:C:C 161 -245206314 H-C:C:C-P-C:C:C-H 161 1730724873 H-C:C:C-P-C 167 43327460 H-C:C:C:C:C-P-C:C 180 1731099668 H-C:C:C:C-H 180 886716569 H-C:C:C:C 182 -1406488727 C:C:C:C:C-P-C:C 191 78342 P-C 193 1731100066 H-C:C:C:C:C 211 63670037 C:C:C 213 178815835 H-C:C:C:C:C-P-C 224 179190630 H-C:C:C:C:C:C-H 230 -469902821 H-C:C-P-C:C:C:C 244 1057365278 C:C:C:C 253 -181369445 H-C:C:C:C-P-C:C 266 1056990085 C:C-P-C 300 886341376 H-C:C-P-C 333 827737424 H-C:C:C 390 -912402715 P-C:C:C:C:C-H 406 1572927053 H-C:C:C-P-C:C-H 421 1434446546 H-C:C:C:C-P-C 440 403512740 H-C:C:C:C:C:C-P-C 442 827737026 H-C:C-H 448 1572927451 H-C:C:C-P-C:C:C 458 -645296786 P-C:C:C:C:C:C-H 493 -688017645 P-C:C:C-H 503 -763796814 C:C:C:C-P-C:C:C:C 512 66252 C:C 574 75288527 P-C:C 600 -605043036 H-C:C-P-C:C:C:C:C 604 1074261266 H-C:C:C-P-C:C 629 -789313769 C:C:C-P-C:C 639 -2139775602 C:C-P-C:C 675 1370539593 H-C:C-P-C:C 698 284569632 C:C:C:C:C-P-C 702 1797624356 H-C:C:C:C-P-C:C:C 710 63294844 C-P-C 719 -912402317 P-C:C:C:C:C:C 725 72 H 741 67 C 742 80 P 744 -1779590322 C:C:C-P-C 762 1797623958 H-C:C:C:C-P-C:C-H 774 347808169 C:C:C:C-P-C:C:C 788 -688017247 P-C:C:C:C 815 240390684 P-C:C:C:C-H 834 -75615648 C:C:C:C-P-C:C 859 240391082 P-C:C:C:C:C 866 886716171 H-C:C:C-H 888 67900359 H-C:C 951 -1046303447 C:C:C:C:C:C-P-C 957 -662167720 P-C:C:C 969 70654 H-C 971 1678681248 C:C:C-P-C:C:C 979 |
First, all the hash codes are unique. So clearly the issue lies in the RNG and indeed, we see the following two paths being mapped to the same random integer.
1 2 | -428753296 C:C:C:C:C:C 161 -245206314 H-C:C:C-P-C:C:C-H 161 |
Does this mean that the two hash values, when used as seeds to the RNG give the same sequence of random ints? Using the code below
1 2 3 4 5 | Random rng1 = new Random(-428753296); Random rng2 = new Random(-245206314); for (int i = 0; i < 5; i++) { System.out.println(rng1.nextInt(1024) + " " + rng2.nextInt(1024)); } |
we generate the first five random integers and we see that they match at the first value but then differ.
1 2 3 4 5 | 161 161 846 40 317 885 461 535 448 982 |
This suggests that instead of using the first random integer from the RNG seeded by a hash value, we use the second random integer. Modifying the code to do this still gives collisions in two bits. Once again, looking at the paths, hashes and bit positions, we see that now, two different paths get mapped to the same bit position.
1 2 | 886341376 H-C:C-P-C 686 1434821341 H-C:C:C:C:C-H 686 |
As before, we look at the sequence of random ints obtained from RNG’s seeded using these hash values. The resultant sequence looks like:
1 2 3 4 5 | 333 142 686 686 905 1022 70 571 177 384 |
So now, the two sequences match at the second value. OK, so what happens if we take the third value from the sequence and use that as a bit position? We get exactly the same behavior (collisions at two bit positions), except that now, when we look at the sequence of random int’s they match at the third value.
This behavior seems a little strange to me – as if there is a pair of seeds such that the “trajectory” of the sequences generated using those seeds will always (?) intersect at a certain point (where point actually corresponds to the n’th element of the sequences).
May be this is a property of random sequences? Or a feature of the Java RNG. I’d love to hear if anybody has insight into this behavior.
Picking 64 random positive integers below 1024, isn’t there only a 13% chance of avoiding a collision?
In [1]: p = 1
In [2]: for i in range(64):
…: p *= (1024 – i) / 1024.0
In [3]: p
Out[3]: 0.13388743455337332
True – but here we’re working two different random number sequences. So each time a bit position is required we’re instantiating a new RNG with a new seed (ie the hash value) and pulling a single random int from it.
Ths I’d have expected the probability of two separate RNG sequences intersecting to be low – but this is not a rigorous conclusion. Hopefully the math will prove me wrong
How is this thought experiment?
If we use a perfect (behaves ideally) random number generator ‘R1’, then the chance of avoiding collisions in the set of 64 unique hashes is 13%.
If we define a new random number generator ‘R2’ with the algorithm “use R1 but burn the first number” then it too must be perfect, and we’d still have 13% chance of avoiding collisions.
Generalize to Rn and I expect 13% of these ideal RNG’s will luckily avoid collisions
But you’re still considering a single RNG for obtaining all the 64 bit positions, right? Thus the R2 you describe is still based on the same sequence as R1 – and so you’re still getting all 64 values from the same sequence.
The fingerprinter code uses 64 *different* random number sequences. and pulls one value from each individual sequence.
I think the logic remains for 64 different seeds to the RNG because my hypothetical random number generators are ideal, which means (by definition) you can’t tell the difference!
Anyway, I’ve enjoyed this peek under the hood of fingerprinting in CDK. I assume other tools do it a similar way?
Yes, finally got my head around it – you’re right, different seeds shouldn’t make a difference
Hi
Very interesting topic. I have faced these challenges while working with fingerprints and here are few observations from my end. By the way I agree that mathematically the best is ~ 13%.
1) hashed FP (CDK) is good enough to separate patterns which are not common but on a very large dataset (in my case 10000+ mols), the performance dropped. Top 1% hits were good but then I started to loose specificity (esp when Tanimoto score was around 0.77).
2) First I thought it was an Tanimoto score but I wasn’t convinced incases where we had rings (close vs open). I ended up writing new FP based on the pubchem patterns as coded in the CDK and added few more patterns to resize it to 1024 from 881. Well! It’s works like magic and I could find much more serialised hits than before. I think the extensions of the fingerprint which I made based on the patterns in my db also helped.
At the end of the day I believe all these searches are heuristic and hashed FP is faster to generate but prone to bit clash where as SMARTS based fps are slower to generate as u sped time in matching patterns but are more sensitive and specific as u get what u see (patterns and bitset relationship is know).
Just a thought…..
Please ignore my previous reply…First answer was a rough draft! Found a lot of mistakes/typos.. courtesy my iphone auto text filler. This one is from my comp .. Rajarshi might like to delete it 😉
Hi
Very interesting topic. I have faced these challenges while working with fingerprints and here are few observations from my end. By the way I agree that mathematically the best bet is ~ 13%.
1) The hashed FP (CDK) is good enough to separate patterns which are not common but on a large dataset (in my case 10000+ mols), the performance drops drastically. Top 1% hits were good but then rest of the started to loose specificity (esp when Tanimoto score was around 0.77).
2) First I thought it was an artifact of the Tanimoto score… but I wasn’t convinced spl. in cases where we had rings (close vs open). I ended up writing a new FP based on the pubchem patterns as coded in the CDK and added few more patterns to resize it to 1024 from 881. Well! It’s works like magic and I could find much more serialised hits than before. I think the extensions of the fingerprint which I made based on the patterns in my db also helped.
At the end of the day, I believe that all these searches are heuristic and hashed FP is faster to generate but prone to bit clashes where as SMARTS based FPs are slower to generate (as u spend time in MCS) in matching patterns but they are more sensitive and specific as you can trace the patterns (u get what u see) as the patterns and bitset relationship is know and static.
Just a thought…..
Asad, if you don’t mind, I’d like tp post your comment as a blog post
[…] a comment to my previous post on bit collisions in hashed fingerprints, Asad reported on some interesting points which would be […]