Question

I have two lists (sizes m and n) containing high-dimensional bit vectors. All vectors have the same number of dimensions d, and I use Hamming distance as the measure of distance.

For each element in the first list I want to find the closest elements in the second list. Such a closest element may differ by several thousand bits from the element I’m searching for.

The naive approach would be to compute the Hamming distance for each pair of vectors, but that has runtime O(m*n), which makes it infeasible. So I’m looking for an algorithm that’s significantly faster.
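For reference, here is a minimal sketch of the naive approach I mean, assuming the vectors are stored bit-packed as NumPy `uint8` rows (for example via `np.packbits`). The names `POPCOUNT` and `hamming_nearest` are just illustrative, not from any library.

```python
import numpy as np

# 256-entry lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def hamming_nearest(queries, database):
    """Brute force: for each packed query row, return the index of the
    closest packed database row and the Hamming distance to it."""
    best_idx = np.empty(len(queries), dtype=np.int64)
    best_dist = np.empty(len(queries), dtype=np.int64)
    for i, q in enumerate(queries):
        # XOR marks the differing bits; the table lookup counts them per byte.
        dists = POPCOUNT[np.bitwise_xor(database, q)].sum(axis=1)
        best_idx[i] = np.argmin(dists)
        best_dist[i] = dists[best_idx[i]]
    return best_idx, best_dist
```

Every query scans the whole second list, so the work is proportional to m*n*d bits, which is exactly the cost I need to avoid.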

Let's say I have d=10000, m=1 billion and n=100 billion, and I want the algorithm to terminate in a couple of CPU days.
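To put numbers on why brute force is out of the question, here is a rough back-of-the-envelope calculation. It assumes "a couple of CPU days" means two days on one core and that vectors are processed in 64-bit words; both are assumptions made only for illustration.

```python
import math

d = 10_000             # bits per vector
m = 10**9              # size of the first list
n = 10**11             # size of the second list

pairs = m * n                          # distance computations needed by brute force
words_per_vector = math.ceil(d / 64)   # 64-bit words per vector (assumed word size)
word_ops = pairs * words_per_vector    # XOR + popcount steps in total

budget_seconds = 2 * 24 * 60 * 60      # "a couple" taken as two CPU days (assumption)
required_pairs_per_second = pairs / budget_seconds

print(f"pairs:              {pairs:.3e}")
print(f"64-bit word ops:    {word_ops:.3e}")
print(f"pairs per second:   {required_pairs_per_second:.3e}")
```

The required rate comes out many orders of magnitude beyond what a single core can do, so only an algorithm that avoids looking at most pairs can meet the budget.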

The elements in the first list are created by taking a random element from the second list and flipping each bit with the same probability p < 0.5. I want to support values of p that are as close as possible to 0.5. I’m fine with probabilistic algorithms that find matches with high probability.
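The generation process can be sketched at toy scale as follows. This is only an illustration: `make_queries` is a hypothetical helper, the bits are kept unpacked (one `uint8` per bit) for readability, and the second list is filled with uniformly random bits, which is an assumption I have not stated above.

```python
import numpy as np

def make_queries(database_bits, m, p, seed=0):
    """Build m queries: pick a random database row and flip each of its
    bits independently with probability p. Returns (queries, source_rows)."""
    rng = np.random.default_rng(seed)
    n, d = database_bits.shape
    src = rng.integers(0, n, size=m)             # source row for each query
    flips = (rng.random((m, d)) < p).astype(np.uint8)
    queries = database_bits[src] ^ flips         # XOR applies the bit flips
    return queries, src

# Toy-sized example; the real sizes are far larger.
rng = np.random.default_rng(1)
database_bits = rng.integers(0, 2, size=(50, 10_000), dtype=np.uint8)
queries, src = make_queries(database_bits, m=20, p=0.25)
dist_to_source = (queries != database_bits[src]).sum(axis=1)
```

The distance from a query to its source row is binomially distributed with mean d*p. If the second list were uniformly random, the distance to an unrelated row would concentrate around d/2, so the gap that has to be detected shrinks as p approaches 0.5.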

Finding similar high dimensional vectors
