Finding similar high dimensional vectors

  softwareengineering

I have two lists (sizes m and n) containing high dimensional bit-vectors. All vectors have the same number of dimensions and use Hamming distance as measure if distance.

For each element in the first list I want to find the closest elements in the second list. Such a closest element may differ by several thousand bits from the element I’m searching for.

The naive approach would be computing the hamming distance for each pair of vectors, but that has runtime O(m*n) making it infeasible. So I’m looking for an algorithm that’s significantly faster.

Lets say I have d=10000, m=1 billion and n=100 billion and I want the algorithm to terminate in a couple of CPU days.


The elements in the first list are created by taking a random element from the second list and flipping each bit with the same probability p < 0.5. I want to support values of p that are as close as possible to 0.5. I’m fine with probabilistic algorithms that find matches with high probability.

7

Just an intuition. I haven’t gone through the details and may be embarrassingly mistaken.

As your vectors all have the same dimensionality d and their distance is defined as their hamming distance you can represent each vector as a point on a d-dimensional hyper cube. The path between two points will be the hamming distance of their coordinates.

All points in n are marked. This takes O(n) time. Then for each point in m you execute a breadth first search for any marked node. This will take O(m * (|V|+|E|)) time. Altogether O(n + m * (|V|+|E|)) time. However as the number of vertices and edges are derived from d and d is a constant we are left with O(n + m).

You could bring this constant factor down by applying a more efficient search algorithm. Eg: Efficiently find binary strings with low Hamming distance in large set

4

LEAVE A COMMENT