Spark approximate N-nearest neighbor join using cosine similarity
I have two Spark DataFrames, A and B, with the same schema. Each contains text along with the text's embedding vector, precomputed with a model such as OpenAI ADA v2 or similar. Example: