Speeding up cosine similarity

I am trying to compute the similarity between many molecules and a single protein, using their molecular representations. The representations are obtained as follows:

  1. One comes from the rdkit library: mol = rdkit.Chem.MolFromSmiles(smiles)
  2. The other is read from a file.

Each mol representation is a list of size 1024.
There are 600,000 molecules but only a single protein.
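
As a side note on the representation: if the 1024 entries are 0/1 bits (an assumption, though it matches bit-vector fingerprints), cosine similarity reduces to counting shared bits: |A and B| / sqrt(|A| * |B|). Here is a minimal standard-library sketch of that identity. The names popcount and binary_cosine and the 8-bit example values are made up for illustration, and returning 0.0 for an all-zero fingerprint is a convention I chose, not something from the data.

```python
import math

def popcount(bits: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")

def binary_cosine(a: int, b: int) -> float:
    """Cosine similarity of two 0/1 vectors packed into Python ints."""
    na, nb = popcount(a), popcount(b)
    if na == 0 or nb == 0:
        # Assumed convention: an empty fingerprint is similar to nothing.
        return 0.0
    # For 0/1 vectors the dot product is the number of shared set bits,
    # and each squared norm is simply that vector's bit count.
    return popcount(a & b) / math.sqrt(na * nb)

# Hypothetical 8-bit fingerprints, written as bit strings for readability.
protein_bits = int("11001010", 2)
molecule_bits = int("10001110", 2)
print(binary_cosine(protein_bits, molecule_bits))  # 0.75
```

The protein's bit count only needs to be computed once, so each molecule then costs one AND and two bit counts.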

The cosine similarity calculation shown below is taking an excruciatingly long time, even after parallelizing it across the 4 available CPUs (pool.map() is slower, but I need to preserve the order of the results).

I tried using GPUs, but none of the processes seem to be optimized for GPU usage: I only see CPU activity and zero GPU utilization.

The code provided here only processes the first 10,000 molecules and still takes over 2 hours.

I am looking for suggestions on how to improve the processing speed.

from sklearn.metrics.pairwise import cosine_similarity
import rdkit
from rdkit import Chem

# Function to compute chemical similarity
def compute_similarity(mol1, mol2):
    if mol1 is not None and mol2 is not None:
        fp1 = Chem.RDKFingerprint(mol1)
        fp2 = Chem.RDKFingerprint(mol2)
        similarity = cosine_similarity([fp1], [fp2])[0][0]
        return similarity
    else:
        return None

def compute_similarity_wrapper(args):
    return compute_similarity(*args)

# The similarity calculation takes too long, so parallelize it
import multiprocessing

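# Work on the first 10,000 molecules only
# (molecule_mols and alb_mol are defined earlier, not shown here)
molecule_mols_subset = molecule_mols[:10_000]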

# Get the number of available processors
num_processors = multiprocessing.cpu_count()
print(f" {num_processors} CPUs ---- available for parallel processing")

# Use multiprocessing.Pool to parallelize the computation
# Use the "map" function to preserve the order after parallel computing.
with multiprocessing.Pool(processes=num_processors) as pool:
    sims = pool.map(compute_similarity_wrapper, [(mol, alb_mol) for mol in molecule_mols_subset])

sims[:10]
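
For comparison, a common way to handle a one-vs-many comparison like this is to build all the fingerprints once, stack them into a single array, and compute every similarity with one matrix-vector product instead of one cosine_similarity call per pair. The sketch below uses NumPy with synthetic random data standing in for the real fingerprints. The function name cosine_one_vs_many and the variables molecule_fps and protein_fp are hypothetical, and it assumes the fingerprints have already been converted to 0/1 arrays (rdkit's DataStructs.ConvertToNumpyArray is one way to do that).

```python
import numpy as np

def cosine_one_vs_many(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and every row of a matrix.

    Rows with zero norm get similarity 0.0 instead of NaN.
    """
    query = np.asarray(query, dtype=np.float32)
    matrix = np.asarray(matrix, dtype=np.float32)
    dots = matrix @ query  # one dot product per row, shape (n_rows,)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = np.zeros_like(dots)
    np.divide(dots, norms, out=sims, where=norms > 0)
    return sims

# Synthetic stand-in data: 10,000 random 1024-bit fingerprints.
rng = np.random.default_rng(0)
molecule_fps = rng.integers(0, 2, size=(10_000, 1024), dtype=np.uint8)
protein_fp = rng.integers(0, 2, size=1024, dtype=np.uint8)

sims = cosine_one_vs_many(protein_fp, molecule_fps)
print(sims.shape)  # one score per molecule, in input order
```

The result keeps the input order because row i of the matrix maps to entry i of the output. Molecules that failed to parse (None) would have to be filtered out or tracked by index before stacking, and for all 600,000 molecules the float32 copy is roughly 2.4 GB, so processing the matrix in chunks may be necessary.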
