Memory Leak in PySpark Driver Node in Databricks


I’m encountering a memory issue in my PySpark application on Databricks, where the memory usage on the driver node keeps increasing over time until the application eventually crashes with an out-of-memory (OOM) error. The problem occurs when I execute the following function in a loop for each image in a folder:

Python

def read_image_and_save_to_df(image_path, save_path):
    # Read image (tif)
    # Convert it to a PySpark DataFrame
    # Save to Parquet format

# For each image in the folder:
for image_path in image_folder:
    read_image_and_save_to_df(image_path, save_path)
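
For context, the body of the function is roughly the following. Pillow, the single-band (row, col, value) layout, and the column names are just stand-ins for my actual reader and schema, and spark is the session that Databricks provides in the notebook:

Python

from PIL import Image
import numpy as np

def read_image_and_save_to_df(image_path, save_path):
    # Read the single-band TIFF into a NumPy array (Pillow is a stand-in for my actual reader)
    img = np.array(Image.open(image_path))

    # Flatten the pixels into (row, col, value) records so they fit a tabular schema
    rows, cols = img.shape[0], img.shape[1]
    records = [(r, c, float(img[r, c])) for r in range(rows) for c in range(cols)]

    # Build the DataFrame on the driver and write it out as Parquet
    df = spark.createDataFrame(records, schema="row INT, col INT, value DOUBLE")
    df.write.mode("append").parquet(save_path)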

I’ve tried deleting objects at the end of the function, clearing the cache, and unpersisting RDDs on the SparkContext, but the driver memory usage still grows. It looks like a memory leak, but I’m not sure what is causing it. Any ideas on how to troubleshoot this issue?
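
The cleanup I run at the end of each iteration looks roughly like this (a sketch; df is the DataFrame created inside the function and spark is the notebook session):

Python

import gc

# At the end of read_image_and_save_to_df, after writing the Parquet output:
df.unpersist()               # drop any cached copy of this DataFrame
spark.catalog.clearCache()   # clear everything cached on the session
del df                       # release the Python reference on the driver
gc.collect()                 # force a driver-side garbage collection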

[Screenshot: memory usage before clearing cache at the end of the function]

[Screenshot: memory usage after clearing cache at the end of the function]
