Why does this padding function work alone but not as a PySpark UDF?

I have the following function:

import numpy as np

def pad(array, target_size=(224, 224), pad_value=255):
    # calculate padding
    pad_h = (target_size[0] - array.shape[0]) // 2
    pad_w = (target_size[1] - array.shape[1]) // 2

    # pad the image
    padded_array = np.pad(array, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)), mode='constant', constant_values=(pad_value, pad_value))

    # convert to bytes
    processed_array_bytes = padded_array.tobytes()

    return processed_array_bytes

def top_level(input_data):
    array = first_steps(input_data)
    final_array = pad(array)
    return final_array

When I pass the input directly to the top-level function, it works. However, it does not work as a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType
test_udf = udf(top_level, BinaryType())
result_df = df.withColumn('arrays', test_udf(df['column_to_use']))

Somehow, as a UDF this raises ValueError: index can't contain negative values. I've pinpointed the padding step as the place where the error occurs, but I have no idea what's going wrong or how to resolve it. Any help would be appreciated. Thanks in advance.
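
For reference, here is a minimal sketch that reproduces the same message outside Spark. It rests on an assumption that is not confirmed for the real data: that at least one row in the DataFrame yields an array larger than 224 pixels in height or width, which would make pad_h or pad_w negative. The arrays small and large are made-up test inputs, and this pad returns the padded array itself rather than bytes so the shape can be inspected.

```python
import numpy as np

def pad(array, target_size=(224, 224), pad_value=255):
    # Same arithmetic as in the question; negative when the image exceeds the target
    pad_h = (target_size[0] - array.shape[0]) // 2
    pad_w = (target_size[1] - array.shape[1]) // 2
    return np.pad(array, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)),
                  mode='constant', constant_values=pad_value)

# Hypothetical inputs: one image that fits the target, one that is too tall
small = np.zeros((200, 200, 3), dtype=np.uint8)
large = np.zeros((300, 200, 3), dtype=np.uint8)

small_shape = pad(small).shape  # padding is 12 on each side here

try:
    pad(large)  # pad_h is (224 - 300) // 2 = -38
    large_error = None
except ValueError as exc:
    large_error = str(exc)

print(small_shape)
print(large_error)
```

If this is the cause, the direct call probably worked because the single test input happened to fit within the target size, while the UDF runs over every row of the column. Logging array.shape inside top_level before the call to pad would confirm or rule this out.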
