Why does this padding function work alone but not as a PySpark UDF?


I have the following function:

import numpy as np

def pad(array, target_size=(224, 224), pad_value=255):
    # calculate padding
    pad_h = (target_size[0] - array.shape[0]) // 2
    pad_w = (target_size[1] - array.shape[1]) // 2

    # pad the image
    padded_array = np.pad(array, ((pad_h, pad_h), (pad_w, pad_w), (0, 0)), mode='constant', constant_values=(pad_value, pad_value))

    # convert to bytes
    processed_array_bytes = padded_array.tobytes()

    return processed_array_bytes

def top_level(input_data):
    array = first_steps(input_data)
    final_array = pad(array)
    return final_array

When I pass the input directly to the top-level function, it works. However, it does not work as a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import BinaryType

test_udf = udf(top_level, BinaryType())
result_df = df.withColumn('arrays', test_udf(df['column_to_use']))

Somehow, as a UDF this raises ValueError: index can't contain negative values. I’ve pinpointed the padding step as the source of the error, but I have no idea what is going wrong or how to resolve it. Any help would be appreciated. Thanks in advance.
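
For reference, np.pad raises this exact error whenever a pad width is negative, which happens in pad() as soon as an array is larger than target_size along either axis. One possible explanation (an assumption, not something I have confirmed) is that the single input I tested by hand is smaller than 224x224, while some rows in the DataFrame column are larger. The sketch below reproduces the error with NumPy alone and shows a guarded variant; pad_safe and the sample arrays are hypothetical names used only for illustration.

```python
import numpy as np

def pad_safe(array, target_size=(224, 224), pad_value=255):
    # Hypothetical variant of pad(): clamp the total padding at zero so an
    # oversized input never produces a negative pad width.
    total_h = max(target_size[0] - array.shape[0], 0)
    total_w = max(target_size[1] - array.shape[1], 0)

    # Split the padding so that odd differences still reach the target size.
    top, left = total_h // 2, total_w // 2
    bottom, right = total_h - top, total_w - left

    return np.pad(
        array,
        ((top, bottom), (left, right), (0, 0)),
        mode='constant',
        constant_values=pad_value,
    )

# Made-up sample inputs: one smaller than the target, one taller than it.
small = np.zeros((100, 101, 3), dtype=np.uint8)
large = np.zeros((300, 200, 3), dtype=np.uint8)

# The original arithmetic gives (224 - 300) // 2 == -38 for the tall image,
# and np.pad rejects negative widths.
try:
    np.pad(large, ((-38, -38), (12, 12), (0, 0)),
           mode='constant', constant_values=255)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
print(pad_safe(small).shape, pad_safe(large).shape)
```

Note that pad_safe leaves an oversized axis at its original length rather than cropping or resizing it, so the byte length returned by tobytes() would still vary per row. Whether oversized images should be cropped, resized, or rejected depends on the pipeline.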
