PySpark UDF raises “No module named” error

I have a DataFrame with a country descriptor column, ds_pais, in English. I want to use GoogleTranslator in a UDF to add a column via .withColumn that translates that country descriptor from English to Spanish.

from deep_translator import GoogleTranslator
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def translate(to_translate):
    # translate a single English string to Spanish
    return GoogleTranslator(source='en', target='es').translate(to_translate)

translate_udf = udf(lambda x: translate(x), StringType())

df_pais_traducido = df_campo_calculado.withColumn('ds_pais_es', translate_udf(df_campo_calculado.ds_pais))

display(df_pais_traducido.select('ds_pais', 'ds_pais_es'))

But when I run it, I get this error:

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 815, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 651, in read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 376, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 88, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 472, in loads
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'deep_translator'

Why isn’t the UDF finding GoogleTranslator, when the import at the top of the notebook runs without error?
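
From the stack trace, the import seems to succeed on the driver, but the Python worker fails while cloudpickle deserializes the UDF, which suggests the deep_translator package isn't installed in the executors' environment. To confirm that, I guess I could run a trivial task on the executors and check whether the module is importable there (just a sketch, assuming the usual spark session handle is available):

import importlib.util

def has_deep_translator(_):
    # this runs inside an executor's Python worker, not on the driver
    return importlib.util.find_spec("deep_translator") is not None

# each partition element is evaluated on an executor
print(spark.range(4).rdd.map(has_deep_translator).collect())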

EDIT: I’m running this in a Microsoft Fabric notebook using PySpark (Python).
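
If it helps: my understanding (an assumption based on Fabric's inline-installation behavior, not something I've verified) is that !pip install only puts a package on the driver node, while a session-scoped install should make it available to the executors as well:

%pip install deep-translator

(the PyPI package is deep-translator, imported as deep_translator), after which the UDF cell would need to be re-run in the same session.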
