I have a DataFrame with a country descriptor column, ds_pais, in English. I want to use GoogleTranslator to add a column via .withColumn that translates that country descriptor from English to Spanish.
from deep_translator import GoogleTranslator
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def translate(to_translate):
    return GoogleTranslator(source='en', target='es').translate(to_translate)
translate_udf = udf(lambda x: translate(x), StringType())
df_pais_traducido = df_campo_calculado.withColumn('ds_pais_es', translate_udf(df_campo_calculado.ds_pais))
display(df_pais_traducido.select('ds_pais', 'ds_pais_es'))
But when I run it, I get this error:
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 815, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 651, in read_udfs
udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 376, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 88, in read_command
command = serializer._read_with_length(file)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 174, in _read_with_length
return self.loads(obj)
File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 472, in loads
return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'deep_translator'
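From what I can tell, the error happens while deserializing the UDF on a worker, not on the driver. My rough understanding (which may be wrong) is that pickle/cloudpickle stores library functions as a module-plus-name reference, so unpickling on the worker re-imports deep_translator there. A stdlib-only sketch of that failure mode, using a made-up module name fake_translator to stand in for deep_translator:

```python
import pickle
import sys
import types

# Stand-in module playing the role of deep_translator on the driver
mod = types.ModuleType("fake_translator")
exec("def translate(text):\n    return text.upper()", mod.__dict__)
sys.modules["fake_translator"] = mod

# Pickling a module-level function stores a reference like
# "fake_translator.translate", not the function's code itself
payload = pickle.dumps(mod.translate)

# Simulate a worker where the package was never installed
del sys.modules["fake_translator"]

try:
    pickle.loads(payload)
    err = None
except ModuleNotFoundError as exc:
    err = exc

print(err)  # No module named 'fake_translator'
```

If that's the right mental model, then importing deep_translator in my notebook only makes it available on the driver, and the workers still can't import it.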
Why isn't the UDF finding GoogleTranslator?
EDIT: I’m running this on a Microsoft Fabric Notebook using PySpark (Python).