SparkSubmitOperator for submitting PySpark applications in Airflow that load data from a volume mount

I am running Spark and Airflow as separate Docker containers. My Spark scripts and the data they load are mounted as a volume at the same path in both the Spark and Airflow containers. My Airflow Dockerfile installs the apache-airflow-providers-apache-spark package, which installs pyspark as a dependency, so I didn't download the Spark binaries.

Airflow is able to connect to the Spark containers and kick off a job using the SparkSubmitOperator. However, when the job loads data from the mounted volume, I get this error:

pyspark.errors.exceptions.captured.AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/usr/local/spark/resources/data/*.json.
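
The `file:` prefix in the message suggests the pattern is being resolved against a local filesystem, so one way to narrow this down is to test the same pattern inside every container that takes part in the job. Below is a minimal sketch using only the Python standard library; the helper name `check_local_glob` is hypothetical, and the pattern is copied from the error above. It only shows what the container it runs in can see, not how Spark itself resolves the path.

```python
import glob
import socket
from urllib.parse import urlparse


def check_local_glob(uri):
    """Return the local files matching a file: URI or a plain path pattern."""
    parsed = urlparse(uri)
    # "file:/usr/..." and "file:///usr/..." both leave the path in parsed.path;
    # anything without a file: scheme is treated as a plain local pattern.
    pattern = parsed.path if parsed.scheme == "file" else uri
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Pattern taken from the PATH_NOT_FOUND message (without the trailing period).
    pattern = "file:/usr/local/spark/resources/data/*.json"
    matches = check_local_glob(pattern)
    print(f"{socket.gethostname()}: {len(matches)} file(s) match {pattern}")
    for path in matches:
        print("  ", path)
```

Running it in the Airflow container and in each Spark container, as the same user that runs the job, shows which of them actually sees the JSON files at that path.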

What is the correct way to load data in Spark from a mounted volume when submitting the job via Airflow using the SparkSubmitOperator?

Any help would be appreciated. Thanks in advance! 🙂

I tried installing the Spark binaries, but that stopped Airflow from even connecting to Spark. I checked the Java and PySpark versions installed in Airflow and Spark to make sure they match, and I confirmed that the data exists in both the Airflow and Spark containers at the same path.
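
For reference, the version comparison can be reproduced with a small script run in each container. This is a sketch only, using the standard library; the function name `collect_versions` is hypothetical. It also reports the Python version, since PySpark expects the driver and the workers to use the same minor Python version.

```python
import subprocess
import sys
from importlib import metadata


def collect_versions():
    """Collect the Python, PySpark and Java versions visible in this container."""
    versions = {"python": sys.version.split()[0]}

    try:
        versions["pyspark"] = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        versions["pyspark"] = None  # pyspark is not installed for this interpreter

    try:
        # `java -version` writes its banner to stderr rather than stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
        banner = (result.stderr or result.stdout).strip()
        versions["java"] = banner.splitlines()[0] if banner else None
    except OSError:
        versions["java"] = None  # no usable java executable on PATH

    return versions


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Comparing the output from the Airflow container with the output from the Spark containers makes any mismatch visible at a glance.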
