Get rid of shuffle/CartesianRDD from the execution plan – Spark Structured Streaming
I have the following problem:
There is a Spark Structured Streaming query that uses foreachBatch and runs custom Python code as Arrow-optimized Spark UDFs. The code is relatively complex. The general idea is: