Optimizing a complex PySpark join
I have a complex join that I'm trying to optimize.
df1 has cols id, main_key, col1, col1_isnull, col2, col2_isnull, …, col30
df2 has cols id, main_key, col1, col2, …, col_30
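The excerpt doesn't say how large each side is or how the join is keyed, so the sketch below only illustrates two generic options, assuming main_key is the join key and hypothetical Parquet paths for the inputs: broadcasting the smaller side, or pre-partitioning both sides on the key.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

# Hypothetical sources; substitute the real ones.
df1 = spark.read.parquet("/data/df1")   # id, main_key, col1, col1_isnull, ...
df2 = spark.read.parquet("/data/df2")   # id, main_key, col1, col2, ...

# Option A: if df2 comfortably fits in executor memory, broadcasting it
# removes the shuffle of df1 entirely.
joined_broadcast = df1.join(F.broadcast(df2), on="main_key", how="left")

# Option B: otherwise, repartition both sides on the join key so matching
# keys land in the same tasks; tune the partition count to the cluster.
n_parts = 400
joined_shuffled = (
    df1.repartition(n_parts, "main_key")
       .join(df2.repartition(n_parts, "main_key"), on="main_key", how="left")
)
```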
Spark job spilling data vs OOM
I am using Spark SQL to run SQL jobs with 10G of executor memory.
When I monitor the job in the Spark UI, I can see that data is being spilled to memory and disk (expected, since I am doing some explode operations).
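Some spill during wide operations such as explode is normal; whether it is acceptable depends on how much is spilled. As a hedged illustration only (the job actually runs through Spark SQL, and the right values depend on data volume and cores), these are the settings commonly tuned to trade spill against OOM risk:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on the cluster.
spark = (
    SparkSession.builder
    .appName("spill-tuning-sketch")
    .config("spark.executor.memory", "10g")
    # More shuffle partitions -> less data per task -> less spill per task.
    .config("spark.sql.shuffle.partitions", "800")
    # Fraction of the heap available to execution + storage memory.
    .config("spark.memory.fraction", "0.7")
    .getOrCreate()
)
```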
PySpark Window functions: Aggregation differs if WindowSpec has sorting
I am working through this example of aggregation functions for PySpark Window.
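The usual explanation for the difference: adding a sort to the WindowSpec changes the default frame from the whole partition to everything up to the current row, so aggregates become running aggregates. A small self-contained illustration with made-up data:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-frame-sketch").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)],
    ["grp", "val"],
)

# No ORDER BY: the default frame is the entire partition,
# so every row in a group gets the same group total.
w_plain = Window.partitionBy("grp")

# With ORDER BY: the default frame is unbounded preceding .. current row,
# so the same sum() becomes a running total.
w_sorted = Window.partitionBy("grp").orderBy("val")

df.select(
    "grp", "val",
    F.sum("val").over(w_plain).alias("group_total"),
    F.sum("val").over(w_sorted).alias("running_total"),
).show()
```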
In Spark physical planning, what kind of optimization is happening?
I have a quote from a blog on physical planning in Spark, and I want to understand what it means.
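The quoted blog passage is not reproduced in the excerpt, but physical planning is the stage where Spark picks concrete operators (for example hash aggregation, sort-merge versus broadcast joins, and exchanges). One way to see it for yourself is to print the plans for a query; a minimal sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

agg = (
    spark.range(1_000_000)
         .withColumn("bucket", F.col("id") % 10)
         .groupBy("bucket")
         .count()
)

# Prints the parsed, analyzed, and optimized logical plans followed by the
# physical plan, where operator choices (HashAggregate, Exchange, ...) appear.
agg.explain(True)
```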
Max of a table partition column vs Max of result of show partitions of the same table
Suppose I have a table that contains customer orders, partitioned (only) by date_transaction. I want to find the maximum value of date_transaction.
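Assuming a Hive-style partitioned table named orders (the name is illustrative), the two approaches being compared would look roughly like this: the first aggregates the partition column through an ordinary query, the second parses the metastore's partition list.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("partition-max-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Approach 1: aggregate the partition column through a normal query.
max_from_query = spark.sql(
    "SELECT max(date_transaction) AS m FROM orders"
).collect()[0]["m"]

# Approach 2: list the partitions from the metastore and take the max of the
# values parsed out of strings like 'date_transaction=2021-01-01'.
max_from_partitions = (
    spark.sql("SHOW PARTITIONS orders")
         .select(F.regexp_extract("partition", r"date_transaction=(.*)", 1).alias("d"))
         .agg(F.max("d").alias("m"))
         .collect()[0]["m"]
)
```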
Value of another column in the same row as my last lag value
I have a time-series dataset and am looking to make a new column that represents the last reported (non-null) value. I think I have this part figured out, using a combination of lag and last.
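A common way to carry the companion column along with the last non-null value is to pack both into a struct that is null whenever the value is null, then use last(..., ignorenulls=True) over an ordered window. A sketch with made-up column names (ts for ordering, value, and other for the companion column):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("last-nonnull-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0, "a"), (2, None, "b"), (3, None, "c"), (4, 40.0, "d")],
    ["ts", "value", "other"],
)

w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Pack value and its companion column into a struct that is null whenever
# value is null; last(..., ignorenulls=True) then carries the whole row forward.
packed = F.when(F.col("value").isNotNull(), F.struct("value", "other"))

result = (
    df.withColumn("last_reported", F.last(packed, ignorenulls=True).over(w))
      .withColumn("last_value", F.col("last_reported.value"))
      .withColumn("other_at_last_value", F.col("last_reported.other"))
      .drop("last_reported")
)
result.show()
```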
Spark weird broadcasting size
I made a very weird discovery today.
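The excerpt gives no details, but a frequent source of surprise with broadcast sizes is that the figure reported in the Spark UI reflects the deserialized in-memory representation, which can be much larger than the source files, and that the automatic broadcast decision is governed by spark.sql.autoBroadcastJoinThreshold. A hedged sketch for inspecting this (made-up data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-size-sketch").getOrCreate()

# Tables below this estimated size are broadcast automatically (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

small = spark.range(1_000).withColumnRenamed("id", "key")
big = spark.range(10_000_000).withColumn("key", F.col("id") % 1_000)

# Force the broadcast and check the physical plan for BroadcastHashJoin;
# the SQL tab of the UI then shows the size of the broadcast exchange.
big.join(F.broadcast(small), "key").explain()
```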
ShutdownHookManager Error in Spark with Custom Temporary Directory Configuration
I am encountering a ShutdownHookManager error when running Spark with a custom temporary directory configuration. My directory structure and configuration details are as follows:
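The directory structure and configuration are cut off in the excerpt, so the snippet below is not the author's setup; it is only a generic sketch of how a custom temporary directory is usually wired up (hypothetical path /data/spark-tmp), since ShutdownHookManager messages at shutdown typically come from Spark failing to delete these scratch directories.

```python
from pyspark.sql import SparkSession

# Hypothetical path: a directory the Spark user owns and that nothing else
# (e.g. an OS tmp cleaner) deletes while the job is running.
tmp_dir = "/data/spark-tmp"

spark = (
    SparkSession.builder
    .appName("custom-tmpdir-sketch")
    # Scratch space for shuffle and spill files.
    .config("spark.local.dir", tmp_dir)
    .getOrCreate()
)

# java.io.tmpdir usually has to be set at submit time, not in the session:
#   spark-submit \
#     --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/data/spark-tmp" \
#     --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/data/spark-tmp" \
#     job.py
```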