Optimizing a complex PySpark join
I have a complex join that I'm trying to optimize.
df1 has cols id, main_key, col1, col1_isnull, col2, col2_isnull, …, col30
df2 has cols id, main_key, col1, col2, …, col_30
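The excerpt doesn't say how large each side is or how the join is keyed, so the sketch below only illustrates two generic options, assuming main_key is the join key and hypothetical Parquet paths for the inputs: broadcasting the smaller side, or pre-partitioning both sides on the key.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-sketch").getOrCreate()

# Hypothetical sources; substitute the real ones.
df1 = spark.read.parquet("/data/df1")   # id, main_key, col1, col1_isnull, ...
df2 = spark.read.parquet("/data/df2")   # id, main_key, col1, col2, ...

# Option A: if df2 comfortably fits in executor memory, broadcasting it
# removes the shuffle of df1 entirely.
joined_broadcast = df1.join(F.broadcast(df2), on="main_key", how="left")

# Option B: otherwise, repartition both sides on the join key so matching
# keys land in the same tasks; tune the partition count to the cluster.
n_parts = 400
joined_shuffled = (
    df1.repartition(n_parts, "main_key")
       .join(df2.repartition(n_parts, "main_key"), on="main_key", how="left")
)
```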
Spark job spilling data vs OOM
I am using Spark SQL to run SQL jobs with 10G of executor memory.
When I monitor the job in the Spark UI, I can see that data is being spilled to memory and disk (expected, since I am doing some explode operations).
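Some spill during wide operations such as explode is normal; whether it is acceptable depends on how much is spilled. As a hedged illustration only (the job actually runs through Spark SQL, and the right values depend on data volume and cores), these are the settings commonly tuned to trade spill against OOM risk:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on the cluster.
spark = (
    SparkSession.builder
    .appName("spill-tuning-sketch")
    .config("spark.executor.memory", "10g")
    # More shuffle partitions -> less data per task -> less spill per task.
    .config("spark.sql.shuffle.partitions", "800")
    # Fraction of the heap available to execution + storage memory.
    .config("spark.memory.fraction", "0.7")
    .getOrCreate()
)
```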
PySpark Window functions: Aggregation differs if WindowSpec has sorting
I am working through this example of aggregation functions for PySpark Window.
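The usual explanation for the difference: adding a sort to the WindowSpec changes the default frame from the whole partition to everything up to the current row, so aggregates become running aggregates. A small self-contained illustration with made-up data:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-frame-sketch").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)],
    ["grp", "val"],
)

# No ORDER BY: the default frame is the entire partition,
# so every row in a group gets the same group total.
w_plain = Window.partitionBy("grp")

# With ORDER BY: the default frame is unbounded preceding .. current row,
# so the same sum() becomes a running total.
w_sorted = Window.partitionBy("grp").orderBy("val")

df.select(
    "grp", "val",
    F.sum("val").over(w_plain).alias("group_total"),
    F.sum("val").over(w_sorted).alias("running_total"),
).show()
```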
In Spark physical planning, what kind of optimization is happening?
I have a quote from a blog on physical planning in Spark, and I want to understand what it means.
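The quoted blog passage is not reproduced in the excerpt, but physical planning is the stage where Spark picks concrete operators (for example hash aggregation, sort-merge versus broadcast joins, and exchanges). One way to see it for yourself is to print the plans for a query; a minimal sketch:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

agg = (
    spark.range(1_000_000)
         .withColumn("bucket", F.col("id") % 10)
         .groupBy("bucket")
         .count()
)

# Prints the parsed, analyzed, and optimized logical plans followed by the
# physical plan, where operator choices (HashAggregate, Exchange, ...) appear.
agg.explain(True)
```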
Max of a table partition column vs Max of result of show partitions of the same table
Suppose I have a table that contains customer orders, partitioned (only) by date_transaction. I want to find the maximum value of date_transaction.
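Assuming a Hive-style partitioned table named orders (the name is illustrative), the two approaches being compared would look roughly like this: the first aggregates the partition column through an ordinary query, the second parses the metastore's partition list.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("partition-max-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Approach 1: aggregate the partition column through a normal query.
max_from_query = spark.sql(
    "SELECT max(date_transaction) AS m FROM orders"
).collect()[0]["m"]

# Approach 2: list the partitions from the metastore and take the max of the
# values parsed out of strings like 'date_transaction=2021-01-01'.
max_from_partitions = (
    spark.sql("SHOW PARTITIONS orders")
         .select(F.regexp_extract("partition", r"date_transaction=(.*)", 1).alias("d"))
         .agg(F.max("d").alias("m"))
         .collect()[0]["m"]
)
```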
Value of another column in the same row as my last lag value
I have a time-series dataset and am looking to make a new column that represents the last reported (non-null) value. I think I have this part figured out, using a combination of lag and last.
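A common way to carry the companion column along with the last non-null value is to pack both into a struct that is null whenever the value is null, then use last(..., ignorenulls=True) over an ordered window. A sketch with made-up column names (ts for ordering, value, and other for the companion column):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("last-nonnull-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0, "a"), (2, None, "b"), (3, None, "c"), (4, 40.0, "d")],
    ["ts", "value", "other"],
)

w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Pack value and its companion column into a struct that is null whenever
# value is null; last(..., ignorenulls=True) then carries the whole row forward.
packed = F.when(F.col("value").isNotNull(), F.struct("value", "other"))

result = (
    df.withColumn("last_reported", F.last(packed, ignorenulls=True).over(w))
      .withColumn("last_value", F.col("last_reported.value"))
      .withColumn("other_at_last_value", F.col("last_reported.other"))
      .drop("last_reported")
)
result.show()
```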
Spark weird broadcasting size
I made a very weird discovery today.
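The excerpt gives no details, but a frequent source of surprise with broadcast sizes is that the figure reported in the Spark UI reflects the deserialized in-memory representation, which can be much larger than the source files, and that the automatic broadcast decision is governed by spark.sql.autoBroadcastJoinThreshold. A hedged sketch for inspecting this (made-up data):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-size-sketch").getOrCreate()

# Tables below this estimated size are broadcast automatically (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

small = spark.range(1_000).withColumnRenamed("id", "key")
big = spark.range(10_000_000).withColumn("key", F.col("id") % 1_000)

# Force the broadcast and check the physical plan for BroadcastHashJoin;
# the SQL tab of the UI then shows the size of the broadcast exchange.
big.join(F.broadcast(small), "key").explain()
```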
ShutdownHookManager Error in Spark with Custom Temporary Directory Configuration
I am encountering a ShutdownHookManager error when running Spark with a custom temporary directory configuration. My directory structure and configuration details are as follows:
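The directory structure and configuration are cut off in the excerpt, so the snippet below is not the author's setup; it is only a generic sketch of how a custom temporary directory is usually wired up (hypothetical path /data/spark-tmp), since ShutdownHookManager messages at shutdown typically come from Spark failing to delete these scratch directories.

```python
from pyspark.sql import SparkSession

# Hypothetical path: a directory the Spark user owns and that nothing else
# (e.g. an OS tmp cleaner) deletes while the job is running.
tmp_dir = "/data/spark-tmp"

spark = (
    SparkSession.builder
    .appName("custom-tmpdir-sketch")
    # Scratch space for shuffle and spill files.
    .config("spark.local.dir", tmp_dir)
    .getOrCreate()
)

# java.io.tmpdir usually has to be set at submit time, not in the session:
#   spark-submit \
#     --conf "spark.driver.extraJavaOptions=-Djava.io.tmpdir=/data/spark-tmp" \
#     --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=/data/spark-tmp" \
#     job.py
```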