Apache Spark Structured Streaming "file does not exist" error while writing
I’m developing a clickstream project that collects user events and stores them in HDFS.
You can see the project architecture in the diagram:
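For reference, a minimal sketch of such a write path is below; the broker address, topic name, and HDFS paths are assumptions, not taken from the project. "File does not exist" failures during writes are often traced to a checkpoint or output directory that was deleted, moved, or shared between queries.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("clickstream-to-hdfs")
  .getOrCreate()

// Assumed Kafka source for the user events.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "clickstream")               // hypothetical topic
  .load()

val query = events
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  // Checkpoint and output directories must both live on a fault-tolerant
  // file system (HDFS here), be unique to this query, and never be
  // cleaned up while the query runs.
  .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
  .option("path", "hdfs:///data/clickstream")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

query.awaitTermination()
```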
As is well known, Spark Structured Streaming supports joining a stream with static data, but how do you refresh the static data when it changes?
In my Spark Structured Streaming application, the static data is updated once per day, but after the application starts, the static data held in the application's memory is never refreshed.
This causes the results from the second day onward to be incorrect.
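One common workaround (a sketch under assumptions, not the only option) is to drive the write through foreachBatch and re-read the static side on every micro-batch, so the daily refresh is picked up. The paths, the placeholder parsing, and the user_id join key below are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()

stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Re-read the daily-refreshed dimension data for every micro-batch,
    // instead of reusing the snapshot captured when the query started.
    val dim = spark.read.parquet("hdfs:///data/daily_dimension") // hypothetical path
    batch
      .selectExpr("CAST(value AS STRING) AS user_id") // placeholder parsing
      .join(dim, Seq("user_id"))                      // hypothetical join key
      .write
      .mode("append")
      .parquet("hdfs:///data/enriched")
  }
  .option("checkpointLocation", "hdfs:///checkpoints/enrich")
  .start()
  .awaitTermination()
```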
Spark Arbitrary Stateful Operations – how does Spark manage the timeout?
I’m looking at this example and also tried to apply it to my use case, but I couldn’t understand how Spark uses hasTimedOut
to manage the data in memory. For example, what happens if I handle a session by its key and then stop receiving events for that specific key? That means I’m no longer going to enter the mergeSessions
function for this key, and consequently will never evict the events related to it, so they will remain in memory.
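As far as I understand the API, you don't need a new event for the key: when a timeout is configured and fires, Spark invokes the state function for that key with an empty iterator and hasTimedOut set to true, which is exactly the hook for evicting its state (in micro-batch mode the check happens when a trigger runs, so eviction can lag until the next batch). A sketch, with hypothetical Event/SessionState types standing in for the ones in the linked example:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical types for illustration.
case class Event(sessionId: String, timestamp: java.sql.Timestamp)
case class SessionState(count: Long)
case class SessionOutput(sessionId: String, count: Long, expired: Boolean)

def mergeSessions(
    sessionId: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): Iterator[SessionOutput] = {
  if (state.hasTimedOut) {
    // Spark calls the function for this key with an EMPTY events iterator
    // once the timeout fires, even though no new events arrived for it.
    val finished = state.get
    state.remove() // evicts this key's state from the state store
    Iterator(SessionOutput(sessionId, finished.count, expired = true))
  } else {
    val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    // Re-arm the timeout; it can fire in a later micro-batch.
    state.setTimeoutDuration("30 minutes")
    Iterator.empty
  }
}

// Usage sketch, given eventsDs: Dataset[Event]:
// eventsDs.groupByKey(_.sessionId)
//   .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(mergeSessions)
```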
Spark Structured Streaming window
I want to process all of the data in a 60-second window, but I found data that belongs to the previous window. How can I avoid this?
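This is usually handled with a watermark: late rows older than the watermark are dropped instead of reopening an earlier window, and in append mode a window is emitted only once, after it can no longer change. A sketch, using the built-in rate source as a stand-in for the real input:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().master("local[*]").appName("windows").getOrCreate()

// Rate source as a stand-in; it provides a `timestamp` column.
val events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

val windowed = events
  .withWatermark("timestamp", "10 seconds") // tolerate 10s lateness, then drop
  .groupBy(window(col("timestamp"), "60 seconds"))
  .count()

windowed.writeStream
  .outputMode("append") // a window is emitted once the watermark passes its end
  .format("console")
  .option("checkpointLocation", "/tmp/chk-window") // hypothetical path
  .start()
  .awaitTermination()
```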
OOM Issue in Streaming with foreachBatch()
I have a stateless streaming application that uses foreachBatch. This function executes between 10 and 400 times per hour, depending on custom logic.
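Two patterns worth checking in foreachBatch bodies are caching the batch without releasing it and accumulating references on the driver; over hundreds of invocations per hour either one grows steadily. A defensive sketch (the source and sink paths are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()

// Placeholder source standing in for the real stateless stream.
val stream = spark.readStream.format("rate").option("rowsPerSecond", "100").load()

stream.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Cache only if the batch is reused by several actions,
    // and always release it before returning.
    batch.persist(StorageLevel.MEMORY_AND_DISK)
    try {
      batch.write.mode("append").parquet("hdfs:///out/events") // placeholder sink
    } finally {
      batch.unpersist() // avoid memory accumulating across 10-400 calls/hour
    }
  }
  .option("checkpointLocation", "hdfs:///checkpoints/foreachbatch")
  .start()
  .awaitTermination()
```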
(Why) does Spark Structured Streaming recompile the code for each mini-batch?
I have a Spark Structured Streaming job that reads from Kafka, parses Avro, explodes a column, computes some extra columns as simple combinations (sum/product/division) of existing columns, and writes the result to a Delta table. There are no windows or state, and I am not using foreachBatch.
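For concreteness, a sketch of the shape of such a job; the topic, Avro schema, and column names are assumptions, from_avro requires the spark-avro package, and the Delta sink requires delta-spark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.functions.from_avro
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().getOrCreate()

// Hypothetical Avro schema for the Kafka payload.
val avroSchema =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"items","type":{"type":"array","items":"double"}},
    |  {"name":"price","type":"double"},
    |  {"name":"qty","type":"double"}]}""".stripMargin

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
  .option("subscribe", "events")                    // hypothetical topic
  .load()
  .select(from_avro(col("value"), avroSchema).as("e"))
  .select(explode(col("e.items")).as("item"), col("e.price"), col("e.qty"))
  .withColumn("total", col("price") * col("qty")) // simple derived column

parsed.writeStream
  .format("delta")
  .option("checkpointLocation", "hdfs:///checkpoints/to-delta")
  .start("hdfs:///delta/events")
  .awaitTermination()
```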
Structured Streaming join plus aggregation
Spark version: 3.5.1, local mode
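For context, Spark 3.5 can chain a stream-stream join with a downstream windowed aggregation when both inputs carry watermarks. A sketch with illustrative sources, column names, and time bounds:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, window}

val spark = SparkSession.builder().master("local[*]").appName("join-agg").getOrCreate()

// Rate sources standing in for the real inputs.
val impressions = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .select(col("timestamp").as("impTime"), col("value").as("adId"))
  .withWatermark("impTime", "10 seconds")

val clicks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .select(col("timestamp").as("clickTime"), col("value").as("adId"))
  .withWatermark("clickTime", "10 seconds")

// Time-interval stream-stream join, then a windowed aggregation on top.
val joined = impressions.as("i").join(
  clicks.as("c"),
  expr("i.adId = c.adId AND c.clickTime BETWEEN i.impTime AND i.impTime + interval 30 seconds"))

val counts = joined
  .groupBy(window(col("impTime"), "1 minute"), col("i.adId"))
  .count()

counts.writeStream
  .outputMode("append")
  .format("console")
  .option("checkpointLocation", "/tmp/chk-join-agg") // hypothetical path
  .start()
  .awaitTermination()
```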
Does the unbounded table contain old data in Spark Structured Streaming?
I’m using Spark Structured Streaming and I don’t understand some of its mechanics. Does the unbounded table keep old data in memory? If so, is there any way to delete old data from the unbounded table? And when using append output mode, does the old data in the unbounded table still exist?
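As I understand the model, the unbounded table is logical: physically Spark retains only the state its stateful operators need (for example per-window aggregates), and a watermark tells it when that state can be purged; in append mode each result row is emitted once, after it can no longer change. A sketch with an assumed rate source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().master("local[*]").appName("unbounded").getOrCreate()

val counts = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .withWatermark("timestamp", "1 minute") // state older than this can be purged
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

counts.writeStream
  .outputMode("append") // each window is emitted once, after it is finalized
  .format("console")
  .option("checkpointLocation", "/tmp/chk-unbounded") // hypothetical path
  .start()
  .awaitTermination()
```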