How to drop records after date based on condition
I’m looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of ‘TEST_COMPONENT’ being ‘UNSATISFACTORY’, based on their ‘TEST_DT’ value for each ID.
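The rule described above can be sketched in plain Python on a list of dicts, which keeps the logic easy to check. This is a minimal illustration, not the asker's code. It assumes that the UNSATISFACTORY record itself is kept and that IDs with no UNSATISFACTORY record keep all their rows. The function name and the sample rows are hypothetical. In Spark, the same per-ID cutoff would typically be computed with a window aggregate partitioned by ID.

```python
from datetime import date


def drop_before_latest_unsatisfactory(rows):
    """Keep rows dated on or after each ID's latest UNSATISFACTORY TEST_DT."""
    cutoff = {}
    for r in rows:
        if r["TEST_COMPONENT"] == "UNSATISFACTORY":
            prev = cutoff.get(r["ID"])
            if prev is None or r["TEST_DT"] > prev:
                cutoff[r["ID"]] = r["TEST_DT"]
    # Assumption: an ID with no UNSATISFACTORY row has no cutoff, so all its rows stay.
    return [
        r for r in rows
        if r["ID"] not in cutoff or r["TEST_DT"] >= cutoff[r["ID"]]
    ]


# Hypothetical sample data, for illustration only.
rows = [
    {"ID": 1, "TEST_DT": date(2024, 1, 1), "TEST_COMPONENT": "SATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2024, 2, 1), "TEST_COMPONENT": "UNSATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2024, 3, 1), "TEST_COMPONENT": "SATISFACTORY"},
    {"ID": 2, "TEST_DT": date(2024, 1, 15), "TEST_COMPONENT": "SATISFACTORY"},
]
kept = drop_before_latest_unsatisfactory(rows)
```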
Difference between spark sql insert overwrite table and spark dataframe partition by overwrite mode
I have been testing Spark jobs with Hive tables backed by Google buckets.
What does shuffle intermediate buffer on the Map side mean?
I am trying to understand Spark memory management and I came across this blog. In it, the author mentions one of the uses of Execution memory in Spark:
Pyspark with liquid clustering
I have an existing dataframe stored in Azure storage. How can I enable liquid clustering on it?
collect_set and size – on medium size data overrunning
I have a dataframe containing columns: device_id, country, language, channel, genre, and a few other attributes. The data is partitioned by year, month, day and hour.
Spark Catalog doesn’t see the database that I created
I have been learning Spark (3.5.0) and I tried out the following exercise:
How to partition data in Spark when reading data from a MySQL table with string type primary key
I’m reading data from a MySQL table in Spark. The table structure may look like:
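The table definition is not reproduced here, so the following is only a sketch of one possible approach. It buckets rows by a hash of the string key and passes one predicate per bucket to the JDBC reader. The column name `id`, the helper `build_hash_predicates`, and the choice of MySQL's `CRC32` as the hash are assumptions made for illustration.

```python
def build_hash_predicates(key_column, num_partitions):
    """Return one WHERE-clause fragment per partition, bucketing a string key by hash."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # CRC32 and MOD are MySQL functions; every non-NULL key lands in exactly one bucket.
    return [
        f"MOD(CRC32({key_column}), {num_partitions}) = {i}"
        for i in range(num_partitions)
    ]


# Hypothetical key column name; replace with the table's actual primary key.
predicates = build_hash_predicates("id", 4)

# Possible use with PySpark (jdbc_url, the table name and props are placeholders):
# df = spark.read.jdbc(url=jdbc_url, table="my_table",
#                      predicates=predicates, properties=props)
```

Each predicate becomes one Spark partition. Because the hash wraps the key column, MySQL will probably not be able to use the primary-key index for these filters, so each partition may scan the whole table. Range predicates on the key itself are an alternative when the key distribution is known.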
EMR Spark shuffle FetchFailedException with 65tb data with AQE enabled
I am getting a Spark shuffle FetchFailedException while running a Spark job on EMR with 65 TB of input data. The job is a Spark SQL query that aggregates metrics over Parquet files on S3, reading 30 days' worth of data.
Executor distribution across nodes in a cluster
How are executors of a Spark application distributed across the nodes of a cluster? Let’s say Spark is running in Cluster mode with YARN as the manager. The cluster is said to have 6 nodes, 16 cores each and 64GB mem. With the following configuration, how are the executors distributed across the cluster:
Number of cores in an executor and OOM error
I have read some articles on OOM errors in the executors of a Spark application, and a number of them mention high concurrency as one of the possible reasons. I am aware that the concurrency is determined by the number of cores, which determines the maximum number of tasks that can run within an executor.