
Tag Archive for apache-spark, pyspark, apache-spark-sql

How to drop records after date based on condition

I’m looking for an elegant way to drop, for each ID, all records in a DataFrame whose ‘TEST_DT’ falls before the latest record where ‘TEST_COMPONENT’ is ‘UNSATISFACTORY’.
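One way to read the requirement is as a per-ID cutoff date. The sketch below shows that logic in plain Python over a list of dict rows, so it runs without a Spark session. The sample rows are made up, and two behaviours are assumptions not stated in the question: the latest UNSATISFACTORY record itself is kept, and IDs with no UNSATISFACTORY record keep all their rows. In PySpark the same cutoff could likely be computed with `F.max(F.when(...))` over `Window.partitionBy("ID")`, followed by a filter.

```python
from datetime import date


def drop_before_latest_unsatisfactory(rows):
    """Keep, per ID, only rows dated on or after the latest UNSATISFACTORY row."""
    # Pass 1: latest TEST_DT per ID among UNSATISFACTORY rows.
    cutoff = {}
    for r in rows:
        if r["TEST_COMPONENT"] == "UNSATISFACTORY":
            prev = cutoff.get(r["ID"])
            if prev is None or r["TEST_DT"] > prev:
                cutoff[r["ID"]] = r["TEST_DT"]

    # Pass 2: filter. Assumption: IDs without any UNSATISFACTORY row are kept whole.
    return [
        r for r in rows
        if r["ID"] not in cutoff or r["TEST_DT"] >= cutoff[r["ID"]]
    ]


# Hypothetical sample data, for illustration only.
rows = [
    {"ID": 1, "TEST_DT": date(2023, 1, 1), "TEST_COMPONENT": "SATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2023, 2, 1), "TEST_COMPONENT": "UNSATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2023, 3, 1), "TEST_COMPONENT": "SATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2023, 4, 1), "TEST_COMPONENT": "UNSATISFACTORY"},
    {"ID": 1, "TEST_DT": date(2023, 5, 1), "TEST_COMPONENT": "SATISFACTORY"},
    {"ID": 2, "TEST_DT": date(2023, 1, 15), "TEST_COMPONENT": "SATISFACTORY"},
]

kept = drop_before_latest_unsatisfactory(rows)
for r in kept:
    print(r["ID"], r["TEST_DT"], r["TEST_COMPONENT"])
```

The two-pass shape mirrors what a window aggregate followed by a filter would do on a DataFrame: first derive one cutoff per ID, then compare every row against it.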

Executor distribution across nodes in a cluster

How are the executors of a Spark application distributed across the nodes of a cluster? Suppose Spark is running in cluster mode with YARN as the cluster manager. The cluster has 6 nodes, each with 16 cores and 64 GB of memory. With the following configuration, how are the executors distributed across the cluster:

Number of cores in an executor and OOM error

I have read some articles on OOM errors in the executors of a Spark application, and a number of them mention high concurrency as one of the possible causes. I am aware that concurrency is determined by the number of cores, which sets the maximum number of tasks that can run at the same time within an executor.
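To see why more cores per executor can raise OOM risk, here is a rough back-of-the-envelope sketch of how much execution memory each concurrent task can count on. It assumes Spark's unified memory manager with its defaults (about 300 MB reserved, `spark.memory.fraction` of 0.6) and the rule that each of N active tasks gets between 1/(2N) and 1/N of the pool. The 8 GB heap is a made-up figure, and the upper bound assumes storage is using none of the pool.

```python
RESERVED_MB = 300  # memory Spark sets aside before sizing the unified pool


def per_task_memory_bounds_mb(heap_mb, cores, memory_fraction=0.6):
    """Return (guaranteed, maximum) execution memory per task, in MB.

    Assumes all `cores` slots are busy, so N active tasks == cores, and that
    execution can use the whole unified pool (nothing cached in storage).
    """
    unified_pool = (heap_mb - RESERVED_MB) * memory_fraction
    return unified_pool / (2 * cores), unified_pool / cores


# Hypothetical executor heap of 8 GB, compared at different core counts.
for cores in (2, 4, 8):
    low, high = per_task_memory_bounds_mb(8192, cores)
    print(f"{cores} cores: {low:.0f} MB guaranteed, up to {high:.0f} MB per task")
```

With the heap held fixed, doubling the cores halves both bounds, which is the sense in which higher concurrency leaves each task with less room.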