
Tag Archive for python, apache-spark, pyspark, databricks

Read CSV Files into Spark DataFrame using For loop in PySpark

I am trying to import CSV files saved in an Azure data container into a Spark DataFrame using a for loop. I am running this code on Azure Databricks. The for loop runs without any error; however, I am unable to access the DataFrame by its name.
I am able to access the teams dataframe by using df_spark.show()
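A common cause here is building variable names as strings inside the loop, which never creates a real Python variable. The usual fix is to store each DataFrame in a dict keyed by file name. A minimal runnable sketch of that pattern follows; the file list and the `load` helper are assumptions standing in for `spark.read.csv(...)` so the pattern can be shown without a Spark session:

```python
# Hypothetical sketch: keep loop-created DataFrames in a dict instead of
# trying to create dynamically named variables. `load` is a placeholder
# for spark.read.csv(path, header=True) on Databricks.
def load(path):
    return f"DataFrame({path})"  # stand-in for a real Spark DataFrame

files = ["teams.csv", "players.csv", "matches.csv"]  # assumed file names
dfs = {}
for f in files:
    name = f.rsplit(".", 1)[0]  # strip the extension to get the key
    dfs[name] = load(f)

# Each frame is then reachable by name, e.g. dfs["teams"].show() in Spark.
print(dfs["teams"])
```

With real Spark code, only the `load` line changes; the dict access (`dfs["teams"]`) replaces the dynamic variable name that the loop could not create.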

How to optimally index & profile a Pyspark dataframe?

I have a very large dataset stored as multiple Parquet files (around 20,000 small files) which I am reading into a PySpark DataFrame. I want to add an index column to this DataFrame and then do some data-profiling and data-quality-check activities. I'm sharing a portion of the code.
I've tried both monotonically_increasing_id and zipWithIndex. I've seen in every forum that zipWithIndex is best for performance, but for me it's the other way around. Below are my benchmarks for indexing the table using both:
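The performance gap is plausible: `monotonically_increasing_id()` is computed locally per partition with no shuffle, while `zipWithIndex` requires a round-trip through the RDD API plus an extra job to count partition sizes, which is expensive with ~20,000 small files. The ids it produces are unique and increasing but not consecutive, because the partition index is packed into the upper 31 bits and the per-partition row counter into the lower 33 bits. A small sketch of how those values are composed:

```python
# Sketch of how Spark composes monotonically_increasing_id() values:
# partition index in the upper 31 bits, per-partition row counter in the
# lower 33 bits. Ids are unique and increasing, but gapped across partitions.
def monotonic_id(partition_index, row_in_partition):
    return (partition_index << 33) | row_in_partition

# Three rows in each of two partitions:
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
```

If consecutive indices are genuinely required (e.g. for joins on row position), `zipWithIndex` or a window-function `row_number()` is needed, but both force extra work that `monotonically_increasing_id()` avoids.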

Databricks Spark throwing [GC (Allocation Failure) ] message

I used this code to update a new_df. The idea is to get all the records between date_updated and stop time and assign them a number, which I will use in a group-by in the next steps. So, basically, assigning the same number to every group between date_updated and stop time.
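The "same number per interval" step is usually done with a running-sum trick: flag each row that starts a new interval, then take a cumulative sum of the flag so all rows up to the next marker share one group id. In PySpark that is typically `F.sum(flag).over(Window.orderBy(...))`; collecting everything to the driver instead is one common source of the GC pressure behind `[GC (Allocation Failure)]` messages. A minimal plain-Python sketch of the trick, with an assumed event sequence:

```python
# Hedged sketch of the running group-id idea: rows marking a new interval
# (here, "date_updated" events) increment a counter; every following row
# inherits the current counter until the next marker.
events = ["date_updated", "tick", "tick", "stop",
          "date_updated", "tick", "stop"]  # assumed sample sequence

group_id, groups = 0, []
for e in events:
    if e == "date_updated":  # start of a new interval
        group_id += 1
    groups.append(group_id)
```

Note that `[GC (Allocation Failure)]` itself is a routine JVM garbage-collection log line, not an error; it only signals a problem when GC pauses dominate runtime, which window functions over unpartitioned data can provoke on large inputs.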