
Tag Archive for python, apache-spark, pyspark, databricks

Read CSV Files into Spark DataFrame using For loop in PySpark

I am trying to import CSV files saved in an Azure data container into a Spark DataFrame using a for loop. I am running this code on Azure Databricks. The for loop runs without any error; however, I am unable to access the DataFrame by its name.
I am able to access the teams dataframe by using df_spark.show()
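A common cause here is building variable names as strings inside the loop, which never creates a real Python variable. The usual fix is to store each DataFrame in a dict keyed by file name. A minimal runnable sketch of that pattern follows; the file list and the `load` helper are assumptions standing in for `spark.read.csv(...)` so the pattern can be shown without a Spark session:

```python
# Hypothetical sketch: keep loop-created DataFrames in a dict instead of
# trying to create dynamically named variables. `load` is a placeholder
# for spark.read.csv(path, header=True) on Databricks.
def load(path):
    return f"DataFrame({path})"  # stand-in for a real Spark DataFrame

files = ["teams.csv", "players.csv", "matches.csv"]  # assumed file names
dfs = {}
for f in files:
    name = f.rsplit(".", 1)[0]  # strip the extension to get the key
    dfs[name] = load(f)

# Each frame is then reachable by name, e.g. dfs["teams"].show() in Spark.
print(dfs["teams"])
```

With real Spark code, only the `load` line changes; the dict access (`dfs["teams"]`) replaces the dynamic variable name that the loop could not create.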

How to optimally index & profile a Pyspark dataframe?

I have a very large dataset stored as multiple Parquet files (around 20,000 small files) which I am reading into a PySpark DataFrame. I want to add an index column to this DataFrame and then do some data-profiling and data-quality-check activities. I'm sharing a portion of the code.
I've tried both monotonically_increasing_id and zipWithIndex. I've seen in every forum that zipWithIndex is best for performance, but for me it's the other way around. Below are my benchmarks for indexing the table using both:
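The performance gap is plausible: `monotonically_increasing_id()` is computed locally per partition with no shuffle, while `zipWithIndex` requires a round-trip through the RDD API plus an extra job to count partition sizes, which is expensive with ~20,000 small files. The ids it produces are unique and increasing but not consecutive, because the partition index is packed into the upper 31 bits and the per-partition row counter into the lower 33 bits. A small sketch of how those values are composed:

```python
# Sketch of how Spark composes monotonically_increasing_id() values:
# partition index in the upper 31 bits, per-partition row counter in the
# lower 33 bits. Ids are unique and increasing, but gapped across partitions.
def monotonic_id(partition_index, row_in_partition):
    return (partition_index << 33) | row_in_partition

# Three rows in each of two partitions:
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
```

If consecutive indices are genuinely required (e.g. for joins on row position), `zipWithIndex` or a window-function `row_number()` is needed, but both force extra work that `monotonically_increasing_id()` avoids.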

Databricks Spark throwing [GC (Allocation Failure) ] message

I used this code to update a new_df. The idea is to get all the records between date_updated and stop time and assign them a number, which I will use in a group-by in the next steps. So, basically, assigning the same number to every group between date_updated and stop time.
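The "same number per interval" step is usually done with a running-sum trick: flag each row that starts a new interval, then take a cumulative sum of the flag so all rows up to the next marker share one group id. In PySpark that is typically `F.sum(flag).over(Window.orderBy(...))`; collecting everything to the driver instead is one common source of the GC pressure behind `[GC (Allocation Failure)]` messages. A minimal plain-Python sketch of the trick, with an assumed event sequence:

```python
# Hedged sketch of the running group-id idea: rows marking a new interval
# (here, "date_updated" events) increment a counter; every following row
# inherits the current counter until the next marker.
events = ["date_updated", "tick", "tick", "stop",
          "date_updated", "tick", "stop"]  # assumed sample sequence

group_id, groups = 0, []
for e in events:
    if e == "date_updated":  # start of a new interval
        group_id += 1
    groups.append(group_id)
```

Note that `[GC (Allocation Failure)]` itself is a routine JVM garbage-collection log line, not an error; it only signals a problem when GC pauses dominate runtime, which window functions over unpartitioned data can provoke on large inputs.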