Tag Archive for pyspark / apache-spark-sql

PySpark SQL: understanding append and overwrite without a primary key

I have a notebook that reads many parquet files under /source/Year=x/Month=y/Day=z, partitioned by the year, month, and day on which each file was loaded (this is important: it is NOT the date of any field in the data itself). There is a recordTimestamp field that is generally the day before the load date, since /source is regenerated elsewhere each day from an extract. But recordTimestamp can actually be anything, such as a year-old record that was previously omitted for whatever reason.
