Python awswrangler performance under large number of partitions
I need to store and fetch data using two hierarchy levels, date and class. When I upload data to S3 as part of the ETL pipeline, I'm using awswrangler's to_parquet function with partition_cols=["date", "class"]. To fetch data from the S3 bucket, I'm using the read_parquet function with partition_filter=filter_func, where filter_func is similar to lambda x: x["date"] in date_list and x["class"] in class_list.
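For reference, the setup described above can be sketched roughly as follows. This is a minimal sketch, not my exact pipeline: the helper names (make_partition_filter, write_partitioned, read_filtered) and the path argument are hypothetical, and it assumes dataset=True is passed on both calls, which awswrangler requires for partition_cols and partition_filter. The filter receives each partition as a dict of strings, so the lookup values are normalized to strings and kept in sets.

```python
from typing import Callable, Dict, Iterable


def make_partition_filter(
    dates: Iterable, classes: Iterable
) -> Callable[[Dict[str, str]], bool]:
    """Build a partition filter (hypothetical helper).

    awswrangler passes partition values as strings parsed from the S3 key,
    so the lookup values are converted to strings and stored in sets.
    """
    date_set = {str(d) for d in dates}
    class_set = {str(c) for c in classes}

    def filter_func(partition: Dict[str, str]) -> bool:
        # Keep a partition only if both hierarchy levels match.
        return partition["date"] in date_set and partition["class"] in class_set

    return filter_func


def write_partitioned(df, path: str) -> None:
    """Write df to S3 partitioned by date and class (path is a placeholder)."""
    import awswrangler as wr  # imported lazily so the filter can be tested offline

    wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,  # required for partition_cols
        partition_cols=["date", "class"],
    )


def read_filtered(path: str, dates: Iterable, classes: Iterable):
    """Read only the partitions whose date and class are in the given lists."""
    import awswrangler as wr

    return wr.s3.read_parquet(
        path=path,
        dataset=True,  # required for partition_filter
        partition_filter=make_partition_filter(dates, classes),
    )
```

Because the filter sees string values, a class_list of integers such as [1, 2] would never match with a plain "in" check against the raw list, which is why the sketch converts everything with str() first.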