Python pola.rs OOM on a 30 GB file


I need to aggregate raw time-series data into candlesticks with 1-minute granularity. I wrote this with python-polars, and it produces the following query plan:

 SELECT [col("parsed_timestamp"), col("label1"), col("open"), col("high"), col("low"), col("close")] FROM
  AGGREGATE
        [
           col("price").first().alias("open"), 
           col("price").max().alias("high"), 
           col("price").min().alias("low"), 
           col("price").last().alias("close")
        ] BY [col("label1")]
FROM
     WITH_COLUMNS:
     [col("timestamp").strict_cast(Datetime(Microseconds, None)).alias("parsed_timestamp")]
      Csv SCAN [ticker.csv]
      PROJECT */11 COLUMNS
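
A query along these lines produces that plan (a simplified sketch only: the expression that builds the label1 bucket key is not visible in the plan, so it is assumed to already exist in the CSV, and parsed_timestamp is carried through the aggregation with first() here just so the final select works):

    import polars as pl

    # Sketch reconstructed from the plan above; label1 is assumed to be a
    # precomputed 1-minute bucket key already present in ticker.csv.
    lf = (
        pl.scan_csv("ticker.csv")
        .with_columns(
            pl.col("timestamp")
            .cast(pl.Datetime("us"), strict=True)
            .alias("parsed_timestamp")
        )
        .group_by("label1")
        .agg(
            pl.col("parsed_timestamp").first(),
            pl.col("price").first().alias("open"),
            pl.col("price").max().alias("high"),
            pl.col("price").min().alias("low"),
            pl.col("price").last().alias("close"),
        )
        .select("parsed_timestamp", "label1", "open", "high", "low", "close")
    )

    candles = lf.collect(streaming=True)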

I use scan_csv and collect(streaming=True). However, this code still uses 30 GB of memory when aggregating the 30 GB file. Are there any optimizations I can do? The data is sorted by timestamp in ascending order.
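
Since the data is already sorted, one variant I have in mind (untested at the full 30 GB) is to declare the sort order and bucket directly on the parsed timestamp with group_by_dynamic instead of a precomputed label1 key, in the hope that finished 1-minute windows can be flushed early:

    import polars as pl

    # Sketch of the sorted/windowed variant; column names match the plan above.
    candles = (
        pl.scan_csv("ticker.csv")
        .with_columns(
            pl.col("timestamp")
            .cast(pl.Datetime("us"), strict=True)
            .alias("parsed_timestamp")
        )
        .set_sorted("parsed_timestamp")          # data is already ascending
        .group_by_dynamic("parsed_timestamp", every="1m")
        .agg(
            pl.col("price").first().alias("open"),
            pl.col("price").max().alias("high"),
            pl.col("price").min().alias("low"),
            pl.col("price").last().alias("close"),
        )
        .collect(streaming=True)
    )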

I made a simple program that just loops over each row, updates candlesticks in an in-memory map, and dumps them to stdout/a file once they are finished. It doesn't use any additional memory at all. I assume something like this should be possible with Polars as well? A minimal sketch of that loop is below.
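
For illustration, a minimal sketch of that loop (assuming the CSV has integer microsecond timestamps in a "timestamp" column and prices in a "price" column, matching the cast in the plan, and that rows arrive sorted):

    import csv
    from datetime import datetime, timezone

    candles = {}  # minute bucket (epoch microseconds) -> [open, high, low, close]

    with open("ticker.csv", newline="") as f:
        for row in csv.DictReader(f):
            bucket = int(row["timestamp"]) // 60_000_000 * 60_000_000  # floor to 1 minute
            price = float(row["price"])
            candle = candles.get(bucket)
            if candle is None:
                # a new minute started; because the file is sorted, any older
                # bucket is finished and can be dumped immediately
                for done in sorted(k for k in candles if k < bucket):
                    o, h, l, c = candles.pop(done)
                    print(datetime.fromtimestamp(done / 1e6, tz=timezone.utc), o, h, l, c)
                candles[bucket] = [price, price, price, price]
            else:
                candle[1] = max(candle[1], price)
                candle[2] = min(candle[2], price)
                candle[3] = price

    # flush whatever is still open at end of file
    for bucket, (o, h, l, c) in sorted(candles.items()):
        print(datetime.fromtimestamp(bucket / 1e6, tz=timezone.utc), o, h, l, c)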

