I need to aggregate raw time-series data into candlesticks with 1-minute granularity. I built a query with python-polars which produces this plan:
SELECT [col("parsed_timestamp"), col("label1"), col("open"), col("high"), col("low"), col("close")] FROM
AGGREGATE
[
col("price").first().alias("open"),
col("price").max().alias("high"),
col("price").min().alias("low"),
col("price").last().alias("close")
] BY [col("label1")]
FROM
WITH_COLUMNS:
[col("timestamp").strict_cast(Datetime(Microseconds, None)).alias("parsed_timestamp")]
Csv SCAN [ticker.csv]
PROJECT */11 COLUMNS
I use scan_csv and collect(streaming=True). However, this code still uses 30 GB of memory when aggregating a 30 GB file. Are there any optimizations I can do? The data is sorted by timestamp in ascending order.
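For reference, a simplified reconstruction of the query from the plan above (the 1-minute bucketing via group_by_dynamic is an assumption on my part, since the plan only shows the per-label aggregation; exact column names come from the plan):

```python
import polars as pl

lf = (
    pl.scan_csv("ticker.csv")
    .with_columns(
        pl.col("timestamp").cast(pl.Datetime("us")).alias("parsed_timestamp")
    )
    # the file is already ordered by timestamp, so flag it as sorted
    .set_sorted("parsed_timestamp")
    # 1-minute buckets per label (on older polars the keyword is `by=` instead of `group_by=`)
    .group_by_dynamic("parsed_timestamp", every="1m", group_by="label1")
    .agg(
        pl.col("price").first().alias("open"),
        pl.col("price").max().alias("high"),
        pl.col("price").min().alias("low"),
        pl.col("price").last().alias("close"),
    )
)

df = lf.collect(streaming=True)
```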
I made a simple program that just loops over each row, updates candlesticks in an in-memory map, and dumps each one to stdout/a file once it is finished. It doesn't use any additional memory at all. I assume something like this should be possible with polars as well?
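The baseline program is roughly this (a minimal sketch; column names and the ISO timestamp format are assumed):

```python
import csv

# One in-progress candle per label; emit a candle as soon as its minute closes.
candles = {}  # label -> [minute, open, high, low, close]

with open("ticker.csv", newline="") as f:
    for row in csv.DictReader(f):
        label = row["label1"]
        price = float(row["price"])
        minute = row["timestamp"][:16]  # truncate "YYYY-MM-DDTHH:MM..." to the minute

        cur = candles.get(label)
        if cur is None or cur[0] != minute:
            if cur is not None:
                print(label, *cur)  # previous candle for this label is finished, dump it
            candles[label] = [minute, price, price, price, price]
        else:
            cur[2] = max(cur[2], price)  # high
            cur[3] = min(cur[3], price)  # low
            cur[4] = price               # close

# flush candles still open at end of file
for label, cur in candles.items():
    print(label, *cur)
```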