I have a {targets} pipeline that downloads data for thousands of weather stations. To do this efficiently I'm using dynamic branching with crew for parallel work, with the station ID as the pattern.
The data for each station alone is quite small (about 10k days). But after downloading all of the data, I need to run an analysis for each station, so I added a target that uses the station ID as the pattern and passes the filtered dataset to the analysis function. This crashes my machine (32 GB of RAM): each worker consumes a large amount of memory, as if every worker were receiving a complete copy of the data for all of the weather stations.
Is there a way to branch over a large object without loading all of it into memory? I thought about using tar_group(), but that relies on dplyr::group_by(), and my entire workflow uses data.table.
…
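For reference, this is roughly the tar_group() approach I'm trying to avoid. It's a minimal sketch: `all_station_data` and `analyze()` are placeholder names for the combined dataset and the per-station analysis function, and it depends on dplyr for the grouping:

```r
library(targets)
library(dplyr)

list(
  # tar_group() requires a dplyr::group_by() grouping and
  # iteration = "group" so downstream patterns branch per group.
  tar_target(
    grouped_data,
    all_station_data |> group_by(id) |> tar_group(),
    iteration = "group"
  ),
  # Each branch receives only one group (one station) of grouped_data.
  tar_target(
    per_station_analysis,
    analyze(grouped_data),
    pattern = map(grouped_data)
  )
)
```

This would branch one group at a time, but it pulls dplyr into an otherwise data.table-only workflow.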
Here is a reprex of what I'm trying to do:
library(targets)
tar_script({
  library(crew)
  tar_option_set(
    # Two local workers; keep memory transient and store results on the workers.
    controller = crew_controller_local(workers = 2),
    memory = "transient",
    garbage_collection = TRUE,
    storage = "worker"
  )
  list(
    tar_target(ids, 1:100),
    # One branch per station ID, each with its own (modestly sized) data frame.
    tar_target(
      station_data,
      data.frame(id = ids, val = runif(1e6)),
      pattern = map(ids)
    ),
    # This target maps over ids but filters the aggregated station_data,
    # so each worker appears to receive the full combined dataset.
    tar_target(
      corrected_data,
      sort(station_data[station_data$id == ids, "val"]),
      pattern = map(ids)
    )
  )
}, ask = FALSE)
tar_make()