R targets: Best strategy for branching over larger-than-memory objects


I have a {targets} pipeline that downloads data for thousands of weather stations. To do this efficiently I'm using dynamic branching and {crew} for parallel work, with station id as the pattern.

Data for each station alone is quite small (~10k days). But after I download all the data, I need to run an analysis for each station, so I added a target that uses station id as the pattern and passes the filtered dataset to the analysis function. This crashes my machine (I have 32 GB of RAM). Each worker consumes large amounts of memory, as if every worker receives a complete copy of all the weather station data.

Is there a way to branch over a large object without loading it all into memory? I thought about using tar_group(), but that relies on dplyr::group_by() and my entire workflow uses data.table (see the sketch after the reprex).

Here is a reprex of what I'm trying to do:

library(targets)

tar_script({
  tar_option_set(
    controller = crew::crew_controller_local(workers = 2),
    memory = "transient",
    garbage_collection = TRUE,
    storage = "worker"
  )
  list(
    # One branch per station id.
    tar_target(ids, 1:100),
    # Download step: one data frame per station.
    tar_target(station_data,
               data.frame(id = ids,
                          val = runif(1000000)),
               pattern = map(ids)),
    # Analysis step: branches over ids, but station_data arrives as the
    # full aggregate of every branch, which is what blows up memory.
    tar_target(corrected_data,
               sort(station_data[station_data$id == ids, "val"]),
               pattern = map(ids))
  )
  },
  ask = FALSE
)

tar_make()
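
For reference, this is roughly the tar_group() route I was considering, and why I'd rather avoid it: it aggregates the per-station branches into one big grouped data frame and pulls dplyr::group_by() into a data.table workflow. It's only a sketch; the grouped_data target name is just illustrative.

tar_script({
  list(
    tar_target(ids, 1:100),
    tar_target(station_data,
               data.frame(id = ids, val = runif(1000000)),
               pattern = map(ids)),
    # Aggregate all branches, group by station, and tag the rows with
    # tar_group(). iteration = "group" tells targets to branch over groups.
    tar_target(grouped_data,
               targets::tar_group(dplyr::group_by(station_data, id)),
               iteration = "group"),
    # Each downstream branch then receives only one group's rows, but
    # grouped_data itself is still a single large in-memory object.
    tar_target(corrected_data,
               sort(grouped_data$val),
               pattern = map(grouped_data))
  )
  },
  ask = FALSE
)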
