Remove rows if a certain value is reached, and recalculate

I have a dataset with GPS points and I want to remove points that are within a 2-hour period. Here’s a sample of the dataset:

       gps_data_animals_id    acquisition_time
348179              348179 2015-09-18 00:00:00
348180              348180 2015-09-18 01:45:00
348181              348181 2015-09-18 02:00:00
348182              348182 2015-09-18 02:15:00
348183              348183 2015-09-18 02:30:00
348184              348184 2015-09-18 04:30:00
348185              348185 2015-09-18 04:45:00
348186              348186 2015-09-18 05:00:00
348187              348187 2015-09-18 06:00:00
348188              348188 2015-09-18 12:00:00
348189              348189 2015-09-18 17:15:00
348190              348190 2015-09-18 17:30:00
348191              348191 2015-09-18 17:45:00
348192              348192 2015-09-18 18:00:00
348193              348193 2015-09-18 18:15:00
348194              348194 2015-09-18 18:30:00
348195              348195 2015-09-18 18:45:00
348196              348196 2015-09-19 00:00:00
348197              348197 2015-09-19 06:01:00
348198              348198 2015-09-19 11:15:00

And I want locations separated in time by at least 2h, so this would be the filtered dataset:

       gps_data_animals_id    acquisition_time
348179              348179 2015-09-18 00:00:00
348181              348181 2015-09-18 02:00:00
348184              348184 2015-09-18 04:30:00
348188              348188 2015-09-18 12:00:00
348189              348189 2015-09-18 17:15:00
348196              348196 2015-09-19 00:00:00
348197              348197 2015-09-19 06:01:00
348198              348198 2015-09-19 11:15:00

I’ve been playing a bit with the lag() function as it seems to do more or less what I need, but I end up removing more than I want. This is what I have done so far:

dataset$time_diff <- unlist(tapply(dataset$acquisition_time, INDEX = dataset$animals_id,
                                 FUN = function(x) c(0, `units<-`(diff(x), "hours"))))

And then I would remove those values of time_diff less than 2h, but that ends up removing more than I want because it would also remove e.g. gps_data_animals_id = 348181, which I want to keep as it has the 2h interval with the first location.
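To make the over-removal concrete, here is a small sketch on the first six timestamps of the sample. The variable names (`times`, `gap_prev`, `naive_keep`) are illustrative only and are not part of the original code.

```r
# First six timestamps from the sample, in GMT
times <- as.POSIXct(c("2015-09-18 00:00:00", "2015-09-18 01:45:00",
                      "2015-09-18 02:00:00", "2015-09-18 02:15:00",
                      "2015-09-18 02:30:00", "2015-09-18 04:30:00"),
                    tz = "GMT")

# Gap to the immediately preceding row, in hours (the first row gets 0)
gap_prev <- c(0, as.numeric(diff(times), units = "hours"))

# Naive rule: keep the first row plus every row at least 2 h after its predecessor
naive_keep <- seq_along(times) == 1 | gap_prev >= 2

format(times[naive_keep], "%H:%M")
#> [1] "00:00" "04:30"
```

The 02:00 row is dropped because it is only 15 minutes after the 01:45 row, even though 01:45 itself is discarded and 02:00 is exactly 2h after the first location.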

What I think could work: take the first two rows, calculate the time difference, and remove the second row if the difference is less than 2h. Then take the first two remaining rows again and repeat the process. But I’m not sure how to do that, code-wise.
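The sequential idea described above could be sketched in base R roughly as follows. This is only a sketch: `thin_by_gap` and its `min_gap_hours` argument are hypothetical names, and it assumes a single animal whose rows are already sorted by `acquisition_time`.

```r
# Hypothetical helper: returns TRUE for rows at least `min_gap_hours`
# after the most recently kept row (the first row is always kept)
thin_by_gap <- function(times, min_gap_hours = 2) {
  keep <- logical(length(times))
  if (length(times) == 0) return(keep)
  keep[1] <- TRUE
  last_kept <- times[1]
  for (i in seq_along(times)[-1]) {
    gap <- as.numeric(difftime(times[i], last_kept, units = "hours"))
    if (gap >= min_gap_hours) {
      keep[i] <- TRUE
      last_kept <- times[i]  # later rows are now compared against this one
    }
  }
  keep
}

# Sample data rebuilt from the reproducible example (seconds since epoch, GMT)
dataset <- data.frame(
  gps_data_animals_id = 348179:348198,
  acquisition_time = as.POSIXct(
    c(1442534400, 1442540700, 1442541600, 1442542500, 1442543400,
      1442550600, 1442551500, 1442552400, 1442556000, 1442577600,
      1442596500, 1442597400, 1442598300, 1442599200, 1442600100,
      1442601000, 1442601900, 1442620800, 1442642460, 1442661300),
    origin = "1970-01-01", tz = "GMT")
)

filtered <- dataset[thin_by_gap(dataset$acquisition_time), ]
filtered$gps_data_animals_id
#> [1] 348179 348181 348184 348188 348189 348196 348197 348198
```

The key difference from a plain `diff()` is that each row is compared with the last row that was kept, not with the row directly before it.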

Any thoughts?

Here’s the reproducible example of the dataset:

structure(list(gps_data_animals_id = 348179:348198, acquisition_time = structure(c(1442534400, 
1442540700, 1442541600, 1442542500, 1442543400, 1442550600, 1442551500, 
1442552400, 1442556000, 1442577600, 1442596500, 1442597400, 1442598300, 
1442599200, 1442600100, 1442601000, 1442601900, 1442620800, 1442642460, 
1442661300), class = c("POSIXct", "POSIXt"), tzone = "GMT")), row.names = 348179:348198, class = "data.frame")

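library(dplyr)
library(purrr)

# df1 is the data frame from the question, sorted by acquisition_time
df1 %>% 
  # Gaps in minutes; the leading 120 makes the first row pass the filter
  filter(accumulate(c(120, as.numeric(diff(acquisition_time), units = "mins")), 
                    # Restart the running total once a row has been kept
                    ~ if (.x >= 120) .y else .x + .y) >= 120)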

#>   gps_data_animals_id    acquisition_time
#> 1              348179 2015-09-18 00:00:00
#> 2              348181 2015-09-18 02:00:00
#> 3              348184 2015-09-18 04:30:00
#> 4              348188 2015-09-18 12:00:00
#> 5              348189 2015-09-18 17:15:00
#> 6              348196 2015-09-19 00:00:00
#> 7              348197 2015-09-19 06:01:00
#> 8              348198 2015-09-19 11:15:00

Created on 2024-07-11 with reprex v2.0.2
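The running-total idea can also be written without purrr, using base R's `Reduce()` with `accumulate = TRUE`. The sketch below assumes a single animal with rows sorted by time. The names `secs`, `gaps`, `elapsed` and `kept` are illustrative, and the rule used here is stated in the comments: add up gaps until 120 minutes are reached, then start counting again from the next gap.

```r
# Timestamps from the question's reproducible example (seconds since epoch, GMT)
secs <- c(1442534400, 1442540700, 1442541600, 1442542500, 1442543400,
          1442550600, 1442551500, 1442552400, 1442556000, 1442577600,
          1442596500, 1442597400, 1442598300, 1442599200, 1442600100,
          1442601000, 1442601900, 1442620800, 1442642460, 1442661300)
times <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")

# Gaps between consecutive rows in minutes; the leading 120 lets row 1 pass
gaps <- c(120, as.numeric(diff(times), units = "mins"))

# Minutes elapsed since the last kept row: once the total has reached 120
# (that row is kept), the count restarts from the next gap
elapsed <- Reduce(function(acc, gap) if (acc >= 120) gap else acc + gap,
                  gaps, accumulate = TRUE)

kept <- times[elapsed >= 120]
length(kept)
#> [1] 8
```

Asking `diff()` for minutes explicitly via `as.numeric(..., units = "mins")` avoids relying on the unit that `diff()` picks automatically for the data at hand.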
