I want to create a new column “X11” that sums up all the 1’s conditionally based on how many NA’s there are within a select number of columns. In this case, I am looking at 4 variables: X1, X2, X3, and X4.
Ex: if there is 1 NA, then I want to look at the remaining 3 variables that have values and count how many 1’s there are. If there are 2 NA’s, then I want to look at the remaining 2 variables and count how many 1’s there are. If I have 3 NA’s, then I want to look at the remaining 1 variable and determine if it is a 1. If I have all 4 NA’s, then this would give me 0.
I have this data:
df <- data.frame(replicate(10,sample(0:2, 10, rep=TRUE)))
df <- replace(df, df == 0, NA)
My data frame looks like this:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 NA 1 NA NA NA 1 1 2
2 NA 1 1 NA 2 NA 2 2 NA 1
3 1 NA 1 1 NA NA 1 2 NA 1
4 2 2 2 1 1 2 1 NA 2 2
5 NA 2 NA 2 NA 2 1 NA 1 1
6 2 2 1 1 2 NA 1 2 1 1
7 1 2 NA NA 2 1 1 NA NA 1
8 2 2 NA NA 1 NA NA 2 NA 1
9 1 NA 1 2 2 1 2 NA NA 1
10 NA 2 1 NA NA NA NA 2 2 NA
I want my output to look like this:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 1 1 NA 1 NA NA NA 1 1 2 3
2 NA 1 1 NA 2 NA 2 2 NA 1 2
3 1 NA 1 1 NA NA 1 2 NA 1 3
4 2 2 2 1 1 2 1 NA 2 2 1
5 NA 2 NA 2 NA 2 1 NA 1 1 0
6 2 2 1 1 2 NA 1 2 1 1 2
7 1 2 NA NA 2 1 1 NA NA 1 1
8 2 2 NA NA 1 NA NA 2 NA 1 0
9 1 NA 1 2 2 1 2 NA NA 1 2
10 NA 2 1 NA NA NA NA 2 2 NA 1
Here is an example of my current code:
vars <- c("X1", "X2", "X3", "X4")
df <- df %>%
mutate(missing_vars = rowSums(across(vars, ~is.na(.))),
nonmissing_vars = 7-vars)
df <- df %>%
mutate(zero_na = case_when(missing_vars == 0 & (X1 == 2 & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
(missing_vars == 0 & (X1 == 1 & X2 == 2 & X3 == 2 & X4 == 2) |
(X1 == 2 & X2 == 1 & X3 == 2 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 1 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 2 & X4 == 1)) ~ 2,
(missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 2 & X4 == 2) |
(X1 == 1 & X2 == 2 & X3 == 1 & X4 == 2) |
(X1 == 1 & X2 == 2 & X3 == 2 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 1 & X4 == 2) |
(X1 == 2 & X2 == 2 & X3 == 1 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 2 & X4 == 1)) ~ 3,
(missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 2) |
(X1 == 1 & X2 == 1 & X3 == 2 & X4 == 1) |
(X1 == 1 & X2 == 2 & X3 == 1 & X4 == 1) |
(X1 == 2 & X2 == 1 & X3 == 1 & X4 == 1)) ~ 4,
missing_vars == 0 & (X1 == 1 & X2 == 1 & X3 == 1 & X4 == 1) ~ 5))
brfss <- brfss %>%
mutate(one_na = case_when(missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 2) ~ 1,
missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 1,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 2 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 2) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 2 & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 2 & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 2 & is.na(X3) & X4 == 1) ~ 2,
missing_vars == 1 & (X1 == 2 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 2,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 2) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 2 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 2 & X3 == 1 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & is.na(X2) & X3 == 1 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & X2 == 1 & is.na(X3) & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 2 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 2 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 2 & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 2 & is.na(X3) & X4 == 1) ~ 3,
missing_vars == 1 & (X1 == 1 & X2 == 2 & X3 == 1 & is.na(X4)) ~ 3,
missing_vars == 1 & (is.na(X1) & X2 == 1 & X3 == 1 & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & is.na(X2) & X3 == 1 & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & X2 == 1 & is.na(X3) & X4 == 1) ~ 4,
missing_vars == 1 & (X1 == 1 & X2 == 1 & X3 == 1 & is.na(X4)) ~ 4))
I repeat this with each combination for 2 NA’s, 3 NA’s and then 4 NA’s and then sum “zero_na”, “one_na”, etc. to get the final count of the value under X11.
However, I currently have about 300,000 observations and need to do this across 7 different variables with varying amounts of NA’s, 1’s and 2’s. This will be a ridiculous amount of combinations I will have to write and I just wanted to know if there is a more efficient way of writing this code?
Thank you so much in advance!
1