Imputation

February 26, 2020

Changing the missing value imputation in vtreat

For this example, we will use the unsupervised treatment plan (designTreatmentsZ), but the same parameters can be used with the other treatment plans as well.

A simple data example

Here we create a simple data set where the inputs have missing values.

library(vtreat)
## Loading required package: wrapr
d = data.frame(
    "x" = c(0, 1, 1000, NA),
    "w" = c(3, 6, NA, 100),
    "y" = c(0, 0, 1, 1)
)

knitr::kable(d)
|    x |   w |   y |
|-----:|----:|----:|
|    0 |   3 |   0 |
|    1 |   6 |   0 |
| 1000 |  NA |   1 |
|   NA | 100 |   1 |

Here are some summary statistics of d. We're primarily interested in the inputs x and w.

summary(d)
##        x                w                y      
##  Min.   :   0.0   Min.   :  3.00   Min.   :0.0  
##  1st Qu.:   0.5   1st Qu.:  4.50   1st Qu.:0.0  
##  Median :   1.0   Median :  6.00   Median :0.5  
##  Mean   : 333.7   Mean   : 36.33   Mean   :0.5  
##  3rd Qu.: 500.5   3rd Qu.: 53.00   3rd Qu.:1.0  
##  Max.   :1000.0   Max.   :100.00   Max.   :1.0  
##  NA's   :1        NA's   :1

The default missing value imputation

By default, vtreat fills in missing values with the mean value of the column, and adds an advisory *_isBAD column to mark the location of the original missing values.

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE)
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
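|         x | x_isBAD |         w | w_isBAD |   y |
|----------:|--------:|----------:|--------:|----:|
|    0.0000 |       0 |   3.00000 |       0 |   0 |
|    1.0000 |       0 |   6.00000 |       0 |   0 |
| 1000.0000 |       0 |  36.33333 |       1 |   1 |
|  333.6667 |       1 | 100.00000 |       0 |   1 |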
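
To make the default behavior concrete, here is a minimal base R sketch that reproduces the table above by hand. It is an illustration only, not vtreat's implementation: impute_with_indicator is a hypothetical helper, and it uses an unweighted mean, which matches this example because no row weights were supplied.

```r
# Base R sketch of mean imputation plus a missingness indicator.
# impute_with_indicator is a hypothetical helper, not part of vtreat.
d <- data.frame(
  x = c(0, 1, 1000, NA),
  w = c(3, 6, NA, 100)
)

impute_with_indicator <- function(v) {
  is_bad <- is.na(v)               # remember where the values were missing
  fill <- mean(v[!is_bad])         # mean of the observed values only
  v[is_bad] <- fill                # fill in the missing positions
  list(values = v, isBAD = as.numeric(is_bad))
}

x_res <- impute_with_indicator(d$x)
w_res <- impute_with_indicator(d$w)

manual <- data.frame(
  x = x_res$values, x_isBAD = x_res$isBAD,
  w = w_res$values, w_isBAD = w_res$isBAD
)
print(manual)
```

The indicator columns matter because, after filling, an imputed 333.6667 is otherwise indistinguishable from a genuinely observed value.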

Changing the imputation strategy

If you do not want to use the mean to fill in missing values, you can change the imputation function using the parameter missingness_imputation. Here, we fill in missing values with the median.

median2 <- function(x, wts) {
  median(x)
}

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE,
                                missingness_imputation = median2)
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
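|    x | x_isBAD |   w | w_isBAD |   y |
|-----:|--------:|----:|--------:|----:|
|    0 |       0 |   3 |       0 |   0 |
|    1 |       0 |   6 |       0 |   0 |
| 1000 |       0 |   6 |       1 |   1 |
|    1 |       1 | 100 |       0 |   1 |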
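
The median2 function above ignores its second argument, wts. As a hedged sketch of an imputation function that does use the weights, here is a hypothetical weighted_median2 (not part of vtreat). It assumes, as the results above suggest, that the function receives only the non-missing values together with their matching weights. It returns the lower weighted median, so with equal weights and an even number of values it can differ from median().

```r
# Hypothetical weighted (lower) median with the same (x, wts) signature
# as median2. Assumes x holds non-missing values and wts their weights.
weighted_median2 <- function(x, wts) {
  ord <- order(x)                  # sort values, carrying weights along
  x <- x[ord]
  wts <- wts[ord]
  cum_wts <- cumsum(wts)
  # first value whose cumulative weight reaches half of the total weight
  x[which(cum_wts >= sum(wts) / 2)[1]]
}

equal_wts <- weighted_median2(c(0, 1, 1000), c(1, 1, 1))
heavy_top <- weighted_median2(c(0, 1, 1000), c(1, 1, 5))
print(c(equal_wts, heavy_top))
```

With equal weights this agrees with median2 on the example columns. With most of the weight on the largest value, the fill value moves to 1000.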

You can also use a constant value instead of a function. Here we replace missing values with the value -1.

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE,
                                missingness_imputation = -1)
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
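|    x | x_isBAD |   w | w_isBAD |   y |
|-----:|--------:|----:|--------:|----:|
|    0 |       0 |   3 |       0 |   0 |
|    1 |       0 |   6 |       0 |   0 |
| 1000 |       0 |  -1 |       1 |   1 |
|   -1 |       1 | 100 |       0 |   1 |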

Changing the imputation strategy per column

You can control the imputation strategy per column via the map imputation_map. Any column not named in the imputation map will use the imputation strategy specified by the missingness_imputation parameter (which is the mean by default).

Here we use the maximum value to fill in the missing values for x and the value 0 to fill in the missing values for w.

max2 <- function(x, wts) {
  max(x)
}

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE,
                                imputation_map = list(
                                  x = max2,
                                  w = 0
                                ))
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
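|    x | x_isBAD |   w | w_isBAD |   y |
|-----:|--------:|----:|--------:|----:|
|    0 |       0 |   3 |       0 |   0 |
|    1 |       0 |   6 |       0 |   0 |
| 1000 |       0 |   0 |       1 |   1 |
| 1000 |       1 | 100 |       0 |   1 |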

If a column is not named in imputation_map, vtreat falls back to missingness_imputation (in this case, -1).

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE,
                                missingness_imputation = -1,
                                imputation_map = list(
                                  x = max2
                                ))
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
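|    x | x_isBAD |   w | w_isBAD |   y |
|-----:|--------:|----:|--------:|----:|
|    0 |       0 |   3 |       0 |   0 |
|    1 |       0 |   6 |       0 |   0 |
| 1000 |       0 |  -1 |       1 |   1 |
| 1000 |       1 | 100 |       0 |   1 |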

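If missingness_imputation is not specified either, vtreat uses its default, a weighted mean, for any column not named in imputation_map. Here w is filled with its mean, 36.33333.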

treatments <- designTreatmentsZ(d, 
                                varlist = c('x', 'w'), 
                                verbose = FALSE,
                                imputation_map = list(
                                  x = max2
                                ))
d_treated <- prepare(treatments, 
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
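|    x | x_isBAD |         w | w_isBAD |   y |
|-----:|--------:|----------:|--------:|----:|
|    0 |       0 |   3.00000 |       0 |   0 |
|    1 |       0 |   6.00000 |       0 |   0 |
| 1000 |       0 |  36.33333 |       1 |   1 |
| 1000 |       1 | 100.00000 |       0 |   1 |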
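
The fallback order used in the last three examples can be summarized in a short base R sketch. This is an illustration of the rules described in this note, not vtreat's code: resolve_fill is a hypothetical helper, and it assumes that imputation functions receive only the non-missing values along with their weights (unit weights here).

```r
# Hypothetical helper that mimics the lookup order described above:
# 1. the column's entry in imputation_map, if any;
# 2. otherwise missingness_imputation, if given;
# 3. otherwise the (weighted) mean of the observed values.
max2 <- function(x, wts) {
  max(x)
}

resolve_fill <- function(col_name, values,
                         imputation_map = list(),
                         missingness_imputation = NULL) {
  strategy <- imputation_map[[col_name]]   # NULL if the column is not mapped
  if (is.null(strategy)) {
    strategy <- missingness_imputation
  }
  observed <- values[!is.na(values)]
  wts <- rep(1, length(observed))          # unit weights for this sketch
  if (is.null(strategy)) {
    return(stats::weighted.mean(observed, wts))
  }
  if (is.function(strategy)) {
    return(strategy(observed, wts))
  }
  strategy                                 # a constant fill value
}

x <- c(0, 1, 1000, NA)
w <- c(3, 6, NA, 100)

fill_x <- resolve_fill("x", x, list(x = max2), -1)     # from the map
fill_w <- resolve_fill("w", w, list(x = max2), -1)     # from the fallback
fill_w_default <- resolve_fill("w", w, list(x = max2)) # from the mean
print(c(fill_x, fill_w, fill_w_default))
```

The three fill values correspond to the 1000, -1, and 36.33333 entries in the tables above.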