Imputation
February 26, 2020
Changing the missing value imputation in vtreat
For this example, we will use an unsupervised treatment plan (designTreatmentsZ), but the same parameters can be used with the other treatment plans as well.
A simple data example
Here we create a simple data set where the inputs have missing values.
library(vtreat)
## Loading required package: wrapr
d = data.frame(
  "x" = c(0, 1, 1000, NA),
  "w" = c(3, 6, NA, 100),
  "y" = c(0, 0, 1, 1)
)
knitr::kable(d)
| x | w | y |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 6 | 0 |
| 1000 | NA | 1 |
| NA | 100 | 1 |
Here are some summary statistics of d. We're primarily interested in the inputs x and w.
summary(d)
## x w y
## Min. : 0.0 Min. : 3.00 Min. :0.0
## 1st Qu.: 0.5 1st Qu.: 4.50 1st Qu.:0.0
## Median : 1.0 Median : 6.00 Median :0.5
## Mean : 333.7 Mean : 36.33 Mean :0.5
## 3rd Qu.: 500.5 3rd Qu.: 53.00 3rd Qu.:1.0
## Max. :1000.0 Max. :100.00 Max. :1.0
## NA's :1 NA's :1
The default missing value imputation
By default, vtreat fills in missing values with the mean value of the column, and adds an advisory *_isBAD column to mark the location of the original missing values.
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE)
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0.0000 | 0 | 3.00000 | 0 | 0 |
| 1.0000 | 0 | 6.00000 | 0 | 0 |
| 1000.0000 | 0 | 36.33333 | 1 | 1 |
| 333.6667 | 1 | 100.00000 | 0 | 1 |
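As a sanity check, the default result can be reproduced by hand in base R. This is only a sketch of the arithmetic, not how vtreat implements it: the name manual is illustrative, and the code assumes every row has equal weight, so the plain mean of the observed values is used.

```r
# Same inputs as the example data (y is not needed here).
d <- data.frame(
  x = c(0, 1, 1000, NA),
  w = c(3, 6, NA, 100)
)

# Fill each NA with the mean of the observed values and record
# where the NAs were in a 0/1 indicator column.
manual <- data.frame(
  x       = ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x),
  x_isBAD = as.numeric(is.na(d$x)),
  w       = ifelse(is.na(d$w), mean(d$w, na.rm = TRUE), d$w),
  w_isBAD = as.numeric(is.na(d$w))
)

print(manual)
```

The filled-in values are (0 + 1 + 1000) / 3 for x and (3 + 6 + 100) / 3 for w, which agree with the table above.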
Changing the imputation strategy
If you do not want to use the mean to fill in missing values, you can change the imputation function using the parameter missingness_imputation. The function takes two arguments, the column values and a vector of weights. Here, we fill in missing values with the median, ignoring the weights.
# Imputation function: receives the values and their weights (unused here).
median2 <- function(x, wts) {
  median(x)
}
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE,
                                missingness_imputation = median2)
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | 6 | 1 | 1 |
| 1 | 1 | 100 | 0 | 1 |
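The wts argument makes it possible to take row weights into account, which median2 does not do. As a rough sketch, assuming (as the results above suggest) that the function is called with only the non-missing values and their matching weights, a weighted median might look like the following. The name weighted_median2 is hypothetical and not part of vtreat. It returns the lower weighted median, so with equal weights and an even number of values it returns the lower of the two middle values rather than their average as median() does.

```r
# Hypothetical weighted imputation function with the (values, weights) signature.
weighted_median2 <- function(x, wts) {
  ord <- order(x)                      # sort values, carrying weights along
  x <- x[ord]
  wts <- wts[ord]
  cum <- cumsum(wts) / sum(wts)        # cumulative share of total weight
  x[which(cum >= 0.5)[1]]              # first value reaching half the weight
}

# Equal weights: the middle value.
print(weighted_median2(c(0, 1, 1000), c(1, 1, 1)))

# Heavier weight on 1000 pulls the weighted median up to it.
print(weighted_median2(c(1000, 0, 1), c(5, 1, 1)))
```

Under the same assumption, such a function could be supplied as missingness_imputation = weighted_median2 in place of median2.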
You can also use a constant value instead of a function. Here we replace missing values with the value -1.
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE,
                                missingness_imputation = -1)
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | -1 | 1 | 1 |
| -1 | 1 | 100 | 0 | 1 |
Changing the imputation strategy per column
You can control the imputation strategy per column via the imputation_map parameter, a named list that maps column names to imputation functions or constants. Any column not named in the imputation map will use the imputation strategy specified by the missingness_imputation parameter (which is the mean by default).
Here we use the maximum value to fill in the missing values for x and the value 0 to fill in the missing values for w.
# Imputation function: receives the values and their weights (unused here).
max2 <- function(x, wts) {
  max(x)
}
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE,
                                imputation_map = list(
                                  x = max2,
                                  w = 0
                                ))
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | 0 | 1 | 1 |
| 1000 | 1 | 100 | 0 | 1 |
If a column is not named in imputation_map, vtreat falls back to missingness_imputation (in this case, -1).
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE,
                                missingness_imputation = -1,
                                imputation_map = list(
                                  x = max2
                                ))
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 |
| 1 | 0 | 6 | 0 | 0 |
| 1000 | 0 | -1 | 1 | 1 |
| 1000 | 1 | 100 | 0 | 1 |
If missingness_imputation is not specified either, vtreat falls back to its default, the weighted mean of the column.
treatments <- designTreatmentsZ(d,
                                varlist = c('x', 'w'),
                                verbose = FALSE,
                                imputation_map = list(
                                  x = max2
                                ))
d_treated <- prepare(treatments,
                     d)
d_treated$y <- d$y
knitr::kable(d_treated)
| x | x_isBAD | w | w_isBAD | y |
|---|---|---|---|---|
| 0 | 0 | 3.00000 | 0 | 0 |
| 1 | 0 | 6.00000 | 0 | 0 |
| 1000 | 0 | 36.33333 | 1 | 1 |
| 1000 | 1 | 100.00000 | 0 | 1 |
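The three examples in this section follow one precedence rule: the entry in imputation_map for the column, then missingness_imputation, then the weighted mean. The base R sketch below mimics that rule to make it concrete. It is not vtreat's implementation: impute_column is a hypothetical helper, and it assumes unit weights and that imputation functions receive only the non-missing values.

```r
max2 <- function(x, wts) {
  max(x)
}

# Hypothetical helper (not part of vtreat) that mimics the fallback order.
impute_column <- function(values, col_name,
                          imputation_map = list(),
                          missingness_imputation = NULL) {
  # 1. Per-column entry, if any; 2. otherwise the global setting.
  strategy <- imputation_map[[col_name]]
  if (is.null(strategy)) {
    strategy <- missingness_imputation
  }
  observed <- values[!is.na(values)]
  wts <- rep(1, length(observed))      # assumption: every row has weight 1
  fill <- if (is.null(strategy)) {
    weighted.mean(observed, wts)       # 3. default: weighted mean
  } else if (is.function(strategy)) {
    strategy(observed, wts)
  } else {
    strategy                           # a constant
  }
  values[is.na(values)] <- fill
  values
}

x <- c(0, 1, 1000, NA)
w <- c(3, 6, NA, 100)

# x is named in the map, so max2 is used.
print(impute_column(x, "x", list(x = max2), -1))
# w is not in the map, so the global constant -1 is used.
print(impute_column(w, "w", list(x = max2), -1))
# w is not in the map and no global setting is given, so the mean is used.
print(impute_column(w, "w", list(x = max2)))
```

The three printed vectors correspond to the x and w columns in the last three tables above.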