Creating an Imputation Method
The function makeImputeMethod lets you create your own imputation method. To do so, you specify a learn function that extracts the necessary information and an impute function that does the actual imputation. Both functions have at least the following formal arguments:
data is a data.frame with missing values in some features.
col indicates the feature to be imputed.
target indicates the target variable(s) in a supervised learning task.
Example: Imputation using the mean
Let's have a look at function imputeMean.
imputeMean = function() {
  makeImputeMethod(learn = function(data, target, col) mean(data[[col]], na.rm = TRUE),
    impute = simpleImpute)
}
imputeMean uses the unexported mlr function simpleImpute as its impute function. It is defined as follows.
simpleImpute = function(data, target, col, const) {
  if (is.na(const))
    stopf("Error imputing column '%s'. Maybe all input data was missing?", col)
  x = data[[col]]
  if (is.factor(x) && const %nin% levels(x)) {
    levels(x) = c(levels(x), as.character(const))
  }
  replace(x, is.na(x), const)
}
The learn function calculates the mean of the non-missing observations in column col.
The mean is passed via argument const to the impute function that replaces all missing values
in feature col.
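The split between the two steps can be tried out in base R without mlr. The following is a minimal sketch: the toy data frame and the helper names learnMean and imputeConst are made up for illustration and are not part of mlr. It only mirrors the argument convention described above.

```r
# Toy data: one numeric feature with two missing values (invented for illustration).
d = data.frame(x = c(1, NA, 3, NA, 8))

# "learn" step (hypothetical helper): summarise the observed values of the column.
learnMean = function(data, target, col) mean(data[[col]], na.rm = TRUE)

# "impute" step (hypothetical helper): fill the gaps with the learned constant.
imputeConst = function(data, target, col, const) {
  x = data[[col]]
  replace(x, is.na(x), const)
}

const = learnMean(d, target = character(0), col = "x")
filled = imputeConst(d, target = character(0), col = "x", const = const)
print(const)
print(filled)
```

Here the learned value is the mean of 1, 3 and 8, and both missing entries are replaced by it.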
Writing your own imputation method
Now let's write a new imputation method: A frequently used simple technique for longitudinal data is last observation carried forward (LOCF). Missing values are replaced by the most recent observed value.
In the R code below, the learn function determines the last observed value before
each run of consecutive NAs (values) as well as the length of that run (times).
The impute function generates a vector by replicating the entries in values
according to times and uses it to replace the NAs in feature col.
imputeLOCF = function() {
  makeImputeMethod(
    learn = function(data, target, col) {
      x = data[[col]]
      ind = is.na(x)
      dind = diff(ind)
      lastValue = which(dind == 1) # position of the last observed value previous to NA
      lastNA = which(dind == -1) # position of the last of potentially several consecutive NA's
      values = x[lastValue] # last observed value previous to NA
      times = lastNA - lastValue # number of consecutive NA's
      return(list(values = values, times = times))
    },
    impute = function(data, target, col, values, times) {
      x = data[[col]]
      replace(x, is.na(x), rep(values, times))
    }
  )
}
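To make the index arithmetic in imputeLOCF easier to follow, the same steps can be traced on a short vector in base R. The vector below is invented for illustration, and the comments show the intermediate values for this particular input.

```r
# Toy series (values invented for illustration).
x = c(1, NA, NA, 4, NA, 6)

ind = is.na(x)                 # FALSE TRUE TRUE FALSE TRUE FALSE
dind = diff(ind)               # 1 0 -1 1 -1
lastValue = which(dind == 1)   # 1 4: positions right before a run of NAs starts
lastNA = which(dind == -1)     # 3 5: positions where a run of NAs ends
values = x[lastValue]          # 1 4: values to carry forward
times = lastNA - lastValue     # 2 1: lengths of the two runs

filled = replace(x, is.na(x), rep(values, times))
print(filled)                  # 1 1 1 4 4 6
```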
Note that imputeLOCF is for demonstration only and lacks some checks needed for real-world
usage. For example, if the first or the last value in x is missing, lastValue and lastNA no
longer line up and times is not computed correctly.
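One possible way to sidestep these edge cases is to look up, for every position, the most recent observed position directly. The base R helper below is a sketch under that assumption. The name locfVector is hypothetical and not part of mlr, and the choice to leave leading missing values as NA is just one option.

```r
# Hypothetical helper: carry the last observation forward within a single vector.
# Leading NAs have no previous observation and are left as NA.
locfVector = function(x) {
  obs = !is.na(x)
  idx = cumsum(obs)        # number of observed values seen so far at each position
  pos = which(obs)         # positions of the observed values
  hasPrev = idx > 0        # FALSE only for leading NAs
  out = x
  out[hasPrev] = x[pos[idx[hasPrev]]]
  out
}

# Toy series with a leading and a trailing NA (values invented for illustration).
y = c(NA, 1, NA, NA, 4, NA)
filledY = locfVector(y)
print(filledY)
```

For this input the leading NA is kept, while the trailing NA is filled with the last observed value 4.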
Below, imputeLOCF is used to impute the missing values in features Ozone and Solar.R in the
airquality data set.
data(airquality)
imp = impute(airquality, cols = list(Ozone = imputeLOCF(), Solar.R = imputeLOCF()),
  dummy.cols = c("Ozone", "Solar.R"))
head(imp$data, 10)
#>    Ozone Solar.R Wind Temp Month Day Ozone.dummy Solar.R.dummy
#> 1     41     190  7.4   67     5   1       FALSE         FALSE
#> 2     36     118  8.0   72     5   2       FALSE         FALSE
#> 3     12     149 12.6   74     5   3       FALSE         FALSE
#> 4     18     313 11.5   62     5   4       FALSE         FALSE
#> 5     18     313 14.3   56     5   5        TRUE          TRUE
#> 6     28     313 14.9   66     5   6       FALSE          TRUE
#> 7     23     299  8.6   65     5   7       FALSE         FALSE
#> 8     19      99 13.8   59     5   8       FALSE         FALSE
#> 9      8      19 20.1   61     5   9       FALSE         FALSE
#> 10     8     194  8.6   69     5  10        TRUE         FALSE