Title: | A General Multivariate Imputation Framework |
---|---|
Description: | Multivariate Expectation-Maximization (EM) based imputation framework that offers several different algorithms. These include regularisation methods like Lasso and Ridge regression, tree-based models and dimensionality reduction methods like PCA and PLS. |
Authors: | Steffen Moritz [aut, cre], Lingbing Feng [aut], Gen Nowak [ctb], Alan H. Welsh [ctb], Terry J. O'Neill [ctb] |
Maintainer: | Steffen Moritz |
License: | GPL-3 |
Version: | 2.2 |
Built: | 2024-11-03 04:05:56 UTC |
Source: | https://github.com/steffenmoritz/imputer |
The imputeR package offers a General Multivariate Imputation Framework
The imputeR package is a Multivariate Expectation-Maximization (EM) based imputation framework that offers several different algorithms. These include regularisation methods like Lasso and Ridge regression, tree-based models and dimensionality reduction methods like PCA and PLS.
Steffen Moritz, Lingbing Feng, Gen Nowak, Alan H. Welsh, Terry J. O'Neill
Quinlan's Cubist model for imputation
CubistR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function, and the optimal value for "neighbors".
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "CubistR")
This function detects the type of the variables in a data matrix. Types can be continuous only, categorical only or mixed type. A variable is defined as categorical when (1) it is a character vector or (2) it contains no more than n unique values (n = 5 by default).
Detect(x, n = 5)
x |
the data matrix whose variable types are to be detected. |
n |
the maximum number of unique values a variable may have and still be treated as categorical; a variable with more unique values is treated as numeric. |
the variable type for every column, either "numeric" or "character".
data(parkinson)
Detect(parkinson)
data(spect)
Detect(spect)
data(tic)
table(Detect(tic))
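The detection rule described for Detect can be sketched in base R. This is only a minimal sketch, not the package's implementation: the helper name detect_types is hypothetical, and it assumes a column counts as categorical when it is a character vector or has at most n distinct non-missing values.

```r
# Hypothetical helper illustrating the rule; not part of imputeR.
detect_types <- function(x, n = 5) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  vapply(x, function(col) {
    n_unique <- length(unique(col[!is.na(col)]))
    # Assumption: either condition is enough to call a column categorical.
    if (is.character(col) || n_unique <= n) "character" else "numeric"
  }, character(1))
}

toy <- data.frame(
  a = seq(0.5, 10, by = 0.5),   # 20 distinct values: treated as numeric
  b = rep(c(0, 1), 10),         # 2 distinct values: treated as categorical
  c = rep(c("u", "v"), 10),     # character vector: treated as categorical
  stringsAsFactors = FALSE
)
types <- detect_types(toy)
print(types)
```

Raising n makes more low-cardinality numeric columns count as categorical, which matters for mixed-type data where the two groups are imputed by different methods.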
boosting tree for imputation
gbmC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function, and the best.iter for the gbm model.
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "gbmC")
boosting variable selection for continuous data
glmboostR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "glmboostR")
This function uses some primitive methods, including mean imputation, median imputation, random guess, or majority imputation (only for categorical variables), to impute a missing data matrix.
guess(x, type = "mean")
x |
a matrix or data frame |
type |
is the guessing type, including "mean" for mean imputation, "median" for median imputation, "random" for random guess, and "majority" for majority imputation for categorical variables. |
data(parkinson)
# introduce some random missing values
missdata <- SimIm(parkinson, 0.1)
# impute by mean imputation
impdata <- guess(missdata)
# calculate the NRMSE
Rmse(impdata, missdata, parkinson, norm = TRUE)
# by random guessing, the NRMSE should be much bigger
impdata2 <- guess(missdata, "random")
Rmse(impdata2, missdata, parkinson, norm = TRUE)
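The naive methods listed for guess can be illustrated with a short base R sketch for numeric data. The helper name naive_guess is hypothetical, and "random" is assumed here to mean drawing from the observed values of the same column; the package may define it differently.

```r
# Hypothetical helper; a sketch of column-wise naive imputation.
naive_guess <- function(x, type = c("mean", "median", "random")) {
  type <- match.arg(type)
  x <- as.matrix(x)
  for (j in seq_len(ncol(x))) {
    miss <- is.na(x[, j])
    if (!any(miss)) next
    obs <- x[!miss, j]
    x[miss, j] <- switch(type,
      mean   = mean(obs),
      median = median(obs),
      # Assumption: a random guess resamples observed values of the column.
      random = obs[sample.int(length(obs), sum(miss), replace = TRUE)]
    )
  }
  x
}

m <- matrix(c(1, NA, 3, 4, 5, NA), nrow = 3)
filled <- naive_guess(m, "mean")
print(filled)
```

Each column is filled independently, so these methods ignore any relationship between variables; that is why they mainly serve as initial values or baselines.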
Impute missing values under the general framework in R
impute(missdata, lmFun = NULL, cFun = NULL, ini = NULL, maxiter = 100, verbose = TRUE, conv = TRUE)
missdata |
data matrix with missing values encoded as NA. |
lmFun |
the variable selection method for continuous data. |
cFun |
the variable selection method for categorical data. |
ini |
the method for initialisation. It is a length-one character vector if missdata contains only one type of variable. For continuous-only data, ini can be "mean" (mean imputation), "median" (median imputation) or "random" (random guess); the default is "mean". For categorical data, it can be either "majority" or "random"; the default is "majority". If missdata is a mix of continuous and categorical data, then ini has to be a vector of two characters, with the first element indicating the method for continuous variables and the second the method for categorical variables; the default is c("mean", "majority"). |
maxiter |
is the maximum number of iterations |
verbose |
is logical, if TRUE then detailed information will be printed in the console while running. |
conv |
logical, if TRUE, the convergence details will be returned |
This function can impute several kinds of data, including continuous-only data, categorical-only data and mixed-type data. Many methods can be used, including regularisation methods like LASSO and ridge regression, tree-based models and dimensionality reduction methods like PCA and PLS.
if conv = FALSE, it returns a completed data matrix with no missing values; if TRUE, it returns a list of components including:
imp |
the imputed data matrix with no missing values |
conv |
the convergence status during the imputation |
SimIm for missing value simulation.
data(parkinson)
# introduce 10% random missing values into the parkinson data
missdata <- SimIm(parkinson, 0.1)
# impute the missing values by LASSO
impdata <- impute(missdata, lmFun = "lassoR")
# calculate the normalised RMSE for the imputation
Rmse(impdata$imp, missdata, parkinson, norm = TRUE)
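The general idea of initialising the missing cells and then refitting models until the imputed values stabilise can be sketched in base R for continuous-only data. This is a toy illustration under stated assumptions, not the algorithm used by impute: the helper iter_impute is hypothetical, plain lm stands in for the lmFun methods, and the stopping rule (squared change below tol) is an assumption.

```r
# Toy iterative regression imputation for a numeric matrix.
# iter_impute() is a hypothetical helper, not part of imputeR.
iter_impute <- function(missdata, maxiter = 100, tol = 1e-8) {
  x <- as.matrix(missdata)
  na_idx <- is.na(x)
  # Step 1: initialise every missing cell with its column mean.
  for (j in seq_len(ncol(x))) {
    x[na_idx[, j], j] <- mean(x[, j], na.rm = TRUE)
  }
  iter <- 0
  for (iter in seq_len(maxiter)) {
    old <- x
    # Step 2: regress each incomplete column on the other columns using
    # the rows where it was observed, then predict its missing rows.
    for (j in which(colSums(na_idx) > 0)) {
      obs <- !na_idx[, j]
      df <- data.frame(y = x[, j], x[, -j, drop = FALSE])
      fit <- lm(y ~ ., data = df[obs, , drop = FALSE])
      x[!obs, j] <- predict(fit, newdata = df[!obs, , drop = FALSE])
    }
    # Step 3: stop once the imputed cells no longer change noticeably.
    if (sum((x[na_idx] - old[na_idx])^2) < tol) break
  }
  list(imp = x, iterations = iter)
}

set.seed(1)
z <- rnorm(100)
truth <- cbind(a = z + rnorm(100, sd = 0.1),
               b = 2 * z + rnorm(100, sd = 0.1),
               c = -z + rnorm(100, sd = 0.1))
miss <- truth
miss[runif(length(miss)) < 0.1] <- NA
res <- iter_impute(miss)
sum(is.na(res$imp))
```

The returned list mirrors the shape described above for conv = TRUE (an imputed matrix plus convergence information), but the component name iterations is chosen for this sketch only.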
logistic regression with lasso for imputation
lassoC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "lassoC")
LASSO variable selection for continuous data
lassoR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "lassoR")
This function is used internally by guess; it may be of little use on its own.
major(x)
x |
a character (or numeric categorical) vector with missing values |
a vector of the same length, with missing values imputed by the majority class of the vector.
a <- c(rep(0, 10), rep(1, 15), rep(2, 5))
a[sample(seq_along(a), 5)] <- NA
a
b <- major(a)
b
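Majority-class imputation for a single vector can be sketched as follows. The helper majority_fill is hypothetical and only illustrates the idea; in particular, breaking ties in favour of the class seen first is an assumption of this sketch.

```r
# Hypothetical helper: replace NA by the most frequent observed value.
majority_fill <- function(x) {
  obs <- x[!is.na(x)]
  classes <- unique(obs)
  # Count how often each class occurs; ties go to the class seen first.
  counts <- tabulate(match(obs, classes), nbins = length(classes))
  x[is.na(x)] <- classes[which.max(counts)]
  x
}

a <- c(0, 0, 1, 1, 1, NA, 2, NA)
b <- majority_fill(a)
print(b)
```

Working with the observed values directly, rather than with the names of a frequency table, keeps the vector's original type (numeric stays numeric, character stays character).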
Calculate mixed error when the imputed matrix is mixed type
mixError(imp, mis, true, norm = TRUE)
imp |
the imputed matrix |
mis |
the original matrix with missing values |
true |
the true matrix |
norm |
logical, if TRUE, the normalised RMSE will be returned for continuous variables |
a vector of two values indicating the mixed error of the imputation; the first is either RMSE or NRMSE, the second is MCE.
data(tic)
Detect(tic)
missdata <- SimIm(tic, 0.3)
library(earth)
impdata <- impute(missdata, lmFun = "earth", cFun = "rpartC")
mixError(impdata$imp, missdata, tic)
Naive imputation for mixed type data
mixGuess(missdata, method = c("mean", "majority"))
missdata |
a data matrix with missing values |
method |
a character vector of length 2 indicating which two methods to use for continuous variables and categorical variables respectively. There are three options for continuous variables: "mean", "median" and "random", and two options for categorical variables: "majority" and "random". The default method is "mean" for the continuous part and "majority" for the categorical part. |
a data matrix of the same size with no missing values.
data(tic)
missdata <- SimIm(tic, 0.1)
sum(is.na(missdata))
impdata <- mixGuess(missdata)
sum(is.na(impdata))
This function calculates the misclassification error given the imputed data, the missing data and the true data.
mr(imp, mis, true)
imp |
the imputed data matrix |
mis |
the missing data matrix |
true |
the true data matrix |
the misclassification error
data(spect)
Detect(spect)
missdata <- SimIm(spect, 0.1)
sum(is.na(missdata))
# impute using rpart
impdata <- impute(missdata, cFun = "rpartC")
# calculate the misclassification error
mr(impdata$imp, missdata, spect)
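A misclassification error of this kind can be sketched in base R as the share of originally missing cells whose imputed class differs from the true class. The helper misclass_rate is hypothetical and assumes that only the cells that are NA in the missing data matrix are scored.

```r
# Hypothetical helper: proportion of wrongly imputed categorical cells.
misclass_rate <- function(imp, mis, true) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  idx <- is.na(mis)            # assumption: score only the imputed cells
  mean(imp[idx] != true[idx])
}

truth <- matrix(c(0, 1, 1, 0, 1, 0), nrow = 3)
mis <- truth
mis[c(1, 5, 6)] <- NA          # three cells go missing
imp <- truth
imp[c(1, 5, 6)] <- c(1, 1, 0)  # first cell imputed wrongly, other two correctly
rate <- misclass_rate(imp, mis, truth)
print(rate)
```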
Ordered boxplot for a data matrix
orderbox(x, names = c("method", "MCE"), order.by = mean, decreasing = TRUE, notch = TRUE, col = "bisque", mar = c(7, 4.1, 4.1, 2), ...)
x |
a matrix |
names |
a length two character vector, default is c("method", "MCE") |
order.by |
the statistic to order by, default is mean |
decreasing |
default is TRUE, the boxplot will be arranged in a decreasing order |
notch |
logical, default is TRUE |
col |
color for the boxplots, default is "bisque". |
mar |
the margin for the plot, adjust it to your need. |
... |
some other arguments that can be passed to the boxplot function |
a boxplot
data(parkinson)
orderbox(parkinson)
This dataset contains a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease. Each row corresponds to one of 195 voice recordings from these individuals and each column to a measurement variable. This data was originally obtained from the UCI Machine Learning Repository. For detailed information about the columns, see the reference and the source below. In simulation studies, this dataset can be treated as continuous-only data.
A data frame with 195 rows and 22 variables
MDVP:Fo(Hz). Average vocal fundamental frequency
MDVP:Fhi(Hz). Maximum vocal fundamental frequency
MDVP:Flo(Hz). Minimum vocal fundamental frequency
...
http://archive.ics.uci.edu/ml/datasets/Parkinsons
Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM (2007). Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection. BioMedical Engineering OnLine.
Principal component regression method for imputation
pcrR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "pcrR")
This is a plot function for assessing imputation performance given the imputed data and the original true data.
plotIm(imp, mis, true, ...)
imp |
the imputed data matrix |
mis |
the missing data matrix |
true |
the true data matrix |
... |
other arguments that can be passed to plot |
a plot object that shows the imputation performance
data(parkinson)
# introduce 10% random missing values into the parkinson data
missdata <- SimIm(parkinson, 0.1)
# impute the missing values by LASSO
impdata <- impute(missdata, lmFun = "lassoR")
# calculate the normalised RMSE for the imputation
Rmse(impdata$imp, missdata, parkinson, norm = TRUE)
# Plot imputation performance
plotIm(impdata$imp, missdata, parkinson)
Partial least squares regression method for imputation
plsR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "plsR")
Logistic regression with ridge penalty for imputation
ridgeC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "ridgeC")
Ridge shrinkage variable selection for continuous data
ridgeR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "ridgeR")
This function calculates the imputation error given the imputed data, the missing data and the true data.
Rmse(imp, mis, true, norm = FALSE)
imp |
the imputed data matrix |
mis |
the missing data matrix |
true |
the true data matrix |
norm |
logical, if TRUE then the normalized RMSE will be returned |
the RMSE or NRMSE
impute for the main imputation function, mr for the misclassification error metric.
data(parkinson)
# introduce 10% random missing values into the parkinson data
missdata <- SimIm(parkinson, 0.1)
# impute the missing values by LASSO
impdata <- impute(missdata, lmFun = "lassoR")
# calculate the normalised RMSE for the imputation
Rmse(impdata$imp, missdata, parkinson, norm = TRUE)
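How an RMSE restricted to the imputed cells might be computed is sketched below. The helper rmse_missing is hypothetical, and the normalisation is an assumption: here the RMSE is divided by the standard deviation of the true values at the missing positions, whereas the package may normalise differently.

```r
# Hypothetical helper: RMSE over the cells that were missing in `mis`.
rmse_missing <- function(imp, mis, true, norm = FALSE) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  idx <- is.na(mis)
  err <- sqrt(mean((imp[idx] - true[idx])^2))
  # Assumption: normalise by the spread of the true values that were hidden.
  if (norm) err <- err / sd(true[idx])
  err
}

truth <- matrix(as.numeric(1:6), nrow = 3)
mis <- truth
mis[c(2, 5)] <- NA   # hide the true values 2 and 5
imp <- truth
imp[2] <- 3          # imputed 3 instead of 2; cell 5 is imputed exactly
rmse <- rmse_missing(imp, mis, truth)
nrmse <- rmse_missing(imp, mis, truth, norm = TRUE)
print(c(rmse = rmse, nrmse = nrmse))
```

Normalising makes errors comparable across data sets with different scales, which is why the examples in this manual report the normalised version.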
classification tree for imputation
rpartC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "rpartC")
Evaluate imputation performance by simulation
SimEval(data, task = NULL, p = 0.1, n.sim = 100, ini = "mean", method = NULL, guess = FALSE, guess.method = NULL, other = NULL, verbose = TRUE, seed = 1234)
data |
is the complete data matrix that will be used for simulation |
task |
task type: 1 for regression, 2 for classification or 3 for mixed type |
p |
the proportion of missing values that will be introduced into data; it has to be a value between 0 and 1 |
n.sim |
the number of simulations, default is 100 times |
ini |
is the initialization setting for some relevant imputation methods; the default setting is "mean", while "median" and "random" can also be used. See also |
method |
the imputation method based on variable selection to use for simulation; some other imputation method can be passed to the 'other' argument |
guess |
logical value, if TRUE, then |
guess.method |
guess type for the guess function. It cannot be NULL if guess is TRUE |
other |
some other imputation method that is based on variable selection can be used. The requirement for this 'other' method is strict: it receives a data matrix including missing values and returns a complete data matrix. |
verbose |
logical, if TRUE, additional output information will be provided during iterations, i.e., the method in use, the iteration number and the convergence difference compared to the previous iteration. The progress bar is shown irrespective of this option and cannot be turned off. |
seed |
sets the seed for simulation so that simulations using different imputation methods are comparable. The default value is 1234, which has no special meaning. If 1234 is used, the seed for simulating the first missing data matrix is 1234, and it increases by one for every subsequent simulated data matrix. |
a list of components including
call |
the method used for imputation |
task |
the name of the task |
time |
computational time |
error |
the imputation error |
conv |
the number of iterations to converge |
data(parkinson)
# WARNING: simulation may take considerable time.
SimEval(parkinson, method = "lassoR")
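The seeding scheme described for the seed argument (first simulation uses seed, each later one uses the previous seed plus one) can be sketched as a simple loop. Everything here is hypothetical and simplified: sim_eval and mean_fill are illustrative helpers, missing cells are drawn independently with probability p, and the error is a plain RMSE over the hidden cells.

```r
# Hypothetical baseline imputer: fill each column with its observed mean.
mean_fill <- function(m) {
  for (j in seq_len(ncol(m))) {
    m[is.na(m[, j]), j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

# Hypothetical helper: repeat "hide cells, impute, score" n.sim times.
sim_eval <- function(data, impute_fun, p = 0.1, n.sim = 10, seed = 1234) {
  data <- as.matrix(data)
  errors <- numeric(n.sim)
  for (i in seq_len(n.sim)) {
    set.seed(seed + i - 1)              # seed, seed + 1, seed + 2, ...
    mis <- data
    mis[runif(length(data)) < p] <- NA  # assumption: independent cells
    imp <- impute_fun(mis)
    idx <- is.na(mis)
    errors[i] <- sqrt(mean((imp[idx] - data[idx])^2))
  }
  errors
}

set.seed(1)
d <- matrix(rnorm(200), nrow = 50)
errs <- sim_eval(d, mean_fill, p = 0.1, n.sim = 10, seed = 1234)
summary(errs)
```

Because the seed sequence depends only on seed and the simulation index, two imputation methods run with the same seed see exactly the same missing-data patterns, which is what makes their errors comparable.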
This function randomly introduces a given proportion of missing values into a matrix.
SimIm(data, p = 0.1)
data |
a data matrix to simulate |
p |
the proportion of missing values to introduce into the data matrix; it should be a value between 0 and 1. |
a matrix of the same size with simulated missing values.
# Create data without missing values as example
simdata <- matrix(rnorm(100), 10, 10)
# Now let's introduce some missing values into the dataset
missingdata <- SimIm(simdata, p = 0.15)
# count the number of missing values afterwards
sum(is.na(missingdata))
#------------------
# There are no missing values in the original parkinson data
data(parkinson)
# Let's introduce some missing values into the dataset
missdata <- SimIm(parkinson, 0.1)
# count the number of missing values afterwards
sum(is.na(missdata))
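One simple way to introduce missing values completely at random is sketched below. The helper sim_missing is hypothetical and assumes each cell is set to NA independently with probability p, so the realised share of missing cells varies around p rather than matching it exactly; SimIm itself may use a different mechanism.

```r
# Hypothetical helper: set each cell to NA independently with probability p.
sim_missing <- function(data, p = 0.1) {
  stopifnot(p >= 0, p <= 1)
  data <- as.matrix(data)
  mask <- matrix(runif(length(data)) < p, nrow = nrow(data))
  data[mask] <- NA
  data
}

set.seed(42)
m <- matrix(rnorm(200), nrow = 20)
mm <- sim_missing(m, p = 0.15)
# Share of cells that ended up missing (close to, but not exactly, p).
mean(is.na(mm))
```

Note that as.matrix coerces a mixed-type data frame to a single type, so a version meant for mixed data would need to mask the columns of the data frame one by one instead.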
The dataset describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images. Each of the patients is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract features that summarize the original SPECT images. As a result, a 44 continuous feature pattern was created for each patient. The pattern was further processed to obtain 22 binary feature patterns. The CLIP3 algorithm was used to generate classification rules from these patterns. The CLIP3 algorithm generated rules that were 84.0% accurate. SPECT is a good data set for testing ML algorithms; it has 267 instances that are described by 23 binary attributes. In the imputation study, it can be treated as categorical-only data. For detailed information, please refer to the Source and the Reference.
A data frame with 266 rows and 23 variables
X1. OVERALL_DIAGNOSIS: 0,1 (class attribute, binary)
X0. F1: 0,1 (the partial diagnosis 1, binary)
...
http://archive.ics.uci.edu/ml/datasets/SPECT+Heart
Kurgan, L.A., Cios, K.J., Tadeusiewicz, R., Ogiela, M. & Goodenday, L.S. (2001). Knowledge Discovery Approach to Automated Cardiac SPECT Diagnosis. Artificial Intelligence in Medicine, vol. 23:2, pp 149-169.
Best subset variable selection (backward direction) for categorical data
stepBackC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "stepBackC")
Best subset variable selection (backward direction) for continuous data
stepBackR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "stepBackR")
Best subset variable selection from both forward and backward direction for categorical data
stepBothC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "stepBothC")
Best subset variable selection from both forward and backward direction for continuous data
stepBothR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "stepBothR")
Best subset variable selection (forward direction) for categorical data
stepForC(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(spect)
missdata <- SimIm(spect, 0.1)
impdata <- impute(missdata, cFun = "stepForC")
Best subset variable selection (forward direction) for continuous data
stepForR(x, y)
x |
predictor matrix |
y |
response vector |
a model object that can be used by the impute function
data(parkinson)
missdata <- SimIm(parkinson, 0.1)
impdata <- impute(missdata, lmFun = "stepForR")
This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data. For detailed information, please refer to the Source. For imputation studies, this dataset can be treated as mixed-type data.
A data frame with 266 rows and 23 variables
V1. a numeric variable
V2. a categorical variable
...
http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)
P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000.