Chapter 6 Missing values
Goal: provide tools to visualize missingness in a dataset and to get a complete dataset either by removing rows/columns or imputing missing values.
ASTERICS users have three tools to handle missing values in a dataset:
- Explore the missing values (r_missing_overview)
- Remove the missing values (r_missing_showremove)
- Impute the missing values (r_missing_impute)
proteins <- read.table("../forge/backend/R/data/protein.csv",
                       row.names = 1, header = TRUE, sep = " ")
set.seed(1206)
# Generate missingness patterns from three columns, with an MNAR mechanism
proteins_miss <- mice::ampute(proteins[c(1, 51, 101)], prop = 0.6,
                              mech = "MNAR")
# Repeat the three pattern columns to cover 50 + 50 + 42 = 142 columns
my_pattern <- cbind(matrix(rep(proteins_miss$patterns[ ,1], 50), ncol = 50),
                    matrix(rep(proteins_miss$patterns[ ,2], 50), ncol = 50),
                    matrix(rep(proteins_miss$patterns[ ,3], 42), ncol = 42))
# Randomly flip about half of the pattern entries (0 becomes 1 and 1 becomes 0)
idx_missing <- runif(nrow(my_pattern) * ncol(my_pattern)) <= 0.5
dim(idx_missing) <- dim(my_pattern)
my_pattern[idx_missing] <- 1 - my_pattern[idx_missing]
# Ampute the full dataset with the custom patterns
proteins_miss_final <- mice::ampute(proteins, patterns = my_pattern)
proteinsMNAR <- proteins_miss_final$amp
6.1 Overview of missing values
This function is meant to be called directly by the interface. It produces an object stored in the global environment.
The function takes 2 arguments:
- datasetName: the name of the dataset.
- interactive: TRUE/FALSE. Should the heatmap be produced in interactive mode? Defaults to TRUE.
The interactive heatmap can take a long time to render on screen.
out_overview <- r_missing_overview("proteinsMNAR", interactive = TRUE)
Returns a list of length 3:
- Object: the output of r_missing_summarize(datasetName), used as pre-calculation for internal re-use of the object.
- Table: the numerical outputs, statistics on missing values in the dataset.
- Graphical: the graphical outputs.
6.1.1 Object component
The Object component is the result of r_missing_summarize(datasetName), used as pre-calculation for internal re-use of the object (in case of static then interactive mode).
The object is named out_miss and is a named list of length 9.
- MissingMatrix: the TRUE/FALSE matrix of the NAs in the dataset
- MissingProp: a data.frame with 3 rows: the number and proportion of elements with missing values for each of the 3 components (cells, rows and columns)
- MissingByRows: a data.frame of the row names, with the percentage and number of missing values for each one, sorted by decreasing number of missing values
- MissingByColumns: a data.frame of the column names, with the percentage and number of missing values for each one, sorted by decreasing number of missing values
- TableMissRows: by number of missing values, the number of rows (and the percentage of rows that it represents)
- TableMissCols: by number of missing values, the number of columns (and the percentage of columns that it represents)
- Pattern: the number of simultaneous observations by combinations of missing value positions
- Table: the same output as the main Table component, stored here to avoid re-computations
- Graphical: the same output as the main Graphical component, stored here to avoid re-computations
The elements of the Object component are not further detailed since Object is not meant to be used by the interface.
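To give an idea of what such summaries look like, here is a minimal base R sketch on a toy data frame. It is not the code of r_missing_summarize: the object names (toy, missing_matrix, missing_by_rows, missing_by_cols, missing_prop) and the column names are assumptions chosen for illustration, and only three of the nine elements are mimicked.

```r
# Toy data frame with missing values (hypothetical, not the proteins dataset)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12),
                  row.names = paste0("ind", 1:4))

# TRUE/FALSE matrix of the NA positions (analogue of MissingMatrix)
missing_matrix <- is.na(toy)

# Number and percentage of missing values per row, sorted by decreasing
# number of missing values (analogue of MissingByRows)
nb_by_row <- rowSums(missing_matrix)
missing_by_rows <- data.frame(
  name = rownames(toy),
  nb = nb_by_row,
  percent = 100 * nb_by_row / ncol(toy)
)
missing_by_rows <- missing_by_rows[order(missing_by_rows$nb,
                                         decreasing = TRUE), ]

# Same summary per column (analogue of MissingByColumns)
nb_by_col <- colSums(missing_matrix)
missing_by_cols <- data.frame(
  name = colnames(toy),
  nb = nb_by_col,
  percent = 100 * nb_by_col / nrow(toy)
)
missing_by_cols <- missing_by_cols[order(missing_by_cols$nb,
                                         decreasing = TRUE), ]

# Cells, rows and columns affected by missing values (analogue of MissingProp)
missing_prop <- data.frame(
  component = c("cells", "rows", "columns"),
  nb = c(sum(missing_matrix), sum(nb_by_row > 0), sum(nb_by_col > 0)),
  total = c(length(missing_matrix), nrow(toy), ncol(toy))
)
missing_prop$percent <- 100 * missing_prop$nb / missing_prop$total
```

The actual structure and column names of out_miss may differ; the sketch only illustrates the kind of quantities described in the list above.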
6.1.2 Table component
The Table component is a named list of length 5.
- PropsTable: the global proportion of missing values in proteinsMNAR
- PropRowsTable: the proportion of missing values in proteinsMNAR, by row
- PropColsTable: the proportion of missing values in proteinsMNAR, by column
- RowsTable: the percentage and number of rows by number of missing values they have
- ColsTable: the percentage and number of columns by number of missing values they have
jsonview::json_tree_view(out_overview$Table$PropsTable, scroll = TRUE)
jsonview::json_tree_view(out_overview$Table$PropRowsTable, scroll = TRUE)
jsonview::json_tree_view(out_overview$Table$PropColsTable, scroll = TRUE)
jsonview::json_tree_view(out_overview$Table$RowsTable, scroll = TRUE)
jsonview::json_tree_view(out_overview$Table$ColsTable, scroll = TRUE)
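As an illustration of what a table such as RowsTable summarizes, the following base R sketch counts the rows of a toy data frame by their number of missing values. The names toy, rows_table and global_prop, and the column names, are hypothetical; the tables actually returned by r_missing_overview may be laid out differently.

```r
# Toy data frame with missing values (hypothetical)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12))

# Global percentage of missing cells (the kind of figure given by PropsTable)
global_prop <- 100 * mean(is.na(toy))

# Number of missing values in each row
nb_by_row <- rowSums(is.na(toy))

# Distribution: how many rows have 0, 1, 2, ... missing values
counts <- table(nb_by_row)
rows_table <- data.frame(
  nb_missing = as.integer(names(counts)),
  nb_rows = as.vector(counts),
  percent_rows = 100 * as.vector(counts) / nrow(toy)
)
```

Replacing rowSums() by colSums() and nrow() by ncol() gives the column counterpart.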
6.1.3 Graphical component
The Graphical component is a named list of length 5:
- OriginalHeatmap: the heatmap of the original (not ordered) dataset, with missing values in grey and observed values in white.
- OrderedHeatmap: the heatmap of missing values, ordered by column and row percentage of missing values.
- BarplotOrderedMissRows: a barplot of the percentage of missing values by row.
- BarplotOrderedMissCols: a barplot of the percentage of missing values by column.
- MapPattern: the global pattern of missingness. Number of simultaneous rows and columns by missing values.
In interactive mode, all graphs are interactive.
In non-interactive mode, the heatmaps (OriginalHeatmap, OrderedHeatmap, and MapPattern) are in PNG format; the other plots remain interactive.
6.2 Remove missing values
There are two steps to remove missing values:
- Show the rows or columns that will be removed if they match the parameters
- Actually remove the rows/columns and create a new dataset
Accordingly, there are two functions:
r_missing_showremove, which takes 3 arguments:
- The name of the dataset (datasetName)
- The dimension on which to filter the missing values (direction): one of "rows", "columns". For "rows": remove the rows containing missing values; for "columns": remove the columns containing missing values.
- The threshold on which to filter the rows or columns (threshold), as a percentage of missing values. For example, a threshold of 10 in the row direction will remove the rows having strictly more than 10% of missing values.
r_missing_extractremove, which takes 2 arguments:
- The name of the object from r_missing_showremove (objectName)
- A user name, if the user wants to rename the new data.frame (userName).
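The threshold rule described above can be sketched in base R as follows. The helper flag_removed() is hypothetical and is not part of ASTERICS; it only illustrates the "strictly more than threshold percent" comparison on a toy data frame.

```r
# Hypothetical helper (not an ASTERICS function): flags the rows or columns
# whose percentage of missing values is strictly above `threshold`
flag_removed <- function(df, direction = c("rows", "columns"), threshold = 0) {
  direction <- match.arg(direction)
  miss <- is.na(df)
  percent <- if (direction == "rows") {
    100 * rowMeans(miss)
  } else {
    100 * colMeans(miss)
  }
  # Strict inequality: an element at exactly `threshold` percent is kept
  percent > threshold
}

# Toy data frame: rows have 25%, 75%, 0% and 25% of missing values
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12),
                  d = c(1, NA, 3, 4))

# "Show" step: which rows would be removed with a threshold of 50%?
to_remove <- flag_removed(toy, "rows", threshold = 50)

# "Extract" step: build the filtered dataset
toy_filtered <- toy[!to_remove, , drop = FALSE]
```

With a threshold of 25, the two rows at exactly 25% of missing values are kept, since the comparison is strict.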
6.2.1 Function r_missing_showremove
proteinsNoNA_rows <- r_missing_showremove("proteinsMNAR", "rows", threshold = 46)
The resulting object has 3 components (Object, Graphical and Table). Only Graphical and Table are passed to the interface by r_wrapp.
6.2.1.1 Graphical
In Graphical, a heatmap is produced from the original dataset, showing the missing values and the rows or columns that will be removed according to the parameters:
jsonview::json_tree_view(proteinsNoNA_rows$Graphical, height = "150px")
knitr::include_graphics(proteinsNoNA_rows$Graphical$beforeHeatMap$path)
6.2.1.2 Table
A summary table of the removal that would be produced with the given parameters is also returned:
proteinsNoNA_rows$Table$statsRemove$data
## label number
## 1 Number of individuals that will be removed 71
## 2 Number of individuals that will be kept 308
## 3 Number of individuals that will be kept with remaining NA's 119
## 4 Number of missings cells before removal 12298
## 5 Number of missing cells that will be kept 7399
In JSON as returned for the interface:
jsonview::json_tree_view(proteinsNoNA_rows$Table, scroll = TRUE)
6.2.2 Function r_missing_extractremove
This function is meant to be called after r_missing_showremove and creates the dataset after removal of the rows/columns according to the parameters.
tt <- r_wrapp("r_missing_showremove", "proteinsMNAR", "rows", threshold = 46)
proteinsNoNA_extract <- r_wrapp("r_missing_extractremove", "MissShowRemove_1")
Two components are returned by the function: Object and Table.
The Table component only contains the name of the new dataset:
jsonview::json_tree_view(proteinsNoNA_extract, height = "100px")
6.3 Impute missing values
The function r_missing_impute takes 3 arguments:
- The name of the dataset (datasetName)
- The method of imputation (method): "pca", "knn" or "zero"
- The type of variable on which to perform imputation (typeVar): "n" (numerical), "c" (categorical), "m" (mixed) or "a" (auto = the most represented type)
Methods of imputation:
- "pca": missMDA::MIPCA()$res.imputePCA
- "knn": VIM::kNN()
- "zero": replace NAs by zeros
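The simplest of these methods, imputation by zeros, can be sketched in base R as shown below. This is an illustration on a hypothetical toy data frame, not the code of r_missing_impute; the "pca" and "knn" methods rely on the missMDA and VIM packages and are not reproduced here.

```r
# Toy numerical data frame with missing values (hypothetical)
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, 5, 6))

# Number of values that will be imputed
nb_imputed <- sum(is.na(toy))

# Zero imputation: replace every NA by 0, leaving observed values untouched
toy_zero <- toy
toy_zero[is.na(toy_zero)] <- 0
```

Zero imputation only makes sense for numerical variables; it is shown here because it needs no additional package.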
6.3.1 Example: imputation with the PCA method:
proteinsNoNA_pca <- r_wrapp("r_missing_impute", "proteinsMNAR", "pca")
This produces two objects: a dataset with imputed values and a MissImpute object.
It returns only the Table and Graphical components of the MissImpute object to the interface:
jsonview::json_tree_view(proteinsNoNA_pca, scroll = TRUE)
The ImputeInfo table contains this information:
imputeMissing_1$ImputeInfo$data
## indicator value
## 1 Types of variables that have been imputed: numeric
## 2 Number of variables with imputed missing values: 379
## 3 Number of imputed missing values: 12298
The heatmap is always in PNG format:
knitr::include_graphics(imputeMissing_1$Graphical$Heatmap$path)
The same applies to the knn and zero imputation methods.
If the chosen type is not numerical (i.e. typeVar = "c", or "a" leading to "c"), there is no graph in the output.