Chapter 6 Missing values

Goal: provide tools to visualize missingness in a dataset and to get a complete dataset either by removing rows/columns or imputing missing values.

ASTERICS users have 3 tools to handle missing values in a dataset:

  1. Explore the missing values (r_missing_overview)
  2. Remove the missing values (r_missing_showremove)
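  3. Impute the missing values (r_missing_impute)

# Load the complete protein dataset (the first column holds the row names)
proteins <- read.table("../forge/backend/R/data/protein.csv",
                            row.names = 1, header = TRUE, sep=" ")
set.seed(1206)
# Ampute three columns only, to obtain missingness patterns for three variables
proteins_miss <- mice::ampute(proteins[c(1, 51, 101)], prop = 0.6,
                              mech = "MNAR")
# Expand each of the three pattern columns into a block of columns (50 + 50 + 42)
my_pattern <- cbind(matrix(rep(proteins_miss$patterns[ ,1], 50), ncol = 50),
                    matrix(rep(proteins_miss$patterns[ ,2], 50), ncol = 50),
                    matrix(rep(proteins_miss$patterns[ ,3], 42), ncol = 42))
# Flip each pattern entry (0 <-> 1) with probability 0.5 to diversify the patterns
idx_missing <- runif(nrow(my_pattern) * ncol(my_pattern)) <= 0.5
dim(idx_missing) <- dim(my_pattern)
my_pattern[idx_missing] <- 1 - my_pattern[idx_missing]
# Ampute the full dataset with the custom patterns
# (mech and prop are not passed here, so ampute() uses its defaults)
proteins_miss_final <- mice::ampute(proteins, patterns = my_pattern)
proteinsMNAR <- proteins_miss_final$amp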
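
The pattern construction above can be hard to picture on a matrix this wide. The following is a minimal sketch of the same two steps (block expansion, then random flipping) on a toy pattern matrix. The toy values, the helper block() and the block sizes are illustrative assumptions and are not part of ASTERICS. In mice::ampute(), a 0 in a pattern marks a variable to be made missing and a 1 marks a variable that stays observed.

```r
# Toy sketch (hypothetical values): expand and randomly flip a 0/1 pattern matrix
set.seed(1)

# Two patterns over three variables (0 = to be made missing, 1 = stays observed)
toy_pattern <- matrix(c(0, 1, 1,
                        1, 0, 1), nrow = 2, byrow = TRUE)

# Hypothetical helper: repeat one pattern column n times, as done in the chapter
block <- function(col, n) matrix(rep(col, n), ncol = n)

# Blocks of 4, 4 and 2 columns instead of 50, 50 and 42
wide_pattern <- cbind(block(toy_pattern[, 1], 4),
                      block(toy_pattern[, 2], 4),
                      block(toy_pattern[, 3], 2))

# Draw one uniform number per cell and flip the cells where it is <= 0.5
flip <- runif(nrow(wide_pattern) * ncol(wide_pattern)) <= 0.5
dim(flip) <- dim(wide_pattern)
flipped <- wide_pattern
flipped[flip] <- 1 - flipped[flip]

print(wide_pattern)
print(flipped)
```

Only the flipped cells change; all other cells keep the value inherited from the three-variable patterns.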

6.1 Overview of missing values

r_missing_overview

This function is meant to be called directly by the interface. It produces an object stored in the global environment.

The function takes 2 arguments:

  • datasetName: the name of the dataset.
  • interactive: TRUE/FALSE. Should the heatmap be produced in interactive mode or not? Defaults to TRUE.

The interactive heatmap can take a long time to render on screen.

out_overview <- r_missing_overview("proteinsMNAR", interactive = TRUE)

The function returns a list of length 3:

  • Object: the output of r_missing_summarize(datasetName), used as pre-calculation for internal re-use of the object.
  • Table: the numerical outputs, i.e. statistics on the missing values in the dataset.
  • Graphical: the graphical outputs.

6.1.1 Object component

The Object component is the result of r_missing_summarize(datasetName). It is a pre-calculation kept for internal re-use, for instance when the function is called in static mode and then in interactive mode.

The object is named out_miss and is a named list of length 9.

  • MissingMatrix: the TRUE/FALSE matrix of the NA’s in the dataset
  • MissingProp: a data.frame with 3 rows (cells, rows and columns) giving, for each, the number and proportion that contain missing values
  • MissingByRows: a data.frame of the row names with the percentage and number of missing values for each row, sorted by decreasing number of missing values
  • MissingByColumns: a data.frame of the column names with the percentage and number of missing values for each column, sorted by decreasing number of missing values
  • TableMissRows: for each number of missing values, the number of rows having that many missing values (and the percentage of rows that it represents)
  • TableMissCols: for each number of missing values, the number of columns having that many missing values (and the percentage of columns that it represents)
  • Pattern: the number of observations sharing each combination of missing value positions
  • Table: the same output as the main Table component, used here for storage, to avoid re-computations
  • Graphical: the same output as the main Graphical component, used here for storage, to avoid re-computations

The elements of the Object component are not further detailed since Object is not meant to be used by the interface.
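
Although Object is internal, the kind of quantities it stores can be illustrated in base R. The sketch below is not the implementation of r_missing_summarize: the toy data and the names miss_matrix, miss_by_rows and miss_prop are assumptions chosen to mirror MissingMatrix, MissingByRows and MissingProp.

```r
# Toy data frame with 4 missing cells (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12),
                  row.names = paste0("ind", 1:4))

# TRUE/FALSE matrix of the NA positions (analogue of MissingMatrix)
miss_matrix <- is.na(toy)

# Number and percentage of missing values per row, sorted by decreasing
# number of missing values (analogue of MissingByRows)
n_by_row <- rowSums(miss_matrix)
miss_by_rows <- data.frame(name = rownames(toy),
                           nb = n_by_row,
                           percent = 100 * n_by_row / ncol(toy))
miss_by_rows <- miss_by_rows[order(-miss_by_rows$nb), ]

# Number and percentage of cells, rows and columns that contain at least
# one missing value (analogue of MissingProp)
n_by_col <- colSums(miss_matrix)
miss_prop <- data.frame(component = c("cells", "rows", "columns"),
                        nb = c(sum(miss_matrix),
                               sum(n_by_row > 0),
                               sum(n_by_col > 0)),
                        total = c(length(miss_matrix), nrow(toy), ncol(toy)))
miss_prop$percent <- 100 * miss_prop$nb / miss_prop$total

print(miss_by_rows)
print(miss_prop)
```

The per-column table follows the same logic with colSums() and nrow(toy) in place of rowSums() and ncol(toy).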

6.1.2 Table component

The Table component is a named list of length 5.

  • PropsTable: the global proportion of missing values in proteinsMNAR
  • PropRowsTable: the proportion of missing values in proteinsMNAR, by row
  • PropColsTable: the proportion of missing values in proteinsMNAR, by column
  • RowsTable: the percentage and number of rows, by number of missing values they contain
  • ColsTable: the percentage and number of columns, by number of missing values they contain

jsonview::json_tree_view(out_overview$Table$PropsTable, scroll=TRUE)
jsonview::json_tree_view(out_overview$Table$PropRowsTable, scroll=TRUE)
jsonview::json_tree_view(out_overview$Table$PropColsTable, scroll=TRUE)
jsonview::json_tree_view(out_overview$Table$RowsTable, scroll=TRUE)
jsonview::json_tree_view(out_overview$Table$ColsTable, scroll=TRUE)
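
To make the content of RowsTable concrete, here is a small base R sketch of a "rows by number of missing values" table. It is an illustration under assumptions (toy data, hypothetical column names nb_missing, nb_rows and percent_rows), not the code used by r_missing_overview.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12))

# Number of missing values in each row
n_by_row <- rowSums(is.na(toy))

# Count how many rows have 0, 1, 2, ... missing values
counts <- table(n_by_row)
rows_table <- data.frame(nb_missing = as.integer(names(counts)),
                         nb_rows = as.vector(counts),
                         percent_rows = 100 * as.vector(counts) / nrow(toy))

print(rows_table)
```

Replacing rowSums() with colSums() and nrow(toy) with ncol(toy) gives the corresponding table for columns.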

6.1.3 Graphical component

The Graphical component is a named list of length 5:

  • OriginalHeatmap: the heatmap of the original (not ordered) dataset, with missing values in grey, and observed values in white.
  • OrderedHeatmap: the heatmap of missing values, ordered by column and row percentage of missing values.
  • BarplotOrderedMissRows: a barplot of missing percentage by rows
  • BarplotOrderedMissCols: a barplot of missing percentage by columns
  • MapPattern: the global pattern of missingness, i.e. the number of rows sharing each combination of columns with missing values.

In interactive mode, all graphs are interactive. In non-interactive mode, the heatmaps (OriginalHeatmap, OrderedHeatmap, and MapPattern) are produced in png format; the other plots remain interactive.
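
The reordering behind OrderedHeatmap can be sketched as follows. This is an assumption about the principle only: the sort direction (decreasing) and the variable names are illustrative, and the actual plotting code of ASTERICS is not reproduced here.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12),
                  row.names = paste0("ind", 1:4))

miss <- is.na(toy)

# Order rows and columns by their proportion of missing values
# (decreasing order is an assumption made for this sketch)
row_order <- order(rowMeans(miss), decreasing = TRUE)
col_order <- order(colMeans(miss), decreasing = TRUE)
ordered_miss <- miss[row_order, col_order, drop = FALSE]

print(ordered_miss)
```

The resulting logical matrix groups the missing cells in one corner and can then be passed to any heatmap function, with one colour for TRUE (missing) and one for FALSE (observed).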

6.1.3.1 Heatmaps

out_overview$Graphical$OriginalHeatmap
out_overview$Graphical$OrderedHeatmap

6.1.3.2 Barplot for rows

out_overview$Graphical$BarplotOrderedMissRows

6.1.3.3 Barplot for columns

out_overview$Graphical$BarplotOrderedMissCols

6.1.3.4 Missing pattern

out_overview$Graphical$MapPattern
## NULL
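
As an illustration of what a missingness pattern is, the sketch below counts how many rows share each combination of missing and observed columns. The toy data and the labels "NA" / "obs" are assumptions made for the example; this is not the code that builds MapPattern.

```r
# Toy data frame in which two rows share the same pattern (hypothetical values)
toy <- data.frame(a = c(1, NA, 3, NA, 5),
                  b = c(NA, NA, 6, 8, NA),
                  c = c(9, 10, 11, 12, 13))

miss <- is.na(toy)

# Encode each row as a string describing which columns are missing
pattern_key <- apply(miss, 1, function(r) {
  paste(ifelse(r, "NA", "obs"), collapse = " | ")
})

# Count the rows per pattern, most frequent pattern first
pattern_counts <- sort(table(pattern_key), decreasing = TRUE)

print(pattern_counts)
```

Here rows 1 and 5 are both missing only column b, so they fall into the same pattern.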

6.1.4 On a data.frame with no missing values

proteinsNoNA <- read.table("../forge/backend/R/data/protein.csv",
                            row.names = 1, header = TRUE)
out_overview_NoNA <- r_missing_overview("proteinsNoNA")
# out_viz <- r_missing_visualize(out_summ)

6.2 Remove missing values

remove_workflow

There are two steps to remove missing values:

  1. Show the rows or columns that will be removed if they match the parameters
  2. Effectively remove the rows/columns and create a new dataset

Accordingly, there are two functions:

  1. r_missing_showremove, which takes 3 arguments:
     • The name of the dataset (datasetName)
     • The dimension along which to filter missing values (direction): one of "rows" or "columns". With "rows", rows containing missing values are removed; with "columns", columns containing missing values are removed.
     • The threshold used to filter the rows or columns (threshold), as a percentage of missing values. For example, a threshold of 10 in the "rows" direction removes the rows having strictly more than 10% of missing values.
  2. r_missing_extractremove, which takes 2 arguments:
     • The name of the object returned by r_missing_showremove (objectName)
     • A user-defined name, if the user wants to rename the new data.frame (userName).
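
The threshold rule described above ("strictly more than") can be sketched in base R. The function show_remove_sketch() below is hypothetical and only mimics the filtering logic; it is not r_missing_showremove and returns a simplified result.

```r
# Hypothetical helper: flag rows or columns whose percentage of missing
# values is strictly greater than the threshold
show_remove_sketch <- function(data, direction = c("rows", "columns"),
                               threshold = 0) {
  direction <- match.arg(direction)
  miss <- is.na(data)
  if (direction == "rows") {
    pct <- 100 * rowMeans(miss)
  } else {
    pct <- 100 * colMeans(miss)
  }
  to_remove <- pct > threshold  # strict inequality
  if (direction == "rows") {
    kept <- data[!to_remove, , drop = FALSE]
  } else {
    kept <- data[, !to_remove, drop = FALSE]
  }
  list(removed = names(pct)[to_remove],
       kept = kept,
       n_missing_before = sum(miss),
       n_missing_after = sum(is.na(kept)))
}

# Toy data frame (hypothetical values): rows have 25%, 50%, 0% and 25% of NA
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c(NA, NA, 6, 8),
                  c = c(9, 10, 11, 12),
                  d = c(1, 2, 3, 4),
                  row.names = paste0("ind", 1:4))

res_rows <- show_remove_sketch(toy, "rows", threshold = 25)
res_cols <- show_remove_sketch(toy, "columns", threshold = 50)

print(res_rows$removed)
print(res_rows$kept)
```

With a threshold of 25, only ind2 (50% missing) is removed: ind1 and ind4 have exactly 25% of missing values and are kept, so some NA remain in the filtered data.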

6.2.1 Function r_missing_showremove

proteinsNoNA_rows <- r_missing_showremove("proteinsMNAR", "rows", threshold = 46)

The resulting object has 3 components (Object, Graphical and Table). Only Graphical and Table are passed to the interface by r_wrapp.

6.2.1.1 Graphical

In Graphical, one heatmap is produced from the original dataset, showing the missing values and the rows or columns that will be removed according to the parameters:

jsonview::json_tree_view(proteinsNoNA_rows$Graphical, height = "150px")
knitr::include_graphics(proteinsNoNA_rows$Graphical$beforeHeatMap$path)

6.2.1.2 Table

A summary table of the removal that would be produced with the given parameters is also returned:

proteinsNoNA_rows$Table$statsRemove$data
##                                                         label number
## 1                  Number of individuals that will be removed     71
## 2                     Number of individuals that will be kept    308
## 3 Number of individuals that will be kept with remaining NA's    119
## 4                     Number of missings cells before removal  12298
## 5                   Number of missing cells that will be kept   7399

In JSON as returned for the interface:

jsonview::json_tree_view(proteinsNoNA_rows$Table, scroll = TRUE)

6.2.2 Function r_missing_extractremove

This function is meant to be called after r_missing_showremove. It creates the dataset obtained after removing the rows/columns according to the parameters.

tt <- r_wrapp("r_missing_showremove", "proteinsMNAR", "rows", threshold = 46)
proteinsNoNA_extract <- r_wrapp("r_missing_extractremove", "MissShowRemove_1")

Two components are returned by the function: Object and Table.

The Table component only contains the name of the new dataset:

jsonview::json_tree_view(proteinsNoNA_extract, height = "100px")

6.3 Impute missing values

r_missing_impute

The function r_missing_impute takes 3 arguments:

  • The name of the dataset (datasetName)
  • The method of imputation: "pca", "knn" or "zero" (method)
  • The type of variable on which to perform imputation: "n" (numerical), "c" (categorical), "m" (mixed) or "a" (auto = the most represented type) (typeVar)
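
The "a" (auto) option picks the most represented variable type. A possible reading of that rule is sketched below. The function resolve_type_sketch() is hypothetical, and both the way types are detected and the tie-breaking in favour of "n" are assumptions that the source does not specify.

```r
# Hypothetical helper: return "n" if numeric columns are at least as
# numerous as the other columns, "c" otherwise (tie-breaking is an assumption)
resolve_type_sketch <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  if (sum(is_num) >= sum(!is_num)) "n" else "c"
}

# Two numeric columns and one factor: numeric is the most represented type
mostly_numeric <- data.frame(x = c(1, NA, 3),
                             y = c(0.5, 1, NA),
                             g = factor(c("a", "b", NA)))

# Two character columns and one numeric column
mostly_categorical <- data.frame(g = c("a", "b", NA),
                                 h = c("u", NA, "v"),
                                 x = c(1, 2, 3))

type_num <- resolve_type_sketch(mostly_numeric)
type_cat <- resolve_type_sketch(mostly_categorical)
print(c(type_num, type_cat))
```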

Methods of imputation:

  • “pca”: missMDA::MIPCA()$res.imputePCA
  • “knn”: VIM::kNN()
  • “zero”: replace NA’s by zeros
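
Of the three methods, "zero" is the simplest and can be sketched in base R. The function impute_zero_sketch() below is hypothetical: it only replaces NA by 0 in numeric columns and leaves other columns untouched, which is an assumption about how the method treats non-numeric variables.

```r
# Hypothetical helper: replace NA by 0 in numeric columns only
impute_zero_sketch <- function(data) {
  num_cols <- vapply(data, is.numeric, logical(1))
  out <- data
  if (any(num_cols)) {
    out[num_cols] <- lapply(out[num_cols],
                            function(x) replace(x, is.na(x), 0))
  }
  out
}

# Toy data frame with one numeric and one character column (hypothetical values)
toy <- data.frame(a = c(1, NA, 3),
                  g = c("x", NA, "z"),
                  stringsAsFactors = FALSE)

imputed <- impute_zero_sketch(toy)

# Number of values that were imputed
n_imputed <- sum(is.na(toy)) - sum(is.na(imputed))

print(imputed)
print(n_imputed)
```

For the other two methods, the calls named in the list above (missMDA::MIPCA() and VIM::kNN()) would take the place of the replacement step.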

6.3.1 Example: imputation with the PCA method

proteinsNoNA_pca <- r_wrapp("r_missing_impute", "proteinsMNAR", "pca")

This produces two objects: a dataset with imputed values and a MissImpute object. It returns only the Table and Graphical of the MissImpute object to the interface:

jsonview::json_tree_view(proteinsNoNA_pca, scroll = TRUE)

The ImputeInfo table contains this information:

imputeMissing_1$ImputeInfo$data
##                                          indicator   value
## 1       Types of variables that have been imputed: numeric
## 2 Number of variables with imputed missing values:     379
## 3                Number of imputed missing values:   12298

And the heatmap, always in png:

knitr::include_graphics(imputeMissing_1$Graphical$Heatmap$path)

The "knn" and "zero" methods work in the same way and return the same kind of output.

If the chosen type is not numerical (i.e. typeVar = "c" or "a" leading to "c"), there is no graph in the output.

6.4 STILL TODO

  • write the function r_missing_exploreImputed to explore an imputed dataset.

  • add an imputation method for left-censored data.

The interface can be operational without these 2 developments.