Chapter 14 Differential analysis

This chapter describes the steps used to test all numeric variables of a dataset for differences between the levels of a categorical variable.

14.1 Workflow structure

First, the user runs multiple tests with the function r_multiple_tests to test global differences between groups. The tests are chosen according to the nature of the data:

  • for rna-count or metagenomics-count, if the data have not been log-transformed, they are processed with the GLM model from edgeR to detect genes that are globally different across the levels of the factor. If the data have been log-transformed, the tests fall back to the generic case;

  • for the generic case, either parametric (ANOVA or Student) or nonparametric (Kruskal-Wallis or Mann-Whitney) tests are performed.
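As a rough illustration of the generic case, the sketch below runs a one-way ANOVA and a Kruskal-Wallis test on every numeric column of a simulated dataset, using base R only. It is not the implementation of r_multiple_tests: the objects group, X, p_anova and p_kw are hypothetical names introduced for the example, and the data are simulated.

```r
set.seed(1)

# Simulated toy data (not an ASTERICS dataset): 12 individuals, 3 levels
group <- factor(rep(c("A", "B", "C"), each = 4))
X <- data.frame(
  v1 = rnorm(12),
  v2 = rnorm(12),
  v3 = rnorm(12) + 5 * as.integer(group)  # variable with a strong group effect
)

# Parametric version: one-way ANOVA p-value for each numeric variable
p_anova <- sapply(X, function(x) anova(lm(x ~ group))[1, "Pr(>F)"])

# Nonparametric version: Kruskal-Wallis p-value for each numeric variable
p_kw <- sapply(X, function(x) kruskal.test(x, group)$p.value)

round(cbind(p_anova, p_kw), 4)
```

With a two-level factor, the analogous base R calls would be t.test and wilcox.test.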

In addition, depending on the nature of data, some warning messages are displayed:

  • for rna-count or metagenomics-count, if the data are log-transformed, the user is advised to use count data;

  • for rna-count or metagenomics-count, if the data are not normalized (the attribute norm_factors is not present in the object), the user is advised to perform normalization first. The message also specifies which normalization is suited to the data;

  • for microarray, if the data are not normalized or not log-transformed, the user is advised to do both;

  • for metabolite-compo and metagenomics-compo, if the data are not normalized, the user is warned about a potential compositional bias in the analysis and is advised to perform a standard transformation (CLR, ILR, …), which, in ASTERICS, is considered as a normalization for these data.
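For readers unfamiliar with the CLR transformation mentioned above, here is a minimal base R sketch, assuming a matrix of strictly positive compositions with individuals in rows. The function clr and the matrix comp are hypothetical names for illustration and do not correspond to ASTERICS code; zeros would need to be handled beforehand.

```r
# Centered log-ratio (CLR): log of each value minus the mean log of its row
clr <- function(x) {
  logx <- log(x)
  sweep(logx, 1, rowMeans(logx))  # subtract each row's mean log value
}

# Toy compositions (each row sums to 1), made up for the example
comp <- matrix(c(0.2, 0.3, 0.5,
                 0.1, 0.1, 0.8),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("ind1", "ind2"), c("a", "b", "c")))

comp_clr <- clr(comp)
rowSums(comp_clr)  # CLR-transformed rows sum to zero (up to rounding)
```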

Second, if positive results have been obtained from this first analysis, the user can (in some cases) also perform:

  • r_posthoc_tests: posthoc pairwise comparisons between levels of the categorical variable if it has more than two levels (the same decisions regarding the type of tests and the corresponding warnings are used);

  • r_posthoc_plots: a forest plot representing the pairwise differences of a given (significant) numeric variable (only for posthoc tests relative to 1-way ANOVA).

In addition, the function r_extract_diff is available to extract the results of the analysis, either the \(p\)-value table or a subset of the original numeric variables containing only differential variables.

14.2 Used datasets

14.2.1 Generic case

The workflow is illustrated on the datasets clinical and proteins, which contain mixed-type variables and numeric variables, respectively.

out_import1 <- r_wrapp("r_import", 
                       input = "../forge/backend/R/data/protein.csv",
                       data.name = "proteins",
                       nature = "protein quantification", header = TRUE,
                       row.names = 1, sep = " ", normalized = "yes")
out_import2 <- r_wrapp("r_import", 
                       input = "../forge/backend/R/data/clinical.csv",
                       data.name = "clinical", nature = "phenotypes",
                       header = TRUE, row.names = 1, sep = ",")

Both datasets need to be combined in a multipleDA object, as for PLS-DA analysis. This object also records the so-called target variable (the categorical variable, which here is one of the variables of clinical).

out_combtgt <- r_wrapp("r_combine_target", X = "proteins", Y = "clinical",
                       target = "patient.samples.sample.2.sample_type")

14.2.2 Count case

In addition, the count case will be illustrated with fake (simulated) count data:

# input_file <- "../forge/backend/R/data/fakeRNAseq.csv"
# out <- r_wrapp("r_import", input_file, data.name = "myrnaseq", row.names = 1, 
#                header = TRUE, nature = "rna-count", logt = "no", 
#                normalized = "no")
# input_file2 <- "../forge/backend/R/data/fakeCategorical.csv"
# out <- r_wrapp("r_import", input_file2, data.name = "categoricals", 
#                row.names = 1, header = TRUE, nature = "generic")
tmprnaseq <- data.frame(matrix(abs(40*rnorm(7*1000)), nrow = 7))
rownames(tmprnaseq) <- paste0("ind", 1:nrow(tmprnaseq))
colnames(tmprnaseq) <- paste0("Var", 1:ncol(tmprnaseq))
write.csv(tmprnaseq, file = "../forge/backend/R/data/tmpRNASEQ.csv")
input_file <- "../forge/backend/R/data/tmpRNASEQ.csv"
out <- r_wrapp("r_import", input_file, data.name = "myrnaseq", row.names = 1, 
               header = TRUE, nature = "rna-count", logt = "no", sep=",",
               normalized = "no")
categoricals <- data.frame("version1" = rep(c("A", "B", "C"), each=3),
                           "version2" = rep(c("A", "B", "C"), 3))
rownames(categoricals) <- paste0("ind", 1:nrow(categoricals))
## counts, normalized, more than 2 levels
out_norm <- r_wrapp("r_norm_dataset", "myrnaseq", template = "rnaseqcount",
                    has_log = FALSE)
out_norm_extract <- r_wrapp("r_extract_dataset", "normalized_1")
out_combtgt2 <- r_wrapp("r_combine_target", "normalized_1", "categoricals", 
                        "version2")

14.3 Multiple tests

14.3.1 Input

The inputs of r_multiple_tests are:

  • datasetName: the name of an object that is an output of r_combine_target, as described before;

  • type: type of the tests to use. It can be "p" for parametric, "n" for non-parametric, "a" for automatic (the latter being the default value), or "c" for counts. If "a" is selected, then type "n" or "p" is automatically attributed depending on the results of normality tests. "c" is systematically used for count datasets that have not been log-transformed;

  • correction: the multiple testing correction method for parametric tests. It can be "BH" (the default value, for control of the FDR), "bonferroni" (for control of the FWER), or "none" (no correction);

  • threshold: the risk threshold at which a test is considered significant (based on corrected p-values). The default value is \(0.05\).
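To make the effect of correction and threshold concrete, the sketch below applies the three correction methods to a toy vector of p-values with the base R function p.adjust. The p-values are made up for the example, and the strict comparison with the threshold is an assumption, not a statement about how r_multiple_tests decides significance.

```r
# Toy raw p-values (made up for the example)
pvals <- c(Var1 = 0.001, Var2 = 0.008, Var3 = 0.02, Var4 = 0.03, Var5 = 0.5)
threshold <- 0.05

# One column of adjusted p-values per correction method
adjusted <- sapply(c("BH", "bonferroni", "none"),
                   function(m) p.adjust(pvals, method = m))
adjusted

# Number of variables declared significant under each method
# (assumption: significant means adjusted p-value < threshold)
colSums(adjusted < threshold)
```

On this toy vector, Bonferroni keeps fewer variables than BH, which reflects the stricter FWER control.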

14.3.2 Output

out_multiple <- r_multiple_tests("multipleDA_1", type = "p", threshold = 0.8)
out_multiple_wrapp <- r_wrapp("r_multiple_tests", "multipleDA_1", type ="p",
                              threshold = 0.8)
names(out_multiple)
## [1] "Table"     "Graphical" "Object"

The output of r_multiple_tests function is a list including:

14.3.2.1 Table elements

  • TestView, a JSON table containing the names of all tested numeric variables, the test statistics, the p-values, the adjusted p-values, and (optionally, only if the categorical variable has two levels) the sign of the differences between means (parametric tests) or medians (nonparametric tests). This table is of type BasicTable:
jsonview::json_tree_view(
    jsonlite::toJSON(out_multiple$Table$TestView, pretty = TRUE, 
                     auto_unbox = TRUE), 
    scroll = TRUE)
  • DiffVar, a list containing the names of the differentially expressed variables
head(out_multiple$Table$DiffVar)
## [1] "4E.BP1_pS65" "4E.BP1_pT37" "4E.BP1_pT70" "A.Raf_pS299" "ASNS"       
## [6] "Akt_pS473"
  • AllVariables (not to be displayed on the interface): the list of all variables that have been tested
head(out_multiple$Table$AllVariables)
## [1] "14.3.3_epsilon" "4E.BP1"         "4E.BP1_pS65"    "4E.BP1_pT37"   
## [5] "4E.BP1_pT70"    "53BP1"
  • Posthoc (not to be displayed on the interface): a variable coded "yes" if the categorical variable has more than two levels and some variables are found differentially expressed (so that posthoc tests can be performed), "no" otherwise
out_multiple$Table$Posthoc
## [1] "yes"
  • Posthocplot (not to be displayed on the interface): a variable coded "yes" if Posthoc is "yes" and the performed test is parametric (so that forest plots for posthoc tests can be obtained), "no" otherwise
out_multiple$Table$Posthocplot
## [1] "yes"
  • HasSign (not to be displayed on the interface): a variable coded "yes" if the categorical variable has only two levels (so that the table TestView has a sign column), "no" otherwise
out_multiple$Table$HasSign
## [1] "no"
  • ObjectName: the name of the analysis object, to be passed to further analyses
out_multiple$Table$ObjectName
## [1] "DiffAnalysis_1"
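The three flags Posthoc, Posthocplot and HasSign can be summarized by a simple decision rule. The helper below is hypothetical (diff_flags is not an ASTERICS function) and only restates the descriptions given in the list above; its arguments n_levels, n_diff and type are names chosen for the example.

```r
# Hypothetical helper restating the flag definitions of this section
diff_flags <- function(n_levels, n_diff, type) {
  yn <- function(b) if (b) "yes" else "no"
  # Posthoc tests need more than two levels and at least one differential variable
  posthoc <- n_levels > 2 && n_diff > 0
  # Forest plots are only available after parametric tests
  posthocplot <- posthoc && type == "p"
  # A sign column only makes sense for a two-level factor
  has_sign <- n_levels == 2
  list(Posthoc = yn(posthoc), Posthocplot = yn(posthocplot), HasSign = yn(has_sign))
}

# Four levels, parametric tests, some differential variables
str(diff_flags(n_levels = 4, n_diff = 10, type = "p"))
# Three levels, count tests, some differential variables
str(diff_flags(n_levels = 3, n_diff = 5, type = "c"))
```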

14.3.2.2 Graphical elements

  • Histogram, a plot made of side-by-side histograms (p-values on the left, adjusted p-values on the right)
out_multiple$Graphical$Histogram
  • Heatmap, a heatmap with individuals and levels of the tested categorical variable in rows, and differential numeric variables in columns. If no variable is found differential, this plot is not provided
out_multiple$Graphical$Heatmap

14.3.2.3 Object elements

Calling r_multiple_tests through r_wrapp creates two objects DiffMultiple_* and DiffAnalysis_*:

  • DiffMultiple, a hidden object that contains the table of test results (similar to what is in $Table$TestView but under the form of a data.frame)

  • DiffAnalysis, a hidden object made of the elements in $Table and $Graphical, displayed in the workspace DAG and passed to further functions.

14.3.3 Count test data

Count data tests are performed with:

out_multiple2 <- r_multiple_tests("multipleDA_2", type = "c", threshold = 0.7)

The output contains the same information as in the generic case. In particular,

out_multiple2$Table$Posthoc
## [1] "yes"
out_multiple2$Table$Posthocplot
## [1] "no"
out_multiple2$Table$HasSign
## [1] "no"

14.3.4 Interface specification

  • The threshold has to be a number between \(0\) and \(1\).

14.4 Posthoc tests

14.4.1 Input

r_posthoc_tests takes a single input (datasetName), which is the ObjectName returned by r_multiple_tests.

14.4.2 Output

out_posthoc <- r_posthoc_tests("DiffAnalysis_1")
out_posthoc_wrapp <- r_wrapp("r_posthoc_tests", "DiffAnalysis_1")
names(out_posthoc)
## [1] "Table"  "Object"

The output of r_posthoc_tests function is a list including:

14.4.2.1 Table elements

  • TestView, a JSON table containing names of the posthoc-tested numeric variables, conditions, p-values and sign of the differences between means (in case parametric tests were performed) or differences between medians (otherwise). This table is of type BasicTable
jsonview::json_tree_view(
    jsonlite::toJSON(out_posthoc$Table$TestView, pretty = TRUE, 
                     auto_unbox = TRUE), 
    scroll = TRUE)
  • SignifTestView, a table containing the names of the posthoc-tested numeric variables and all conditions found significant. This table is of type BasicTable
jsonview::json_tree_view(
    jsonlite::toJSON(out_posthoc$Table$SignifTestView, pretty = TRUE,
                     auto_unbox = TRUE),
    scroll = TRUE)
  • Conditions (not to be displayed on the interface): a vector with the names of all tested pairs of conditions
out_posthoc$Table$Conditions
## [1] "metastatic-blood derived normal"         
## [2] "primary tumor-blood derived normal"      
## [3] "solid tissue normal-blood derived normal"
## [4] "primary tumor-metastatic"                
## [5] "solid tissue normal-metastatic"          
## [6] "solid tissue normal-primary tumor"
  • HasSign (not to be displayed on the interface): a variable always equal to "yes", indicating that posthoc tests always have a sign column in their $Table$TestView
out_posthoc$Table$HasSign
## [1] "yes"
  • ObjectName: the name of the analysis object, to be passed to further analyses
out_posthoc$Table$ObjectName
## [1] "DiffPosthocAnalysis_1"

14.4.2.2 Object elements

Calling r_posthoc_tests through r_wrapp creates two objects DiffPosthocTable_* and DiffPosthocAnalysis_*:

  • DiffPosthocTable, a hidden object that contains the table of test results (similar to what is in $Table$TestView but under the form of a data.frame)

  • DiffPosthocAnalysis, a hidden object made of the elements in $Table, displayed in the workspace DAG and passed to further functions.

14.4.3 Interface specification

This function must not be called unless the output of r_multiple_tests has a $Table$Posthoc value equal to "yes".

14.5 Forest plots for posthoc tests

14.5.1 Input

The inputs of r_posthoc_plots function are:

  • datasetName: the $Table$ObjectName as given in the output of r_multiple_tests;

  • variableName: the name of the numeric variable (among the names of the variables for which a differential analysis has been performed and has led to a significant result, as in $Table$DiffVar) to conduct pairwise comparisons on.

14.5.2 Output

out_forest <- r_posthoc_plots("DiffAnalysis_1", "4E.BP1")
out_forest_wrapp <- r_wrapp("r_posthoc_plots", "DiffAnalysis_1", "4E.BP1")
names(out_forest)
## [1] "Graphical"

The output of the r_posthoc_plots function is a list with only a Graphical element (that is not stored in object_db) that contains the forest plot:

out_forest$Graphical$ForestPlot
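A forest plot of this kind displays, for each pair of levels, an estimated difference with its confidence interval. The sketch below computes such quantities with the base R function TukeyHSD after a one-way ANOVA on simulated data. It is given as an assumption about the type of information displayed, not as the code behind r_posthoc_plots; the data frame d and the objects fit and tk are hypothetical.

```r
set.seed(42)

# Simulated toy data: 3 levels, 10 individuals each, level C shifted by 2
d <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 10)),
  y = rnorm(30) + rep(c(0, 0, 2), each = 10)
)

# One-way ANOVA followed by Tukey's honest significant differences
fit <- aov(y ~ group, data = d)
tk <- TukeyHSD(fit)$group

# One row per pair of levels: difference, interval bounds, adjusted p-value
tk
```

Each row is named after the compared pair (for example "B-A"), and the columns diff, lwr and upr give the point estimate and interval that a forest plot would draw.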

14.5.3 Interface specification

This function must not be called unless the output of r_multiple_tests has a $Table$Posthocplot value equal to "yes".

14.6 Data extraction

14.6.1 Input

The inputs of r_extract_diff function are:

  • datasetName: the name of an object that is the $Table$ObjectName of either the output of r_multiple_tests or the output of r_posthoc_tests;
  • userName: the name given to the object by the user;
  • type: the output type that can be either a subset of the original numeric dataset "dataset" (default value) or the test table "results";
  • conditions (only if posthoc tests are passed as inputs): the conditions of interest to filter either the test results or the significant variables;
  • sign (used only if type == "results" and $Table$HasSign of the output is "yes"): sign of the variables (over-expressed, under-expressed or both) selected for the output. It is coded as "+", "-", or "both" (default value);
  • variables (used only if type == "results" and optional): list of variables to select for the output;
  • threshold (used only if type == "results" and optional): maximum adjusted p-value to keep (risk threshold). If NULL, the default value is the threshold specified for the input analysis.

14.6.2 Output

out_extract <- r_extract_diff("DiffPosthocAnalysis_1", threshold = 0.7,
                              conditions = "metastatic-blood derived normal",
                              type = "results")
out_extract_wrapp <- r_wrapp("r_extract_diff", "DiffPosthocAnalysis_1", 
                             threshold = 0.7, 
                             conditions = "metastatic-blood derived normal",
                             type = "results")
names(out_extract)
## [1] "Object" "Table"

The output of r_extract_diff function is a list including:

14.6.2.1 Table elements

ObjectName and UserName, the names of the created object, for object_db and for the user, respectively:

out_extract$Table$ObjectName
## [1] "DiffExtractRes_1"
out_extract$Table$UserName
## [1] "DiffExtractRes_1"

14.6.2.2 Object elements

A data frame added to the user’s workspace:

head(out_extract$Object$OBJextract)
##             variable                       condition       pVal   pValCor sign
## 85   GSK3.alpha.beta metastatic-blood derived normal 0.02977477 0.6483371    1
## 193 Transglutaminase metastatic-blood derived normal 0.03324805 0.6483371    1

14.6.3 Interface specification

Combinations of options depend on the source and on the type of output:

  • sign is needed only when the corresponding analysis (multiple test or posthoc test) returns $Table$HasSign equal to "yes". In this case, its value can be "both", "+" or "-". In the other cases, it is automatically "both";

  • conditions is only valid for posthoc tests. By default, all pairs of conditions are returned (for conditions = NULL);

  • variables is only valid for outputs of type "results". As for conditions, all variables are returned by default (for variables = NULL);

  • threshold is only valid for outputs of type "results". The advised default value is 1, which corresponds to returning all results.
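The way these options combine can be sketched with a small filter on a toy results table. The function filter_results and the table res are hypothetical and written for illustration only. The column names mimic those shown in the r_extract_diff output above, the values are made up, and coding the sign as +1 / -1 is an assumption.

```r
# Toy posthoc results table (values made up for the example)
res <- data.frame(
  variable  = c("v1", "v2", "v3", "v4"),
  condition = c("B-A", "B-A", "C-A", "C-A"),
  pValCor   = c(0.01, 0.20, 0.03, 0.60),
  sign      = c(1, -1, -1, 1),   # assumption: +1 / -1 coding of the sign
  stringsAsFactors = FALSE
)

# Hypothetical filter mirroring the options conditions, sign, variables, threshold
filter_results <- function(res, conditions = NULL, sign = "both",
                           variables = NULL, threshold = 1) {
  keep <- res$pValCor <= threshold
  if (!is.null(conditions)) keep <- keep & res$condition %in% conditions
  if (!is.null(variables))  keep <- keep & res$variable %in% variables
  if (sign == "+") keep <- keep & res$sign > 0
  if (sign == "-") keep <- keep & res$sign < 0
  res[keep, , drop = FALSE]
}

# Defaults return every row; options restrict the output
filter_results(res)
filter_results(res, conditions = "C-A", sign = "-", threshold = 0.05)
```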