Chapter 14 Differential analysis
This file describes the different steps of a statistical test of all numeric variables in a dataset with respect to the levels of a categorical variable.
14.1 Workflow structure
First, the user starts with multiple tests, which are set depending on the data nature, using the function r_multiple_tests to test global differences between groups:
- for rna-count or metagenomics-count, if the data have not been log-transformed, they are processed with the GLM model from edgeR to detect genes that are globally different across the levels of the factor. If the data have been log-transformed, multiple tests fall back to the generic case;
- for the generic case, either parametric (ANOVA or Student) or nonparametric (Kruskal-Wallis / Mann-Whitney) tests are performed.
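The generic case above can be sketched with base R only. This is a minimal illustration, not the ASTERICS implementation: the helper `global_test` and the simulated data are hypothetical, and the exact variants used by r_multiple_tests (for instance Welch versus equal-variance Student test) are assumptions.

```r
# Hypothetical helper (not an ASTERICS function): global test of one numeric
# variable x across the levels of a factor, parametric or nonparametric.
global_test <- function(x, group, parametric = TRUE) {
  group <- droplevels(as.factor(group))
  two_levels <- nlevels(group) == 2
  res <- if (parametric && two_levels) {
    t.test(x ~ group)                         # Student test (Welch variant, R's default)
  } else if (parametric) {
    oneway.test(x ~ group, var.equal = TRUE)  # classical one-way ANOVA F test
  } else if (two_levels) {
    wilcox.test(x ~ group)                    # Mann-Whitney
  } else {
    kruskal.test(x ~ group)                   # Kruskal-Wallis
  }
  c(statistic = unname(res$statistic), p.value = res$p.value)
}

# Simulated data (illustrative only): v2 is shifted in group C, v1 is not.
set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 10))
X <- data.frame(v1 = rnorm(30),
                v2 = rnorm(30) + rep(c(0, 0, 3), each = 10))

# One row per tested variable, with the test statistic and the raw p-value.
res_param <- t(sapply(X, global_test, group = group))
res_nonparam <- t(sapply(X, global_test, group = group, parametric = FALSE))
```

Each numeric variable is tested independently, which is why a multiple testing correction is needed afterwards.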
In addition, depending on the nature of the data, some warning messages are displayed:
- for rna-count or metagenomics-count, if the data are log-transformed, the user is advised to use count data;
- for rna-count or metagenomics-count, if the data are not normalized (the attribute norm_factors is not present in the object), the user is advised to perform normalization first. The message also specifies which normalization is suited to the data;
- for microarray, if the data are not normalized or not log-transformed, the user is advised to do both;
- for metabolite-compo and metagenomics-compo, if the data are not normalized, the user is warned about a potential compositional bias in the analysis and is advised to perform a standard transformation (CLR, ILR, …), which, in ASTERICS, is considered as a normalization for these data.
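As an illustration of the CLR transformation mentioned for compositional data, here is a minimal base R sketch. The helper `clr` and the pseudo-count used to handle zeros are assumptions made for this example; they do not describe how ASTERICS implements the transformation.

```r
# Hypothetical helper: centred log-ratio (CLR) transform, one sample per row.
# The pseudo-count avoids log(0); its value is an arbitrary choice here.
clr <- function(x, pseudo = 0.5) {
  lx <- log(as.matrix(x) + pseudo)
  sweep(lx, 1, rowMeans(lx))  # subtract the mean log of each sample
}

# Toy compositional data (illustrative values only).
compo <- matrix(c(10, 20, 70,
                   0, 50, 50),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("ind1", "ind2"), c("Var1", "Var2", "Var3")))
compo_clr <- clr(compo)
```

After the transformation, the values of each sample sum to zero, which removes the constant-sum constraint of the raw compositions.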
Second, if positive results have been obtained from this first analysis, the user can (in some cases) also perform:
- r_posthoc_tests: posthoc pairwise comparisons between levels of the categorical variable if it has more than two levels (the same decisions regarding the type of tests and the corresponding warnings are used);
- r_posthoc_plots, to represent pairwise differences of a given numeric (significant) variable with a forest plot (only for posthoc tests relative to 1-way ANOVA).
In addition, the function r_extract_diff is available to extract the results of the analysis, either the \(p\)-value table or a subset of the original numeric variables containing only differential variables.
14.2 Used datasets
14.2.1 Generic case
The workflow will be illustrated on the datasets clinical and proteins, which contain mixed-type variables and numeric variables, respectively.
out_import1 <- r_wrapp("r_import",
                       input = "../forge/backend/R/data/protein.csv",
                       data.name = "proteins",
                       nature = "protein quantification", header = TRUE,
                       row.names = 1, sep = " ", normalized = "yes")
out_import2 <- r_wrapp("r_import",
                       input = "../forge/backend/R/data/clinical.csv",
                       data.name = "clinical", nature = "phenotypes",
                       header = TRUE, row.names = 1, sep = ",")
Both datasets need to be combined in a multipleDA object, as for PLS-DA analysis. This object also includes information on the so-called target variable (the categorical variable, which, here, is one of the variables of clinical).
out_combtgt <- r_wrapp("r_combine_target", X = "proteins", Y = "clinical",
                       target = "patient.samples.sample.2.sample_type")
14.2.2 Count case
In addition, the count case will be illustrated with fake (simulated) count data:
# input_file <- "../forge/backend/R/data/fakeRNAseq.csv"
# out <- r_wrapp("r_import", input_file, data.name = "myrnaseq", row.names = 1,
# header = TRUE, nature = "rna-count", logt = "no",
# normalized = "no")
# input_file2 <- "../forge/backend/R/data/fakeCategorical.csv"
# out <- r_wrapp("r_import", input_file2, data.name = "categoricals",
# row.names = 1, header = TRUE, nature = "generic")
tmprnaseq <- data.frame(matrix(abs(40*rnorm(7*1000)), nrow = 7))
rownames(tmprnaseq) <- paste0("ind", 1:nrow(tmprnaseq))
colnames(tmprnaseq) <- paste0("Var", 1:ncol(tmprnaseq))
# write.csv() always uses a comma separator (passing sep only triggers a warning)
write.csv(tmprnaseq, file = "../forge/backend/R/data/tmpRNASEQ.csv")
input_file <- "../forge/backend/R/data/tmpRNASEQ.csv"
out <- r_wrapp("r_import", input_file, data.name = "myrnaseq", row.names = 1,
               header = TRUE, nature = "rna-count", logt = "no", sep = ",",
               normalized = "no")
categoricals <- data.frame("version1" = rep(c("A", "B", "C"), each = 3),
                           "version2" = rep(c("A", "B", "C"), 3))
rownames(categoricals) <- paste0("ind", 1:nrow(categoricals))
## counts, normalized, more than 2 levels
out_norm <- r_wrapp("r_norm_dataset", "myrnaseq", template = "rnaseqcount",
                    has_log = FALSE)
out_norm_extract <- r_wrapp("r_extract_dataset", "normalized_1")
out_combtgt2 <- r_wrapp("r_combine_target", "normalized_1", "categoricals",
                        "version2")
14.3 Multiple tests
14.3.1 Input
The inputs of r_multiple_tests are:
- datasetName: the name of an object that is an output of r_combine_target, as described before;
- type: type of the tests to use. It can be "p" for parametric, "n" for non-parametric, "a" for automatic (the latter being the default value), or "c" for counts. If "a" is selected, then type "n" or "p" is automatically attributed depending on the results of normality tests. "c" is systematically used for count datasets that have not been log-transformed;
- correction: the multiple testing correction method for parametric tests. It can be "BH" (the default value, for control of the FDR), "bonferroni" (for control of the FWER), or "none" (no correction);
- threshold: the risk threshold at which a test is considered as significant (based on corrected p-values). Default value is set to \(0.05\).
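The interplay between type = "a", correction and threshold can be sketched in base R as follows. This is only an illustration under stated assumptions: the normality check (a Shapiro-Wilk test on the residuals of the one-way model), the helper `choose_type` and the simulated data are hypothetical, and the source does not specify which normality test ASTERICS uses.

```r
# Simulated data (illustrative only): 20 variables, the first 3 are shifted
# in group C.
set.seed(42)
group <- factor(rep(c("A", "B", "C"), each = 8))
X <- data.frame(matrix(rnorm(24 * 20), nrow = 24))
X[group == "C", 1:3] <- X[group == "C", 1:3] + 5

# Assumption: "automatic" picks "n" when normality of the residuals is rejected.
choose_type <- function(x, group, alpha = 0.05) {
  if (shapiro.test(residuals(lm(x ~ group)))$p.value < alpha) "n" else "p"
}

# Raw p-value of the global test, for each numeric variable.
pval <- sapply(X, function(x) {
  if (choose_type(x, group) == "p") {
    anova(lm(x ~ group))[["Pr(>F)"]][1]   # one-way ANOVA
  } else {
    kruskal.test(x ~ group)$p.value       # Kruskal-Wallis
  }
})

# Multiple testing correction ("BH", "bonferroni" or "none") and thresholding.
padj <- p.adjust(pval, method = "BH")
threshold <- 0.05
diff_var <- names(padj)[padj <= threshold]
```

Here, `diff_var` plays the role of the list of differential variables: a variable is kept when its corrected p-value does not exceed the threshold.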
14.3.2 Output
out_multiple <- r_multiple_tests("multipleDA_1", type = "p", threshold = 0.8)
out_multiple_wrapp <- r_wrapp("r_multiple_tests", "multipleDA_1", type = "p",
                              threshold = 0.8)
names(out_multiple)
## [1] "Table" "Graphical" "Object"
The output of the r_multiple_tests function is a list including:
14.3.2.1 Table elements
- TestView, a JSON table containing the names of all tested numeric variables, the test statistics, the p-values, the adjusted p-values, and (optionally, only if the categorical variable has two levels) the sign of the differences between means (parametric tests) or medians (nonparametric tests). This table is of type BasicTable:
jsonview::json_tree_view(
  jsonlite::toJSON(out_multiple$Table$TestView, pretty = TRUE,
                   auto_unbox = TRUE),
  scroll = TRUE)
- DiffVar, a list containing the names of the differentially expressed variables
head(out_multiple$Table$DiffVar)
## [1] "4E.BP1_pS65" "4E.BP1_pT37" "4E.BP1_pT70" "A.Raf_pS299" "ASNS"
## [6] "Akt_pS473"
- AllVariables (not to be displayed on the interface): the list of all variables that have been tested
head(out_multiple$Table$AllVariables)
## [1] "14.3.3_epsilon" "4E.BP1" "4E.BP1_pS65" "4E.BP1_pT37"
## [5] "4E.BP1_pT70" "53BP1"
- Posthoc (not to be displayed on the interface): a variable coded "yes" if the categorical variable has more than two groups and some variables are found differentially expressed (so that posthoc tests can be performed), "no" otherwise
out_multiple$Table$Posthoc
## [1] "yes"
- Posthocplot (not to be displayed on the interface): a variable coded "yes" if Posthoc is "yes" and the performed test is parametric (so that forest plots for posthoc tests can be obtained), "no" otherwise
out_multiple$Table$Posthocplot
## [1] "yes"
- HasSign (not to be displayed on the interface): a variable coded "yes" if the categorical variable has only 2 levels (so that the table TestView has a sign column), "no" otherwise
out_multiple$Table$HasSign
## [1] "no"
- ObjectName: the name of the analysis object, to be passed to further analyses
out_multiple$Table$ObjectName
## [1] "DiffAnalysis_1"
14.3.2.2 Graphical elements
- Histogram, a plot made of side-by-side histograms (p-values on the left, adjusted p-values on the right)
out_multiple$Graphical$Histogram
- Heatmap, a heatmap with individuals and levels of the tested categorical variable in rows, and differential numeric variables in columns. If no variable is found differential, this plot is not provided
out_multiple$Graphical$Heatmap
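A plot with the same layout as the Histogram element can be sketched with base R graphics. The p-values below are simulated for illustration only, and this sketch does not reproduce the graphical style of the actual ASTERICS output.

```r
# Simulated raw p-values (illustrative only): a few small ones mixed with
# uniform p-values, then corrected with the Benjamini-Hochberg procedure.
set.seed(7)
pval <- c(rbeta(30, 0.5, 8), runif(170))
padj <- p.adjust(pval, method = "BH")

# Side-by-side histograms: raw p-values on the left, adjusted on the right.
breaks <- seq(0, 1, by = 0.05)
op <- par(mfrow = c(1, 2))
hist(pval, breaks = breaks, main = "p-values", xlab = "p-value")
hist(padj, breaks = breaks, main = "Adjusted p-values",
     xlab = "adjusted p-value")
par(op)
```

Comparing both panels shows how the correction shifts the distribution towards 1: an adjusted p-value is never smaller than the corresponding raw p-value.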
14.3.2.3 Object elements
Calling r_multiple_tests through r_wrapp creates two objects, DiffMultiple_* and DiffAnalysis_*:
- DiffMultiple, a hidden object that contains the table of test results (similar to what is in $Table$TestView but in the form of a data.frame);
- DiffAnalysis, a hidden object made of the elements in $Table and $Graphical, displayed in the workspace DAG and passed to further functions.
14.3.3 Count test data
Count data tests are performed with:
out_multiple2 <- r_multiple_tests("multipleDA_2", type = "c", threshold = 0.7)
and contain the same information as for the generic case. In particular,
out_multiple2$Table$Posthoc
## [1] "yes"
out_multiple2$Table$Posthocplot
## [1] "no"
out_multiple2$Table$HasSign
## [1] "no"
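For readers unfamiliar with edgeR, a global test across the levels of a factor could look like the sketch below. It is an assumption about how such a GLM analysis is typically set up, not a description of the calls made by r_multiple_tests; the counts are simulated, and the code only runs if the Bioconductor package edgeR is installed. Note that edgeR expects variables (genes) in rows and individuals in columns.

```r
padj <- NULL  # stays NULL when edgeR is not available
if (requireNamespace("edgeR", quietly = TRUE)) {
  # Simulated counts (illustrative only): 200 variables, 9 individuals.
  set.seed(3)
  n_genes <- 200
  group <- factor(rep(c("A", "B", "C"), each = 3))
  counts <- matrix(rnbinom(n_genes * 9, mu = 50, size = 5), nrow = n_genes,
                   dimnames = list(paste0("Var", 1:n_genes),
                                   paste0("ind", 1:9)))

  y <- edgeR::DGEList(counts = counts, group = group)
  y <- edgeR::calcNormFactors(y)         # normalization factors (TMM by default)
  design <- model.matrix(~ group)
  y <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmFit(y, design)
  # Testing both group coefficients at once gives a global test over the
  # three levels, comparable to an ANOVA.
  lrt <- edgeR::glmLRT(fit, coef = 2:3)
  padj <- p.adjust(lrt$table$PValue, method = "BH")
}
```

Because several coefficients are tested jointly, there is no single direction of change per variable, which is consistent with HasSign being "no" when the factor has more than two levels.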
14.4 Posthoc tests
14.4.1 Input
r_posthoc_tests takes a single input (datasetName), which is the ObjectName returned by r_multiple_tests.
14.4.2 Output
out_posthoc <- r_posthoc_tests("DiffAnalysis_1")
out_posthoc_wrapp <- r_wrapp("r_posthoc_tests", "DiffAnalysis_1")
names(out_posthoc)
## [1] "Table" "Object"
The output of the r_posthoc_tests function is a list including:
14.4.2.1 Table elements
- TestView, a JSON table containing the names of the posthoc-tested numeric variables, the conditions, the p-values and the sign of the differences between means (if parametric tests were performed) or between medians (otherwise). This table is of type BasicTable:
jsonview::json_tree_view(
  jsonlite::toJSON(out_posthoc$Table$TestView, pretty = TRUE,
                   auto_unbox = TRUE),
  scroll = TRUE)
- SignifTestView, a table containing the names of the posthoc-tested numeric variables and all conditions found significant. This table is of type BasicTable:
jsonview::json_tree_view(
  jsonlite::toJSON(out_posthoc$Table$SignifTestView, pretty = TRUE,
                   auto_unbox = TRUE),
  scroll = TRUE)
- Conditions (not to be displayed on the interface): a vector with the names of all tested pairs of conditions
out_posthoc$Table$Conditions
## [1] "metastatic-blood derived normal"
## [2] "primary tumor-blood derived normal"
## [3] "solid tissue normal-blood derived normal"
## [4] "primary tumor-metastatic"
## [5] "solid tissue normal-metastatic"
## [6] "solid tissue normal-primary tumor"
- HasSign (not to be displayed on the interface): a variable always equal to "yes", indicating that posthoc tests always have a sign column in their $Table$TestView
out_posthoc$Table$HasSign
## [1] "yes"
- ObjectName: the name of the analysis object, to be passed to further analyses
out_posthoc$Table$ObjectName
## [1] "DiffPosthocAnalysis_1"
14.4.2.2 Object elements
Calling r_posthoc_tests through r_wrapp creates two objects, DiffPosthocTable_* and DiffPosthocAnalysis_*:
- DiffPosthocTable, a hidden object that contains the table of test results (similar to what is in $Table$TestView but in the form of a data.frame);
- DiffPosthocAnalysis, a hidden object made of the elements in $Table, displayed in the workspace DAG and passed to further functions.
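To give an idea of what posthoc pairwise comparisons produce for one variable, here is a base R sketch. Using Tukey's HSD for the parametric case and pairwise Wilcoxon tests for the nonparametric case is an assumption of this example; the source does not state which posthoc procedures ASTERICS applies. The data are simulated.

```r
# Simulated variable (illustrative only): group C is shifted.
set.seed(11)
group <- factor(rep(c("A", "B", "C"), each = 10))
x <- rnorm(30) + rep(c(0, 0, 3), each = 10)

# Parametric posthoc (assumption: Tukey's HSD after a 1-way ANOVA).
tuk <- TukeyHSD(aov(x ~ group))$group
posthoc <- data.frame(variable  = "x",
                      condition = rownames(tuk),        # e.g. "B-A"
                      pVal      = tuk[, "p adj"],
                      sign      = sign(tuk[, "diff"]),  # sign of the mean difference
                      row.names = NULL)

# Nonparametric counterpart (assumption: pairwise Wilcoxon tests).
pw <- pairwise.wilcox.test(x, group, p.adjust.method = "BH")
```

The pairs are named "level2-level1", the same pattern as in $Table$Conditions above, and each pair always has a sign, which is why HasSign is always "yes" for posthoc tests.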
14.5 Forest plots for posthoc tests
14.5.1 Input
The inputs of the r_posthoc_plots function are:
- datasetName: the $Table$ObjectName as given in the output of r_multiple_tests;
- variableName: the name of the numeric variable to conduct pairwise comparisons on. It must be one of the variables for which the differential analysis led to a significant result, as listed in $Table$DiffVar.
14.5.2 Output
out_forest <- r_posthoc_plots("DiffAnalysis_1", "4E.BP1")
out_forest_wrapp <- r_wrapp("r_posthoc_plots", "DiffAnalysis_1", "4E.BP1")
names(out_forest)
## [1] "Graphical"
The output of the r_posthoc_plots function is a list with only a Graphical element (that is not stored in object_db), which contains the forest plot:
out_forest$Graphical$ForestPlot
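A forest plot of pairwise differences can be sketched with base R graphics as below. The helper `forest_plot` is hypothetical, the data are simulated, and the confidence intervals come from Tukey's HSD, which is an assumption about the underlying posthoc procedure rather than a description of r_posthoc_plots.

```r
# Simulated variable (illustrative only): group C is shifted.
set.seed(11)
group <- factor(rep(c("A", "B", "C"), each = 10))
x <- rnorm(30) + rep(c(0, 0, 3), each = 10)

# Pairwise mean differences with their confidence intervals.
ci <- TukeyHSD(aov(x ~ group))$group  # columns: diff, lwr, upr, p adj

# Hypothetical helper: one line per pair of conditions, a point for the
# estimated difference and a segment for its confidence interval.
forest_plot <- function(ci, main = "") {
  k <- seq_len(nrow(ci))
  plot(ci[, "diff"], k, xlim = range(ci[, c("lwr", "upr")], 0),
       pch = 19, yaxt = "n", ylab = "",
       xlab = "Difference between means", main = main)
  segments(ci[, "lwr"], k, ci[, "upr"], k)
  abline(v = 0, lty = 2)  # an interval crossing 0 means no significant difference
  axis(2, at = k, labels = rownames(ci), las = 1)
}

forest_plot(ci, main = "x")
```

Confidence intervals on mean differences are what make this plot specific to the parametric (1-way ANOVA) case.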
14.6 Data extraction
14.6.1 Input
The inputs of the r_extract_diff function are:
- datasetName: the name of an object that is the $Table$ObjectName of the output of either r_multiple_tests or r_posthoc_tests;
- userName: the name given to the object by the user;
- type: the output type, which can be either a subset of the original numeric dataset ("dataset", default value) or the test table ("results");
- conditions (only if posthoc tests are passed as inputs): the conditions of interest to filter either the test results or the significant variables;
- sign (used only if type == "results" and $Table$HasSign of the analysis passed as input is "yes"): sign of the variables (over-expressed, under-expressed or both) selected for the output. It is coded as "+", "-", or "both" (default value);
- variables (optional, used only if type == "results"): list of variables to select for the output;
- threshold (optional, used only if type == "results"): maximum adjusted p-value to keep (risk threshold). If NULL, the default value is the threshold specified for the input analysis.
14.6.2 Output
out_extract <- r_extract_diff("DiffPosthocAnalysis_1", threshold = 0.7,
                              conditions = "metastatic-blood derived normal",
                              type = "results")
out_extract_wrapp <- r_wrapp("r_extract_diff", "DiffPosthocAnalysis_1",
                             threshold = 0.7,
                             conditions = "metastatic-blood derived normal",
                             type = "results")
names(out_extract)
## [1] "Object" "Table"
The output of the r_extract_diff function is a list including:
14.6.2.1 Table elements
ObjectName and UserName, the names of the created object, for object_db and for the user, respectively:
out_extract$Table$ObjectName
## [1] "DiffExtractRes_1"
out_extract$Table$UserName
## [1] "DiffExtractRes_1"
14.6.2.2 Object elements
A data frame added to the user’s workspace:
head(out_extract$Object$OBJextract)
##             variable                       condition       pVal   pValCor sign
## 85   GSK3.alpha.beta metastatic-blood derived normal 0.02977477 0.6483371    1
## 193 Transglutaminase metastatic-blood derived normal 0.03324805 0.6483371    1
14.6.3 Interface specification
Combinations of options depend on the source and on the type of output:
- sign is needed only when the corresponding analysis (multiple test or posthoc test) returns $Table$HasSign equal to "yes". In this case, its value can be "both", "+" or "-". In the other cases, it is automatically "both";
- conditions is only valid for posthoc tests. By default (conditions = NULL), all pairs of conditions are returned;
- variables is only valid for outputs of type "results". Similarly to conditions, all variables are returned by default (variables = NULL);
- threshold is only valid for outputs of type "results". The advised default value is 1, which corresponds to returning all results.
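The way these options combine can be illustrated on a small results table. The helper `extract_results` and the values below are hypothetical: they mimic the columns of the extracted data frame shown earlier (variable, condition, pValCor, sign) but are not the r_extract_diff implementation.

```r
# Toy posthoc results (illustrative values only).
results <- data.frame(
  variable  = c("v1", "v1", "v2", "v3"),
  condition = c("B-A", "C-A", "B-A", "C-A"),
  pValCor   = c(0.01, 0.40, 0.03, 0.90),
  sign      = c(1, -1, -1, 1)
)

# Hypothetical helper: NULL means "no filter", threshold = 1 keeps everything.
extract_results <- function(res, conditions = NULL, variables = NULL,
                            sign = "both", threshold = 1) {
  keep <- res$pValCor <= threshold
  if (!is.null(conditions)) keep <- keep & res$condition %in% conditions
  if (!is.null(variables))  keep <- keep & res$variable %in% variables
  if (sign == "+") keep <- keep & res$sign > 0
  if (sign == "-") keep <- keep & res$sign < 0
  res[keep, , drop = FALSE]
}

all_res   <- extract_results(results)  # defaults: all rows returned
ba_signif <- extract_results(results, conditions = "B-A", threshold = 0.05)
over_expr <- extract_results(results, sign = "+", threshold = 0.05)
```

With the default values, the full table is returned, which matches the advised defaults listed above.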