Chapter 13 MFA: Multiple Factor Analysis
This file describes the different steps of an MFA workflow and comments on the inputs and outputs of the functions needed to perform it. MFA deals with a set of individuals described by several groups of variables. A group can be defined as a dataset. This allows us to analyze several datasets simultaneously and thus highlight the existing relationships between individuals, variables and datasets.
The following lines can safely be ignored. They are used to set up my working directory for this document.
13.1 Used datasets
The workflow will be illustrated on a combined dataset made of:
- dataset clinical restricted to 6 variables (called clinical2): 3 categorical and 3 numeric;
- dataset clinical restricted to 12 variables (called clinical3): 9 categorical and 3 numeric;
- dataset mrna restricted to 100 variables.
Datasets were reduced to keep computation time low. clinical2 and clinical3
are built so that they contain at least 3 numeric and 3 categorical variables
(to ensure that more than 2 variables remain no matter which variable type(s)
the user chooses to keep).
These variables have been chosen such that they are neither zero-variance
variables, nor categorical variables with a single value, nor categorical
variables with as many values as observations. This way, we know that no
irrelevant variable will be removed from clinical2 and clinical3 during the
data pre-processing at the beginning of r_mfa, which is more convenient for
demonstration.
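The three exclusion criteria listed above can be illustrated with a small base R sketch. The helper name is_irrelevant and the toy data frame are hypothetical and only mirror the criteria as stated in the text; this is not the pre-processing code of r_mfa.

```r
# Hypothetical helper: TRUE when a column matches one of the three
# criteria named in the text (not the actual r_mfa pre-processing).
is_irrelevant <- function(x) {
  x <- x[!is.na(x)]
  if (is.numeric(x)) {
    # Zero variance: at most one distinct value.
    return(length(unique(x)) <= 1)
  }
  n_values <- length(unique(as.character(x)))
  # Single value, or as many values as observations.
  n_values <= 1 || n_values == length(x)
}

# Made-up toy data, for illustration only.
toy <- data.frame(
  age      = c(61, 54, 70, 48),
  constant = c(1, 1, 1, 1),
  gender   = c("female", "male", "female", "female"),
  id       = c("p1", "p2", "p3", "p4"),
  stringsAsFactors = FALSE
)

keep <- !vapply(toy, is_irrelevant, logical(1))
names(toy)[keep]
```

In this toy example only age and gender pass the filter: constant has zero variance and id has one distinct value per observation.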
set.seed(1)
input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)
input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)
mrna <- sample(mrna, size = 100)
input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", header = TRUE, sep = " ",
         quote = '\"', dec = ".", row.names = 1, assign = TRUE)
clinical2 <- clinical[, c("patient.age_at_initial_pathologic_diagnosis",
                          "patient.day_of_form_completion",
                          "patient.month_of_form_completion",
                          "patient.clinical_cqcf.consent_or_death_status",
                          "patient.clinical_cqcf.histological_type",
                          "patient.gender")]
clinical3 <- clinical[, c("patient.month_of_form_completion",
                          "patient.year_of_form_completion",
                          "patient.samples.sample.2.portions.portion.analytes.analyte.aliquots.aliquot.concentration",
                          "patient.stage_event.tnm_categories.pathologic_categories.pathologic_m",
                          "patient.biospecimen_cqcf.path_confirm_report_attached",
                          "patient.stage_event.tnm_categories.pathologic_categories.pathologic_t",
                          "patient.icd_10",
                          "patient.biospecimen_cqcf.path_confirm_tumor_nuclei_metrics",
                          "patient.icd_o_3_site",
                          "patient.samples.sample.2.is_ffpe",
                          "patient.vital_status",
                          "patient.bcr_canonical_check.bcr_patient_canonical_status")]
13.2 Running an MFA
13.2.1 Inputs
The inputs of the r_mfa() function are:
- object: a combined object
- ncp: a number of MFA components
- user.types: a character sequence defining each group’s type
- Combined object
The user chooses the menu “Integrate datasets with MFA” in the section “Integrate”
and selects datasets clinical2, clinical3 and mrna.
Input datasets are combined using the function r_combine_datasets()
documented here.
Common observations and descriptive statistics are retrieved from the combined
object.
out_combdt <- r_wrapp("r_combine_datasets", c("clinical2", "clinical3", "mrna"))
- Number of PC
The number of kept MFA components can be left unspecified (in which case it
is set to the maximum possible number), or can be any integer strictly larger
than 1 (if it is larger than the maximum possible number of MFA components,
it is automatically corrected by r_mfa).
The maximum number of components ncp_max is defined as follows:
- If the data do not contain any missing value, then ncp_max is set to the maximum number of components defined in FactoMineR::PCA, i.e. the minimum of the total number of variables and the number of observations minus 1.
- If the data contain missing values, then ncp_max is set to the minimum of the total number of variables minus 2 and the number of observations minus 3 (to be consistent with the function imputeMFA from the package missMDA, which is called at some point in r_mfa).
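The two rules for ncp_max can be written down as a short sketch. The helper names ncp_max() and resolve_ncp() are hypothetical and only restate the rules given in the text; they are not taken from the r_mfa source. The sizes used in the calls (988 individuals, 118 variables) are those reported for the example analysis in this chapter.

```r
# Hypothetical helper restating the two rules given in the text.
ncp_max <- function(n_obs, n_vars, has_missing) {
  if (has_missing) {
    min(n_vars - 2, n_obs - 3)
  } else {
    min(n_vars, n_obs - 1)
  }
}

# Hypothetical helper: use the maximum when ncp is not given,
# and cap ncp when it exceeds the maximum.
resolve_ncp <- function(ncp = NULL, n_obs, n_vars, has_missing = FALSE) {
  limit <- ncp_max(n_obs, n_vars, has_missing)
  if (is.null(ncp)) {
    return(limit)
  }
  min(ncp, limit)
}

resolve_ncp(5, n_obs = 988, n_vars = 118)
resolve_ncp(NULL, n_obs = 988, n_vars = 118, has_missing = TRUE)
```

The first call keeps the requested 5 components; the second falls back to the maximum allowed when missing values are present, here 116.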
- Sequence of types
Within a dataset, all variables must be of the same type, called the
majority type. Consequently, each majority type has to be identified before
running an MFA. The parameter user.types enables the user to specify the
majority type (numeric or categorical) they want within a given dataset. It only
applies to mixed groups with both numeric and categorical variables, which means
that a group whose variables are all numeric or all categorical can’t be
reassigned to another type. Further, the user may decide to split a dataset
having both numeric and categorical variables, or let the function decide
according to a calculated split ratio. The available options are described below.
Categorical | Numeric | Mixed | Automatic |
---|---|---|---|
c | n | m | a |
- c stands for categorical: if any numeric variable is found within the dataset, it will be removed from it.
- n stands for numeric: if any categorical variable is found within the dataset, it will be removed from it.
- m deals with mixed datasets including both categorical and numeric variables. They are split into two groups, one numeric and the other categorical.
- a, for automatic selection, applies the following rule: if 75% of the variables in the dataset are of one type (numeric or categorical), then the remaining 25% are removed from it. Otherwise, the dataset is split and thus behaves like m.
Note: If no types are defined for the groups, they are automatically detected by the function in the following way:
- a if the dataset contains both types of variables
- n or c if the dataset contains numeric variables only or categorical variables only, respectively.
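The type rules above can be summarised in a minimal sketch. The function resolve_group_type() is hypothetical and is not part of the package. The text does not say whether the 75% bound is inclusive, so that choice is exposed as an argument. Rejecting m for a single-type group is a further assumption.

```r
# Hypothetical helper: which treatment a group ends up with
# ("n", "c", or "m" for a split into two groups).
resolve_group_type <- function(n_numeric, n_categorical, type = NULL,
                               threshold = 0.75, inclusive = TRUE) {
  if (!is.null(type)) {
    stopifnot(type %in% c("c", "n", "m", "a"))
  }
  if (n_numeric == 0 || n_categorical == 0) {
    only <- if (n_categorical == 0) "n" else "c"
    # Assumption: a single-type group accepts only its own type or "a".
    if (!is.null(type) && !(type %in% c(only, "a"))) {
      stop("a group with a single variable type cannot be reassigned")
    }
    return(only)
  }
  if (is.null(type)) {
    type <- "a"  # default for mixed groups, as stated in the note
  }
  if (type != "a") {
    return(type)
  }
  share_numeric <- n_numeric / (n_numeric + n_categorical)
  reaches <- function(share) {
    if (inclusive) share >= threshold else share > threshold
  }
  if (reaches(share_numeric)) return("n")
  if (reaches(1 - share_numeric)) return("c")
  "m"
}

resolve_group_type(3, 3)    # a 50/50 group such as clinical2
resolve_group_type(3, 9)    # a 25/75 group such as clinical3
resolve_group_type(100, 0)  # a purely numeric group such as mrna
```

With the inclusive bound, a group with exactly 75% categorical variables is reduced to its categorical variables; with inclusive = FALSE the same group would be split.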
13.2.2 Outputs
out_mfa <- r_mfa("combinedDF_1", ncp = 5)
attributes(out_mfa)
## $names
## [1] "Graphical" "Table" "Object"
The output of the r_mfa function is a list including:
- Object: MFA object
- Graphical: screegraph and cumulative percentage plot
- Table: table containing the name of the output and the table of inertia
The outputs to be displayed are a screegraph and a cumulative percentage plot.
When running out_mfa <- r_mfa("combinedDF_1"), r_wrapp created an object, not
directly accessible by the user, in the working directory, whose name is
MFAobj_1.
The object name is included in the entry $Table$ObjectName of the output and
returned in the output as
jsonview::json_tree_view(jsonlite::toJSON(out_mfa$Table$ObjectName,
                                          pretty = TRUE), height='100px')
Note: The successive command lines
out_combdt <- r_wrapp("r_combine_datasets", c("clinical2", "clinical3", "mrna"))
and out_mfa <- r_wrapp("r_mfa", "combinedDF_1", ncp = 5)
are equivalent to this simple call, which is what has to be done on the server
side:
out_mfa_wrapped <- r_wrapp("r_mfa",
                           datasetName = "combinedDF_1",
                           ncp = 5)
The percentage of explained variance by component and the cumulative percentage
of explained variance are included in the entry $Table$Inertia of the output and
returned in the output as
jsonview::json_tree_view(jsonlite::toJSON(out_mfa$Table$Inertia,
                                          pretty = TRUE), scroll=TRUE)
13.2.3 No types assigned
In the example below, we do not specify any type and set the number of MFA components to five.
out_mfa <- r_wrapp("r_mfa", "combinedDF_1", ncp = 5)
Recall that clinical2 is balanced 50/50, with 3 numeric and 3 categorical variables, and that clinical3 contains 25% numeric and 75% categorical variables (3 and 9). Therefore, the default behavior of the MFA function is to split clinical2 into two groups, one per data type, and to keep only the categorical variables in clinical3.
- MFA object
MFAobj_1
## **Results of the Multiple Factor Analysis (MFA)**
## The analysis was performed on 988 individuals, described by 118 variables
## *Results are available in the following objects :
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$separate.analyses" "separate analyses for each group of variables"
## 3 "$group" "results for all the groups"
## 4 "$partial.axes" "results for the partial axes"
## 5 "$inertia.ratio" "inertia ratio"
## 6 "$ind" "results for the individuals"
## 7 "$quanti.var" "results for the quantitative variables"
## 8 "$quali.var" "results for the categorical variables"
## 9 "$summary.quanti" "summary for the quantitative variables"
## 10 "$summary.quali" "summary for the categorical variables"
## 11 "$global.pca" "results for the global PCA"
- Screegraph
MFAobj_1$Graphical$Screegraph
- Cumulative Percentage Plot
MFAobj_1$Graphical$CumulativePercentagePlot
13.2.4 Types assigned
In this section, we present three cases in which the types are explicitly assigned by the user.
- Automatic - Numeric - Numeric
Clinical2: 50% numeric variables and 50% categorical variables are kept.
Clinical3: 25% numeric variables are kept and 75% categorical variables are removed.
out_mfa5ann <- r_mfa("combinedDF_1", ncp = 4, user.types = c("ann"))
## clinical2n clinical2c clinical3 mrna
## 1 quanti quali quanti quanti
## 2 3 3 3 100
out_mfa5ann$Graphical$Screegraph
out_mfa5ann$Graphical$CumulativePercentagePlot
- Categorical - Automatic - Numeric
Clinical2: 50% categorical variables are kept and 50% numeric variables are dropped.
Clinical3: 25% numeric variables are removed and 75% categorical are kept.
out_mfa5can <- r_mfa("combinedDF_1", ncp = 4, user.types = c("can"))
## clinical2 clinical3n clinical3c mrna
## 1 quali quanti quali quanti
## 2 3 3 9 100
out_mfa5can$Graphical$Screegraph
out_mfa5can$Graphical$CumulativePercentagePlot
- Mixed - Mixed - Numeric
Clinical2: 50% numeric variables and 50% categorical variables are kept.
Clinical3: 25% numeric variables and 75% categorical variables are kept.
out_mfa5mmn <- r_mfa("combinedDF_1", ncp = 4, user.types = c("mmn"))
## clinical2n clinical2c clinical3n clinical3c mrna
## 1 quanti quali quanti quali quanti
## 2 3 3 3 9 100
out_mfa5mmn$Graphical$Screegraph
out_mfa5mmn$Graphical$CumulativePercentagePlot
Note: The last group, mrna, does not include both types but only 100 numeric
variables. Therefore, the user cannot change this group’s majority type to c;
otherwise, an error message is generated.
out_mfa5mmc <- r_mfa("combinedDF_1", ncp = 4, user.types = c("mmc"))
## Error in check_mfa_types(MFAdataset, all_types, user.types): The information provided for dataset types does not match with the initial types. Groups with only categorical variables must be defined by 'c' and groups with only numerical variables must be defined by 'n'.
If all types are set to a, then the function will behave as if no types
were assigned.
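As the examples above suggest, user.types carries one letter per dataset, in the order in which the datasets were combined. A small sketch of how such a string could be unpacked and checked follows; the helper parse_user_types() is hypothetical and is not the package's check_mfa_types().

```r
# Hypothetical helper: split a user.types string into one letter per group.
parse_user_types <- function(user.types, group_names) {
  types <- strsplit(user.types, "")[[1]]
  if (length(types) != length(group_names)) {
    stop("user.types must contain exactly one letter per group")
  }
  unknown <- setdiff(types, c("c", "n", "m", "a"))
  if (length(unknown) > 0) {
    stop("unknown type(s): ", paste(unknown, collapse = ", "))
  }
  setNames(types, group_names)
}

parse_user_types("mmn", c("clinical2", "clinical3", "mrna"))
```

The result is a named character vector mapping each dataset to its requested type, which makes it easy to validate each letter against the variables actually present in that dataset.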
In the following sections, we only use the object obtained from the first MFA with no assigned types.
13.2.5 More outputs
Entry $Object$MFAobj of the output also provides the following results:
- General information on each separate analysis, such as coordinates of individuals and variables, eigenvalues, and information on the singular value decomposition. These results are stored in the $Object$MFAobj$separate.analyses entry, completed by the name of the analysis of interest (for example, $mrna).
- Results for the groups, such as Lg and RV coefficients, coordinates, squared cosines, contributions, distance to the origin, and correlations between each group and each factor. This information differs from the information on each separate analysis in $Object$MFAobj$separate.analyses, since each group is now considered as an individual. It is available in the $Object$MFAobj$group entry, completed by the name of the result of interest (for example, $coord) and the name of the analysis of interest (for example, $clinical2n).
- Results of the MFA when the analysis is considered as a single weighted PCA, such as coordinates of individuals and variables in the space of principal components, eigenvalues and percentages of explained variance. This information is available in $Object$MFAobj$global.pca.
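To make the "single weighted PCA" view concrete, here is a minimal base R sketch of the usual MFA weighting for numeric groups: each group is standardised and divided by its first singular value, and a global PCA is then run on the concatenated table. The function mfa_weighted_pca() and the simulated data are hypothetical; this illustrates the general technique and is not the code used by r_mfa or FactoMineR. Categorical groups are not handled, and coordinates are only defined up to a constant that depends on the row-weight convention.

```r
# Hypothetical sketch: MFA of numeric groups seen as one weighted PCA.
mfa_weighted_pca <- function(groups) {
  # groups: list of numeric matrices describing the same individuals (rows)
  weighted <- lapply(groups, function(g) {
    g <- scale(g)      # centre and standardise each variable
    g / svd(g)$d[1]    # the group's first singular value becomes 1
  })
  X <- do.call(cbind, weighted)
  s <- svd(X)
  eig <- s$d^2
  pct <- 100 * eig / sum(eig)
  list(
    eig = eig,                # eigenvalues of the global analysis
    pct = pct,                # percentage of explained variance
    cum_pct = cumsum(pct),    # cumulative percentage
    ind_coord = X %*% s$v,    # individuals on the global components
    weighted = weighted
  )
}

# Simulated data, for illustration only.
set.seed(1)
n <- 20
groups <- list(
  g1 = matrix(rnorm(n * 3), n, 3),
  g2 = matrix(rnorm(n * 5), n, 5)
)
res <- mfa_weighted_pca(groups)
round(res$cum_pct, 1)
```

Because every group has a first singular value of 1 after weighting, no single dataset can dominate the first global component merely because it has more variables; the first global eigenvalue lies between 1 and the number of groups.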
13.3 Information on how r_mfa works to perform different MFA on the same dataset
If an MFA has already been performed on a given dataset with the same number
ncp, r_mfa loads the previous analysis and returns only the plots (and the
object name as described above) but not the object.
If an MFA has already been performed on a given dataset and a smaller number
ncp is now requested, r_mfa loads the previous analysis, extracts the proper
number of MFA components from it and returns the plots and a cut object
(with its name).
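The reuse logic described above can be summarised as a small decision function. The name plan_mfa() and the returned labels are hypothetical; the first branch (no previous analysis, or more components requested than previously computed) is not described in the text and is an assumption.

```r
# Hypothetical sketch of the reuse decision; previous_ncp is NULL when
# no analysis has been stored for the dataset yet.
plan_mfa <- function(previous_ncp, ncp) {
  if (is.null(previous_ncp) || ncp > previous_ncp) {
    "run"       # assumption: a new analysis is computed
  } else if (ncp == previous_ncp) {
    "reload"    # plots and object name only, no object returned
  } else {
    "truncate"  # plots plus an object cut to the first ncp components
  }
}

plan_mfa(NULL, 5)
plan_mfa(5, 5)
plan_mfa(5, 3)
```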
13.4 Display individuals
The functions r_plotind() and r_plotvar() previously used for the PCA workflow
are also called in this analysis to display individuals.
13.4.1 Individuals
The output of r_mfa contains a dataset that corresponds to individual
coordinates in the space of MFA components. The number of columns in this
dataset is the number of MFA components. Individuals are displayed using the
function r_multivariate_dotplot, in which the first two datasets are
necessarily out_mfa$Object$MFAobj$ind$coord and the first two variables are of
the form Dim.X where X is a number between 1 and ncp. The simplest version
will be:
out_plotind <- r_plotind("MFAobj_1")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
More complicated outputs can be obtained using at most 3 other variables (to be passed to the function in that order):
- colour (can be numeric, factor or character; in the last case, the variable is silently converted into a factor)
- shape (necessarily factor, with not too many values)
- size (necessarily numeric)
out_plotind <- r_plotind("MFAobj_1", axis1 = 1, axis2 = 3, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotind("MFAobj_1", axis1 = 1, axis2 = 2,
                         datasetcolor="clinical",
                         varcolor="patient.vital_status",
                         datasetshape="clinical",
                         varshape="patient.gender",
                         datasetsize="clinical",
                         varsize="patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
13.4.2 Partial individuals of selected observations
The user selects a set of individuals to display together with their partial individuals. This makes it possible to study each observation as seen by each group, along with its barycenter.
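The relation between an individual and its partial individuals can be checked numerically. The sketch below uses simulated numeric groups and the standard MFA construction (each group standardised and divided by its first singular value; a partial individual is the projection of one group's block multiplied by the number of groups). All object names are hypothetical, and this is not the code behind r_plotind.

```r
set.seed(30)
n <- 10
# Simulated data, for illustration only.
groups <- list(
  a = matrix(rnorm(n * 3), n, 3),
  b = matrix(rnorm(n * 4), n, 4),
  c = matrix(rnorm(n * 2), n, 2)
)
J <- length(groups)

# MFA weighting: standardise, then divide by the first singular value.
weighted <- lapply(groups, function(g) {
  g <- scale(g)
  g / svd(g)$d[1]
})
X <- do.call(cbind, weighted)
V <- svd(X)$v

# Global coordinates of the individuals on the first two components.
global <- X %*% V[, 1:2]

# Column range of each group inside the concatenated table.
ends <- cumsum(vapply(weighted, ncol, integer(1)))
starts <- c(1, head(ends, -1) + 1)

# Partial individuals: one cloud of points per group.
partial <- lapply(seq_len(J), function(j) {
  cols <- starts[j]:ends[j]
  J * weighted[[j]] %*% V[cols, 1:2, drop = FALSE]
})

# Each individual is the barycenter of its J partial points.
barycenter <- Reduce(`+`, partial) / J
all.equal(barycenter, global, check.attributes = FALSE)
```

The final comparison confirms that averaging the partial points of an individual over the groups gives back its global position, which is why partial individuals are drawn around their barycenter.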
- Simple plot object
set.seed(30)
random_ind <- sample(combinedDF_1, size = 3)
out_plotlind <- r_plotind("MFAobj_1", partialInd = random_ind)
plotly::ggplotly(out_plotlind$Graphical$PlotInd)
- Complex plot object
out_plotind <- r_plotind("MFAobj_1",
                         datasetsize = "proteins", varsize = "Bak",
                         datasetshape = "clinical2", varshape = "patient.gender",
                         partialInd = random_ind)
plotly::ggplotly(out_plotind$Graphical$PlotInd)
13.4.3 Partial individuals of a selected group
The user selects a specific group and all partial individuals belonging to that group will be shown.
- Simple plot object
out_plotind <- r_plotind("MFAobj_1", partialIndGroup = "mrna")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
- Complex plot object
out_plotind <- r_plotind("MFAobj_1",
                         datasetsize = "proteins", varsize = "Bak",
                         datasetshape = "clinical2", varshape = "patient.gender",
                         partialIndGroup = "mrna")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
13.5 Display variables
Numerical variables are displayed on the correlation circle, whereas categorical variables (or rather categories themselves) are displayed on PCs. Both can be filtered using a correlation threshold.
13.6 Display groups
The groups plot summarizes relationships between the different groups by calling
the new function r_plotgroup().
out_plotgroup <- r_plotgroup("MFAobj_1")
The quality of representation of each group is given in the following graph:
plotly::ggplotly(out_plotgroup$Graphical$PlotGroup)
We now plot relationships between the MFA components and the components obtained from MFA in each group.
plotly::ggplotly(out_plotgroup$Graphical$PlotAxes)
13.7 Extract a new dataset
A new dataset can be extracted from the MFA analysis either
- directly using the first MFA components as new variables
out_extract <- r_extract_obj("MFAobj_1", criterion = "axes", ncp = 3)
It is contained in out_extract$Object$OBJextract and added to the workspace
under the name MFAaxes_1 by r_wrapp.
- using MFA components containing most of the explained variance as new variables. The number of selected MFA components depends on the cumulative percentage of explained variance we want to extract
out_extract_2 <- r_extract_obj("MFAobj_1", criterion = "axes", cumVar = 10)
It is contained in out_extract_2$Object$OBJextract and added to the workspace
under the name MFAaxes_1 by r_wrapp.
- or using the variables the most correlated to the first MFA components
out_extract_3 <- r_extract_obj("MFAobj_1", criterion = "correlation",
                               ncp = 5, threshold.cor = 0.8)
It is contained in out_extract_3$Object$OBJextract and added to the workspace
under the name MFAsel_1 by r_wrapp.
I guess that this should be handled by a button extracting the first ncp MFA
components into a new dataset. Also, the dataset type is automatically set to
- dataset for both the first and second cases
- and to the same type as the original dataset for the third case (here, dataset)
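The cumulative-variance and correlation criteria used for extraction can be illustrated with two short helpers. Both functions are hypothetical and rest on assumptions not confirmed by the text: cumVar is read as "the smallest number of components whose cumulative percentage reaches cumVar", and a variable is considered selected when its absolute correlation with at least one of the kept components reaches threshold.cor. The numbers are made up for illustration.

```r
# Hypothetical helper: number of components needed to reach cumVar percent.
n_components_for <- function(pct, cumVar) {
  reached <- which(cumsum(pct) >= cumVar)
  if (length(reached) == 0) length(pct) else reached[1]
}

# Hypothetical helper: names of the variables correlated with at least
# one component, in absolute value, at or above threshold.cor.
select_by_correlation <- function(data, coord, threshold.cor) {
  r <- abs(cor(data, coord))
  colnames(data)[apply(r, 1, max) >= threshold.cor]
}

# Made-up percentages of explained variance.
pct <- c(6.2, 4.1, 3.0, 2.5)
n_components_for(pct, cumVar = 10)

# Made-up component coordinates and variables.
coord <- cbind(Dim.1 = c(1, 2, 3, 4, 5, 6),
               Dim.2 = c(1, -1, 1, -1, 1, -1))
data <- cbind(v1 = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0),
              v2 = c(2, 1, 2, 1, 2, 1),
              v3 = c(3, 1, 4, 1, 5, 9))
select_by_correlation(data, coord, threshold.cor = 0.8)
```

In this toy example two components are needed to reach 10% of explained variance, and v1 and v2 are kept because each is strongly correlated with one of the two components, whereas v3 is not.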