Chapter 13 MFA: Multiple Factor Analysis

This file describes the steps of a MFA workflow and comments on the inputs and outputs of the functions needed to perform it. MFA deals with a set of individuals described by several groups of variables. A group can be defined as a dataset. This allows us to analyze several datasets simultaneously and thus highlight the relationships between individuals, variables and datasets.

The following lines can safely be ignored. They are used to set up my working directory for this document.

13.1 Used datasets

The workflow will be illustrated on a combined dataset made of:

  • dataset clinical restricted to 6 variables (called clinical2): 3 categorical and 3 numeric;

  • dataset clinical restricted to 12 variables (called clinical3): 9 categorical and 3 numeric;

  • dataset mrna restricted to 100 variables.

Datasets were reduced for the sake of computation time. clinical2 and clinical3 are built so that they contain at least 3 numeric and 3 categorical variables (to ensure that more than 2 variables remain whatever variable type(s) the user chooses to keep). These variables have been chosen such that they are neither zero-variance variables, nor categorical variables with a single value, nor categorical variables with as many values as observations. This way, we know that no variable will be removed as irrelevant from clinical2 and clinical3 during the data pre-processing at the beginning of r_mfa, which is more convenient for demonstration.

set.seed(1)

input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)

input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)
mrna <- sample(mrna, size = 100)

input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", header = TRUE, sep = " ",
         quote = '\"', dec = ".", row.names = 1, assign = TRUE)

clinical2 <- clinical[, c("patient.age_at_initial_pathologic_diagnosis",
                          "patient.day_of_form_completion",
                          "patient.month_of_form_completion",
                          "patient.clinical_cqcf.consent_or_death_status",
                          "patient.clinical_cqcf.histological_type",
                          "patient.gender")]

clinical3 <- clinical[, c("patient.month_of_form_completion",
                          "patient.year_of_form_completion",
                          "patient.samples.sample.2.portions.portion.analytes.analyte.aliquots.aliquot.concentration",
                          "patient.stage_event.tnm_categories.pathologic_categories.pathologic_m" ,
                          "patient.biospecimen_cqcf.path_confirm_report_attached",
                          "patient.stage_event.tnm_categories.pathologic_categories.pathologic_t",
                          "patient.icd_10",
                          "patient.biospecimen_cqcf.path_confirm_tumor_nuclei_metrics",
                          "patient.icd_o_3_site" ,
                          "patient.samples.sample.2.is_ffpe",
                          "patient.vital_status",
                          "patient.bcr_canonical_check.bcr_patient_canonical_status")]

13.2 Running a MFA

13.2.1 Inputs

The inputs of r_mfa() function are:
- object: a combined object
- ncp: a number of MFA components
- user.types: a character sequence defining each group’s type

  • Combined object

The user chooses the menu “Integrate datasets with MFA” in the section “Integrate” and selects datasets clinical2, clinical3 and mrna. Input datasets are combined using the function r_combine_datasets() documented here. Common observations and descriptive statistics are retrieved from the combined object.

out_combdt <- r_wrapp("r_combine_datasets", c("clinical2", "clinical3", "mrna"))
  • Number of PC

The number of MFA components to keep can be omitted (in which case it is set to the maximum possible number), or can be any integer strictly larger than 1 (if it is larger than the maximum possible number of MFA components, it is automatically corrected by r_mfa).

The maximum number of components ncp_max is defined as follows:

  • If the data do not contain any missing value, then ncp_max is set to the maximum number of components defined in FactoMineR::PCA, i.e. the minimum value between the total number of variables on one hand, and the number of observations minus 1 on the other hand.

  • If the data contain missing values, then ncp_max is set to the minimum value between the total number of variables minus 2 on one hand, and the number of observations minus 3 on the other hand (to be homogeneous with function imputeMFA from package missMDA, which is called at some point in r_mfa).
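The two rules above can be summarised in a few lines of R. This is a minimal sketch, not the code used inside r_mfa: the helper names ncp_max_sketch and clamp_ncp_sketch and their arguments are assumptions introduced for illustration only.

```r
# Hypothetical helper (not part of r_mfa): largest number of MFA components
# allowed by the two rules stated above.
ncp_max_sketch <- function(n_obs, n_var, has_missing = FALSE) {
  if (has_missing) {
    # bound used when the data contain missing values
    min(n_var - 2, n_obs - 3)
  } else {
    # bound used for complete data
    min(n_var, n_obs - 1)
  }
}

# Hypothetical helper: a requested ncp above the bound is lowered to the bound.
clamp_ncp_sketch <- function(ncp, n_obs, n_var, has_missing = FALSE) {
  min(ncp, ncp_max_sketch(n_obs, n_var, has_missing))
}

ncp_max_sketch(n_obs = 988, n_var = 118)                      # 118
ncp_max_sketch(n_obs = 988, n_var = 118, has_missing = TRUE)  # 116
clamp_ncp_sketch(200, n_obs = 988, n_var = 118)               # 118
```

The dimensions used in the calls (988 individuals, 118 variables) are those reported for the MFA object later in this chapter.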

  • Sequence of types

Within a dataset, variables must all be of the same type, called the majority type. Consequently, each majority type has to be identified before running a MFA. The parameter user.types lets the user specify which majority type, numeric or categorical, they want within a given dataset. It only applies to mixed groups containing both numeric and categorical variables, which means that a group whose variables are all numeric or all categorical cannot be reassigned to another type. Further, the user may decide to split a dataset having both numeric and categorical variables, or let the function decide according to a calculated split ratio. The available options are described below.

Categorical   Numeric   Mixed   Automatic
c             n         m       a
  • c stands for categorical: if any numeric variable is found within the dataset, it will be removed from it.
  • n stands for numeric: if any categorical variable is found within the dataset, it will be removed from it as well.
  • m deals with mixed datasets including both types categorical and numeric. They are split into two groups; one numeric and the other categorical.
  • a stands for automatic selection and applies the following rule: if 75% of the variables in the dataset are of one type (numeric or categorical), then the remaining 25% are removed from it. Otherwise, the dataset is split and thus behaves like m.

Note: If no types are specified for the groups, they are automatically detected by the function as follows:

  • a if the dataset contains both types of variables
  • n or c if the dataset contains respectively numeric variables only or categorical variables only.
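The options and the default detection described above can be sketched as follows. This is an illustration only and not the code of r_mfa: the function name resolve_group_sketch, the threshold argument and the toy data are assumptions. In particular, the sketch treats the 75% threshold as inclusive, which the text does not state, and it does not reproduce the check that rejects c for a purely numeric group.

```r
# Hypothetical helper: which variables of one dataset are kept, and in which
# group(s), for a requested type ("c", "n", "m", "a" or NULL for detection).
resolve_group_sketch <- function(df, type = NULL, threshold = 0.75) {
  is_num <- vapply(df, is.numeric, logical(1))
  # Default detection when no type is given
  if (is.null(type)) {
    if (all(is_num)) {
      type <- "n"
    } else if (!any(is_num)) {
      type <- "c"
    } else {
      type <- "a"
    }
  }
  # Automatic rule (assumption: the threshold is inclusive)
  if (type == "a") {
    prop_num <- mean(is_num)
    if (prop_num >= threshold) {
      type <- "n"
    } else if (1 - prop_num >= threshold) {
      type <- "c"
    } else {
      type <- "m"
    }
  }
  groups <- list(numeric = names(df)[is_num],
                 categorical = names(df)[!is_num])
  switch(type,
         n = groups["numeric"],
         c = groups["categorical"],
         m = groups)
}

# Toy dataset with 3 numeric and 3 categorical variables (made-up values)
toy <- data.frame(age = c(50, 61, 47), day = c(1, 15, 30), month = c(2, 7, 11),
                  status = c("a", "b", "a"), histo = c("x", "x", "y"),
                  gender = c("f", "m", "f"))

names(resolve_group_sketch(toy))       # "numeric" "categorical": split, like m
names(resolve_group_sketch(toy, "c"))  # "categorical": numeric variables dropped
```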

13.2.2 Outputs

out_mfa <- r_mfa("combinedDF_1", ncp = 5)
attributes(out_mfa)
## $names
## [1] "Graphical" "Table"     "Object"

The output of r_mfa function is a list including:
- Object: MFA object
- Graphical: screegraph and cumulative percentage plot
- Table: table containing the name of the output and the table of inertia

The outputs to be displayed are a screegraph and a cumulative percentage plot.

When running out_mfa <- r_mfa("combinedDF_1"), r_wrapp creates an object named MFAobj_1 in the working directory. This object is not directly accessible by the user.

The object name is stored in the entry $Table$ObjectName of the output and is returned as:

jsonview::json_tree_view(jsonlite::toJSON(out_mfa$Table$ObjectName,
                                      pretty = TRUE), height='100px')

Note: The successive command lines out_combdt <- r_wrapp("r_combine_datasets", c("clinical2", "clinical3", "mrna")) and out_mfa <- r_wrapp("r_mfa", "combinedDF_1", ncp = 5) are equivalent to this simple call, which is what has to be done on the server side:

out_mfa_wrapped <- r_wrapp("r_mfa", 
                           datasetName = "combinedDF_1",
                           ncp = 5)

The percentage of explained variance by component and the cumulative percentage of explained variance are stored in the entry $Table$Inertia of the output and are returned as:

jsonview::json_tree_view(jsonlite::toJSON(out_mfa$Table$Inertia,
                                         pretty = TRUE), scroll=TRUE)
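As a reminder of what these two quantities are, the sketch below derives them from a vector of eigenvalues. The eigenvalues are made-up numbers, not those of MFAobj_1, and the sketch assumes that the vector holds all the eigenvalues of the analysis, so that their sum is the total inertia. The column names of the data frame are also assumptions and need not match those of $Table$Inertia.

```r
# Illustrative eigenvalues (made-up numbers, not taken from MFAobj_1)
eig <- c(2.0, 1.0, 0.5, 0.3, 0.2)

# Share of the total inertia carried by each component, in percent
percentage <- 100 * eig / sum(eig)

inertia <- data.frame(
  component  = paste0("Dim.", seq_along(eig)),
  percentage = percentage,
  cumulative = cumsum(percentage)
)
inertia
```

The screegraph displays the percentage column and the cumulative percentage plot displays the cumulative column.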

13.2.3 No types assigned

In the example below, we do not specify any type and set the number of MFA components to five.

out_mfa <- r_wrapp("r_mfa", "combinedDF_1", ncp = 5)

Recall that clinical2 is split 50/50, with 3 numeric and 3 categorical variables, and that clinical3 is split 25/75, with 3 numeric and 9 categorical variables. Therefore, the natural behavior of the MFA function is to separate clinical2 into two groups, one per data type, and to keep only the categorical variables in clinical3.

  • MFA object
MFAobj_1
## **Results of the Multiple Factor Analysis (MFA)**
## The analysis was performed on 988 individuals, described by 118 variables
## *Results are available in the following objects :
## 
##    name                 description                                    
## 1  "$eig"               "eigenvalues"                                  
## 2  "$separate.analyses" "separate analyses for each group of variables"
## 3  "$group"             "results for all the groups"                   
## 4  "$partial.axes"      "results for the partial axes"                 
## 5  "$inertia.ratio"     "inertia ratio"                                
## 6  "$ind"               "results for the individuals"                  
## 7  "$quanti.var"        "results for the quantitative variables"       
## 8  "$quali.var"         "results for the categorical variables"        
## 9  "$summary.quanti"    "summary for the quantitative variables"       
## 10 "$summary.quali"     "summary for the categorical variables"        
## 11 "$global.pca"        "results for the global PCA"
  • Screegraph
out_mfa$Graphical$Screegraph

  • Cumulative Percentage Plot
out_mfa$Graphical$CumulativePercentagePlot

13.2.4 Types assigned

In this section, we present three cases in which the types are explicitly assigned by the user.

  • Automatic - Numeric - Numeric

clinical2: both the numeric variables (50%) and the categorical variables (50%) are kept.
clinical3: the numeric variables (25%) are kept and the categorical variables (75%) are removed.

out_mfa5ann <- r_mfa("combinedDF_1", ncp = 4, user.types = c("ann"))
##   clinical2n clinical2c clinical3   mrna
## 1     quanti      quali    quanti quanti
## 2          3          3         3    100
out_mfa5ann$Graphical$Screegraph

out_mfa5ann$Graphical$CumulativePercentagePlot

  • Categorical - Automatic - Numeric

clinical2: the categorical variables (50%) are kept and the numeric variables (50%) are dropped.
clinical3: the numeric variables (25%) are removed and the categorical variables (75%) are kept.

out_mfa5can <- r_mfa("combinedDF_1", ncp = 4, user.types = c("can"))
##   clinical2 clinical3n clinical3c   mrna
## 1     quali     quanti      quali quanti
## 2         3          3          9    100
out_mfa5can$Graphical$Screegraph

out_mfa5can$Graphical$CumulativePercentagePlot

  • Mixed - Mixed - Numeric

clinical2: both the numeric variables (50%) and the categorical variables (50%) are kept.
clinical3: both the numeric variables (25%) and the categorical variables (75%) are kept.

out_mfa5mmn <- r_mfa("combinedDF_1", ncp = 4, user.types = c("mmn"))
##   clinical2n clinical2c clinical3n clinical3c   mrna
## 1     quanti      quali     quanti      quali quanti
## 2          3          3          3          9    100
out_mfa5mmn$Graphical$Screegraph

out_mfa5mmn$Graphical$CumulativePercentagePlot

Note: The last group, mrna, does not include both types but only 100 numeric variables. Therefore, the user cannot change this group's majority type to c; doing so generates an error message.

out_mfa5mmc <- r_mfa("combinedDF_1", ncp = 4, user.types = c("mmc"))
## Error in check_mfa_types(MFAdataset, all_types, user.types): The information provided for dataset types does not match with the initial types. Groups with only categorical variables must be defined by 'c' and groups with only numerical variables must be defined by 'n'.

If all types are set to a, then the function will behave as if no types were assigned.

In the following sections, we only use the object obtained from the first MFA with no assigned types.

13.2.5 More outputs

Entry $Object$MFAobj of the output also provides the following results:

  • General information on each separate analysis, such as coordinates of individuals and variables, eigenvalues, and information on singular value decomposition. These results are stored in the $Object$MFAobj$separate.analyses entry, completed by the name of the analysis of interest (for example, $mrna).

  • Results for groups of individuals, such as Lg and RV coefficients, coordinates, square cosine, contributions, distance to the origin, correlations between each group and each factor. This information differs from the information on each separate analysis in $Object$MFAobj$separate.analyses, since each group is now considered as an individual. The mentioned information is available in the $Object$MFAobj$group entry, completed by the name of the result of interest (for example, $coord) and the name of the analysis of interest (for example, $clinical2n).

  • Results of MFA when the analysis is considered as a unique weighted PCA, such as coordinates of individuals and variables in the space of principal components, eigenvalues and percentages of explained variance. This information is available in $Object$MFAobj$global.pca.
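Among the group results listed above, the RV coefficient measures how similar the configurations of individuals are between two groups. The sketch below implements its textbook definition for two numeric tables observed on the same individuals. It is not the code used by FactoMineR to fill $group$RV, and the function name rv_sketch and the simulated data are assumptions introduced for illustration.

```r
# Hypothetical helper: RV coefficient between two numeric tables X and Y
# with the same rows (individuals).
rv_sketch <- function(X, Y) {
  # Center each column
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  # Matrices of scalar products between individuals
  Wx <- tcrossprod(X)  # X %*% t(X)
  Wy <- tcrossprod(Y)
  # trace(Wx %*% Wy) equals sum(Wx * Wy) because both matrices are symmetric
  sum(Wx * Wy) / sqrt(sum(Wx * Wx) * sum(Wy * Wy))
}

set.seed(1)
A <- matrix(rnorm(40), nrow = 10)  # 10 individuals, 4 variables
B <- matrix(rnorm(30), nrow = 10)  # 10 individuals, 3 variables

rv_sketch(A, A)      # 1 (up to rounding): a group is perfectly related to itself
rv_sketch(A, 3 * A)  # 1 (up to rounding): a global rescaling has no effect
rv_sketch(A, B)      # a value between 0 and 1
```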

13.3 Information on how r_mfa works to perform different MFA on the same dataset

If a MFA has already been performed on a given dataset with the same number ncp, r_mfa loads the previous analysis and returns only the plots (and the object name as described above) but not the object.

If a MFA has already been performed on a given dataset and a smaller number ncp is now requested, r_mfa loads the previous analysis, extracts the proper number of MFA components from it and returns the plots and a cut object (with its name).
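The reuse logic can be pictured with the following sketch. It is not the implementation of r_mfa: the function name reuse_mfa_sketch and its return values are hypothetical, the stored analysis is reduced to its matrix of individual coordinates, and the "recompute" branch for a larger ncp is an assumption, since that case is not described here. The sketch also assumes that "smaller" refers to the newly requested ncp compared with the stored one.

```r
# Hypothetical sketch: decide what to do with a stored analysis, represented
# here only by its matrix of coordinates (one column per MFA component).
reuse_mfa_sketch <- function(previous_coord, ncp) {
  prev_ncp <- ncol(previous_coord)
  if (ncp == prev_ncp) {
    # Same ncp: the stored analysis is reused as is, no object is returned
    list(action = "reuse", coord = NULL)
  } else if (ncp < prev_ncp) {
    # Fewer components requested: keep the first ncp columns only
    list(action = "cut",
         coord = previous_coord[, seq_len(ncp), drop = FALSE])
  } else {
    # Assumption: more components than stored requires a new computation
    list(action = "recompute", coord = NULL)
  }
}

# Made-up stored coordinates: 3 individuals, 5 components
prev <- matrix(seq_len(15), nrow = 3,
               dimnames = list(NULL, paste0("Dim.", 1:5)))

reuse_mfa_sketch(prev, ncp = 5)$action            # "reuse"
colnames(reuse_mfa_sketch(prev, ncp = 3)$coord)   # "Dim.1" "Dim.2" "Dim.3"
```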

13.4 Display individuals

The functions r_plotind() and r_plotvar(), previously used in the PCA workflow, are also called in this analysis to display individuals and variables.

13.4.1 Individuals

The output of r_mfa contains a dataset that corresponds to the coordinates of the individuals in the space of MFA components. The number of columns in this dataset is the number of MFA components. Individuals are displayed using the function r_multivariate_dotplot, in which the first two datasets are necessarily out_mfa$Object$MFAobj$ind$coord and the first two variables are of the form Dim.X, where X is a number between 1 and ncp. The simplest version is:

out_plotind <- r_plotind("MFAobj_1")
plotly::ggplotly(out_plotind$Graphical$PlotInd)

More complex outputs can be obtained using up to 3 additional variables (to be passed to the function in that order):

  • colour (can be numeric, factor or character; in the last case, the variable is silently converted to a factor)

  • shape (necessarily a factor, with not too many levels)

  • size (necessarily numeric)

out_plotind <- r_plotind("MFAobj_1", axis1 = 1, axis2 = 3, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotind("MFAobj_1", axis1 = 1, axis2 = 2, 
                         datasetcolor="clinical",
                         varcolor="patient.vital_status",
                         datasetshape="clinical", 
                         varshape="patient.gender", 
                         datasetsize="clinical",
                         varsize="patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
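The constraints on the colour, shape and size variables listed above can be illustrated with the sketch below, which assembles the data behind such a plot. It is not the code of r_plotind or r_multivariate_dotplot: the function name prepare_plot_data_sketch, its arguments and the toy coordinates are assumptions.

```r
# Hypothetical helper: build the data frame behind an individuals plot from a
# matrix of coordinates whose columns are named Dim.1, Dim.2, ...
prepare_plot_data_sketch <- function(coord, axis1 = 1, axis2 = 2,
                                     colour = NULL, shape = NULL, size = NULL) {
  dat <- data.frame(x = coord[, paste0("Dim.", axis1)],
                    y = coord[, paste0("Dim.", axis2)])
  if (!is.null(colour)) {
    # A character variable is silently converted to a factor
    if (is.character(colour)) colour <- factor(colour)
    dat$colour <- colour
  }
  if (!is.null(shape)) {
    stopifnot(is.factor(shape))   # shape must be a factor
    dat$shape <- shape
  }
  if (!is.null(size)) {
    stopifnot(is.numeric(size))   # size must be numeric
    dat$size <- size
  }
  dat
}

# Made-up coordinates for 3 individuals on 3 components
coord <- matrix(c(0.5, -1.2, 0.7,
                  0.1,  0.4, -0.5,
                  1.0, -0.3, -0.7),
                nrow = 3,
                dimnames = list(c("id1", "id2", "id3"), paste0("Dim.", 1:3)))

dat <- prepare_plot_data_sketch(coord, axis1 = 1, axis2 = 3,
                                colour = c("alive", "dead", "alive"),
                                size = c(50, 61, 47))
str(dat)
```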

13.4.2 Partial individuals of selected observations

The user selects a set of individuals to display together with their partial individuals. This makes it possible to study each observation as seen by each group, along with its barycenter.
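In MFA, an individual is located at the barycenter of its partial individuals, one per group. The sketch below illustrates this with made-up partial coordinates; the numbers and the group labels are for illustration only and are not taken from MFAobj_1.

```r
# Made-up partial coordinates of one individual on (Dim.1, Dim.2),
# one row per group
partial <- rbind(clinical2n = c(1.0,  0.2),
                 clinical2c = c(0.4, -0.6),
                 mrna       = c(1.6,  1.0))
colnames(partial) <- c("Dim.1", "Dim.2")

# The individual is the mean point of its partial individuals
barycenter <- colMeans(partial)
barycenter

# Distance of each partial individual to the barycenter: a large value means
# that the group sees this individual differently from the other groups
sqrt(rowSums(sweep(partial, 2, barycenter)^2))
```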

  • Simple plot object
set.seed(30)
random_ind <- sample(combinedDF_1, size = 3)
out_plotlind <- r_plotind("MFAobj_1", partialInd = random_ind)
plotly::ggplotly(out_plotlind$Graphical$PlotInd)
  • Complex plot object
out_plotind <- r_plotind("MFAobj_1", 
                         datasetsize = "proteins", varsize = "Bak", 
                         datasetshape = "clinical2", varshape = "patient.gender", 
                         partialInd = random_ind)
plotly::ggplotly(out_plotind$Graphical$PlotInd)

13.4.3 Partial individuals of a selected group

The user selects a specific group and all partial individuals belonging to that group will be shown.

  • Simple plot object
out_plotind <- r_plotind("MFAobj_1", partialIndGroup = "mrna")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
  • Complex plot object
out_plotind <- r_plotind("MFAobj_1", 
                         datasetsize = "proteins", varsize = "Bak",
                         datasetshape = "clinical2", varshape = "patient.gender",  
                         partialIndGroup = "mrna")
plotly::ggplotly(out_plotind$Graphical$PlotInd)

13.5 Display variables

Numerical variables are displayed on the correlation circle, whereas categorical variables (or rather categories themselves) are displayed on PCs. Both can be filtered using a correlation threshold.
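A correlation threshold of this kind can be sketched as follows. The rule used here (a variable is kept when its largest absolute correlation with the displayed components reaches the threshold) is an assumption, as are the function name filter_by_cor_sketch and the toy correlations; the rule actually applied by r_plotvar may differ.

```r
# Hypothetical helper: keep the variables that are sufficiently correlated
# with at least one of the displayed components.
filter_by_cor_sketch <- function(cor_mat, axes = c(1, 2), threshold = 0.5) {
  max_abs_cor <- apply(abs(cor_mat[, axes, drop = FALSE]), 1, max)
  cor_mat[max_abs_cor >= threshold, axes, drop = FALSE]
}

# Made-up correlations between 3 variables and 2 components
cor_mat <- rbind(geneA = c( 0.9,  0.1),
                 geneB = c( 0.2, -0.3),
                 geneC = c(-0.1, -0.8))
colnames(cor_mat) <- c("Dim.1", "Dim.2")

rownames(filter_by_cor_sketch(cor_mat, threshold = 0.5))  # "geneA" "geneC"
```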

13.5.1 Display numerical variables

out_plotvar <- r_plotvar("MFAobj_1")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)

Note: Only groups having numeric as majority type are displayed.

13.5.2 Display categories

plotly::ggplotly(out_plotvar$Graphical$PlotCatVar)

13.6 Display groups

The groups plot summarizes the relationships between the different groups, using the new function r_plotgroup().

out_plotgroup <- r_plotgroup("MFAobj_1")

The quality of representation of each group is given in the following graph:

plotly::ggplotly(out_plotgroup$Graphical$PlotGroup)

We now plot relationships between the MFA components and the components obtained from MFA in each group.

plotly::ggplotly(out_plotgroup$Graphical$PlotAxes)

13.7 Extract a new dataset

A new dataset can be extracted from the MFA analysis either

  • directly using the first MFA components as new variables
out_extract <- r_extract_obj("MFAobj_1", criterion = "axes", ncp = 3)

It is contained in out_extract$Object$OBJextract and added to the workspace under the name MFAaxes_1 by r_wrapp.

  • using MFA components containing most of the explained variance as new variables. The number of selected MFA components depends on the cumulative percentage of explained variance we want to extract
out_extract_2 <- r_extract_obj("MFAobj_1", criterion = "axes", cumVar = 10)

It is contained in out_extract_2$Object$OBJextract and added to the workspace under the name MFAaxes_1 by r_wrapp.

  • or using the variables the most correlated to the first MFA components
out_extract_3 <- r_extract_obj("MFAobj_1", criterion = "correlation", 
                               ncp = 5, threshold.cor = 0.8)

It is contained in out_extract_3$Object$OBJextract and added to the workspace under the name MFAsel_1 by r_wrapp.
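For the second criterion, the number of components to extract follows from the cumulative percentage of explained variance. The sketch below shows one way to derive it, assuming that cumVar is expressed in percent and that the smallest number of components reaching cumVar is selected. The function name n_components_for_sketch and the eigenvalues are made up for illustration, and this is not the code of r_extract_obj.

```r
# Hypothetical helper: smallest number of components whose cumulative
# percentage of explained variance reaches cumVar (in percent).
n_components_for_sketch <- function(eig, cumVar) {
  cumulative <- cumsum(100 * eig / sum(eig))
  which(cumulative >= cumVar)[1]
}

# Illustrative eigenvalues (made-up numbers); cumulative percentages are
# 50, 75, 87.5, 95 and 100
eig <- c(2.0, 1.0, 0.5, 0.3, 0.2)

n_components_for_sketch(eig, cumVar = 10)  # 1
n_components_for_sketch(eig, cumVar = 80)  # 3
```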

I guess that this should be handled by a button extracting the first ncp MFA components into a new dataset. Also, the dataset type is automatically set to

  • for the first and second cases: dataset and dataset, respectively
  • and to the same type as the original dataset for the third case (here, dataset)