Chapter 9 PCA: Principal Component Analysis

This file describes the steps of a PCA workflow and comments on the inputs and outputs of the functions needed to perform it.

9.1 Used datasets

The workflow is illustrated on the protein dataset (main dataset) and the clinical dataset (used as a metadata dataset, which provides additional information to display alongside the results obtained on the main dataset).

proteins <- read.table("../forge/backend/R/data/protein.csv", sep = " ", 
                       quote = '\"', dec = ".", row.names = 1)
clinical <- read.table("../forge/backend/R/data/clinical.csv", sep = ",", 
                       quote = '\"', dec = ".", row.names = 1)

9.2 Running a PCA

9.2.1 First call

The user chooses the menu “PCA” and selects the dataset protein (one of the datasets with more than 3 numerical variables). Here, the number of kept PCs is set to NULL, which means that the maximum possible number of PCs is kept. It can otherwise be any integer greater than 1; if it is larger than the maximum possible number of PCs, it is automatically corrected by r_famd.

out_pca <- r_famd("proteins", pca=TRUE)

This function produces two plots, both displayed to the user:

plotly::ggplotly(out_pca$Graphical$Screegraph)
plotly::ggplotly(out_pca$Graphical$CumulativePercentagePlot)

In addition, r_wrapp creates an object named PCAobj_1 in the working directory; this object is not directly accessible by the user.

The object name is included in the entry $Table$ObjectName of the output, which can be inspected as follows:

jsonview::json_tree_view(jsonlite::toJSON(out_pca$Table$ObjectName, pretty = TRUE),
                         height = '100px')

Note: The combination of the two previous command lines is equivalent to this simple call, which is what has to be done on the server side:

out_pca_wrapped <- r_wrapp("r_famd", datasetName = "proteins", pca=TRUE)

The percentage of variance explained by each component and the cumulative percentage of explained variance are included in the entry $Table$Inertia of the output, which can be inspected as follows:

jsonview::json_tree_view(jsonlite::toJSON(out_pca$Table$Inertia, pretty = TRUE), 
                         scroll = TRUE)
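
To make the content of the inertia table concrete, here is a minimal sketch of how the two percentages can be computed with base R. It uses prcomp on the built-in USArrests data rather than r_famd on the protein dataset, and the column names (component, percentage, cumulative) are assumptions chosen for illustration; they are not necessarily those of $Table$Inertia.

```r
# Illustrative data: built-in USArrests (4 numeric variables),
# not the protein dataset used in this chapter.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variance carried by each component (eigenvalues of the correlation matrix).
eig <- pca$sdev^2

# Hypothetical layout of an inertia table: one row per component.
inertia <- data.frame(
  component  = paste0("Dim.", seq_along(eig)),
  percentage = 100 * eig / sum(eig),
  cumulative = 100 * cumsum(eig) / sum(eig)
)
print(inertia)
```

The last value of the cumulative column is always 100, since all components together carry the total variance.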

9.2.2 How r_famd handles several PCAs on the same dataset

If a PCA has already been performed on a given dataset with the same number ncp, r_famd loads the previous analysis and returns only the plots (and the object name as described above) but not the object.

If a PCA has already been performed on a given dataset and a smaller number ncp is now requested, r_famd loads the previous analysis, extracts the required number of PCs from it, and returns the plots and a truncated object (with its name).
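
The reuse logic described above can be sketched as follows. This is not the implementation of r_famd: the function cached_pca, the environment pca_cache and the returned fields are hypothetical names introduced for illustration, prcomp stands in for the actual analysis, and the sketch treats the "same ncp" and "smaller ncp" cases in the same way.

```r
# Hypothetical cache: one entry per dataset name (not part of the package).
pca_cache <- new.env()

cached_pca <- function(dataset_name, data, ncp = NULL) {
  max_ncp <- min(nrow(data) - 1, ncol(data))
  # NULL means "keep everything"; values that are too large are corrected.
  ncp <- if (is.null(ncp)) max_ncp else min(ncp, max_ncp)

  previous <- get0(dataset_name, envir = pca_cache, inherits = FALSE)
  if (!is.null(previous) && previous$ncp >= ncp) {
    # Reuse the stored analysis: keep only the first ncp components.
    return(list(ncp = ncp, reused = TRUE,
                coord = previous$coord[, seq_len(ncp), drop = FALSE]))
  }

  # No usable previous analysis: compute and store a new one.
  fit <- prcomp(data, center = TRUE, scale. = TRUE)
  coord <- fit$x[, seq_len(ncp), drop = FALSE]
  assign(dataset_name, list(ncp = ncp, coord = coord), envir = pca_cache)
  list(ncp = ncp, reused = FALSE, coord = coord)
}

first  <- cached_pca("arrests", USArrests, ncp = 10)  # ncp corrected, PCA computed
second <- cached_pca("arrests", USArrests, ncp = 2)   # truncated from the cache
```

In this sketch the second call never recomputes the PCA; it only subsets the stored coordinates.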

9.3 Display individuals

The output of r_famd contains two datasets that correspond to individual and variable coordinates. The number of columns in these datasets is the number of PCs. Individuals are displayed using the function r_multivariate_dotplot, in which the first two datasets are necessarily out_pca$Object$PCAobj$ind$coord and the first two variables are of the form Dim.X, where X is a number between 1 and ncp. The simplest version is:

out_plotind <- r_plotind("PCAobj_1")
plotly::ggplotly(out_plotind$Graphical$PlotInd)

More complex plots can be obtained using up to three additional variables (to be passed to the function in this order):

  • colour (numeric, factor or character; in the last case, the variable is silently converted to a factor)

  • shape (must be a factor, with a small number of levels)

  • size (must be numeric)

out_plotind <- r_plotind("PCAobj_1", axis1 = 1, axis2 = 3, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotind("PCAobj_1", axis1 = 1, axis2 = 2, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis",
                         "clinical", "patient.gender", "clinical",
                         "patient.samples.sample.2.portions.portion.analytes.analyte.aliquots.aliquot.concentration")
plotly::ggplotly(out_plotind$Graphical$PlotInd)

9.4 Display variables

The output of r_famd contains an object in Object, named PCAobj, which can be passed directly to the function r_plotvar. On the interface, the user simply has to select the numbers of the two PCs to display, which can be changed interactively to produce a new plot, and (optionally) a threshold level to display only certain variables. The threshold corresponds to an overall minimum correlation of the variables with the two displayed axes.

out_plotvar <- r_plotvar("PCAobj_1")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
out_plotvar <- r_plotvar("PCAobj_1", axis1 = 1, axis2 = 3, 
                         threshold.cor = 0.5)
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
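
One plausible reading of the threshold is sketched below with base R. The criterion used here (the length of the variable's arrow in the plane of the two displayed axes) is an assumption; the exact rule applied by r_plotvar is not stated in this chapter. The data (USArrests) and the names var_cor, quality and kept are also illustrative only.

```r
# Illustrative data: built-in USArrests, not the protein dataset.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Correlation of each original variable with each component.
var_cor <- cor(USArrests, pca$x)

axis1 <- 1
axis2 <- 2
threshold_cor <- 0.9

# Assumed criterion: length of the variable's arrow on the correlation
# circle for the plane (axis1, axis2); it lies between 0 and 1.
quality <- sqrt(var_cor[, axis1]^2 + var_cor[, axis2]^2)

# Variables that would still be displayed for this threshold.
kept <- names(quality)[quality >= threshold_cor]
print(round(quality, 3))
print(kept)
```

Raising the threshold keeps only the variables that are well represented on the two selected axes.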

9.5 Extract a new dataset

A new dataset can be extracted from the PCA analysis either

  • directly using the first PCs as new variables
out_extract <- r_extract_obj("PCAobj_1", criterion = "axes", ncp = 5)

It is contained in out_extract$Object$OBJextract and added to the workspace under the name PCAaxes_1 by r_wrapp.

  • using the PCs that carry most of the explained variance as new variables; the number of selected PCs is determined by the target cumulative percentage of explained variance
out_extract_2 <- r_extract_obj("PCAobj_1", criterion = "axes", cumVar = 50)

It is contained in out_extract_2$Object$OBJextract and added to the workspace under the name PCAaxes_1 by r_wrapp.

  • or using the variables most correlated with the first PCs
out_extract_3 <- r_extract_obj("PCAobj_1", criterion = "correlation", 
                               ncp = 5, threshold.cor = 0.8)

It is contained in out_extract_3$Object$OBJextract and added to the workspace under the name PCAsel_1 by r_wrapp.

I guess that this should be handled by a button extracting the first ncp PCs into a new dataset. Also, the dataset type is automatically set to

  • for the first and second cases: dataset and dataset, respectively
  • and to the same type as the original dataset for the third case (here, dataset)
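
As a rough illustration of the cumulative-variance criterion described in this section, the sketch below selects the smallest number of PCs whose cumulative percentage of explained variance reaches a target and builds a new dataset from them. It relies on prcomp and USArrests instead of r_extract_obj and PCAobj_1, and whether r_extract_obj uses "reaches" or "exceeds" for the target is an assumption.

```r
# Illustrative data: built-in USArrests, not the protein dataset.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
cum_pct <- 100 * cumsum(pca$sdev^2) / sum(pca$sdev^2)

cum_var <- 50  # target cumulative percentage, in the spirit of cumVar = 50

# Smallest number of components whose cumulative percentage reaches the target
# (assumed rule; the package may define the cut differently).
n_keep <- which(cum_pct >= cum_var)[1]

# New dataset: individuals in rows, selected PCs as variables.
new_dataset <- as.data.frame(pca$x[, seq_len(n_keep), drop = FALSE])
colnames(new_dataset) <- paste0("Dim.", seq_len(n_keep))
str(new_dataset)
```

The criterion = "axes", ncp = 5 case corresponds to fixing n_keep directly instead of deriving it from the cumulative percentages.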

9.6 Running an FAMD

FAMD stands for Factor Analysis of Mixed Data and is essentially an extension of PCA to categorical variables. FAMD is the default method of the function r_famd, although a PCA is performed when all variables are numerical, which is the case in the protein dataset. In FAMD, all outputs from PCA are kept and can be read and interpreted in the same way. In addition, a plot dedicated to categorical variables is output by r_plotvar. It shows the variables’ categories (levels) in projection on the PC axes, at the barycentre of the individuals who have them. Like numerical variables on the correlation circle, categories can be filtered by a correlation threshold. To illustrate this output, we will run an FAMD on the clinical dataset.

out_famd <- r_wrapp("r_famd", "clinical")
out_plotvar <- r_plotvar("PCAobj_2")
plotly::ggplotly(out_plotvar$Graphical$PlotCatVar)
out_plotvar <- r_plotvar("PCAobj_2", threshold.cor = 0.5)
plotly::ggplotly(out_plotvar$Graphical$PlotCatVar)
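
The position of a category at the barycentre of the individuals who have it can be illustrated with a short base R sketch. It is only an illustration of the barycentre idea: a plain PCA on the numeric columns of the built-in iris data stands in for the FAMD coordinates of the individuals, and the names ind_coord and barycentres are hypothetical.

```r
# Illustrative data: iris (4 numeric variables, 1 factor). A plain PCA on the
# numeric columns stands in for the FAMD coordinates of the individuals.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
ind_coord <- as.data.frame(pca$x[, 1:2])
colnames(ind_coord) <- c("Dim.1", "Dim.2")

# Barycentre of each category: mean coordinates of the individuals carrying it.
barycentres <- aggregate(ind_coord,
                         by = list(category = iris$Species),
                         FUN = mean)
print(barycentres)
```

Each row of barycentres gives the point at which the corresponding category would be drawn on the first two axes.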