Chapter 9 PCA: Principal Component Analysis
This file describes the steps of a PCA workflow and comments on the inputs and outputs of the functions needed to perform it.
The following lines can safely be ignored. They are used to set up my working directory for this document.
9.1 Used datasets
The workflow will be illustrated on the protein dataset (main dataset) and the clinical dataset (used as a metadata dataset, which provides additional information to display alongside the results for the main dataset).
proteins <- read.table("../forge/backend/R/data/protein.csv", sep = " ",
                       quote = '\"', dec = ".", row.names = 1)
clinical <- read.table("../forge/backend/R/data/clinical.csv", sep = ",",
                       quote = '\"', dec = ".", row.names = 1)
9.2 Running a PCA
9.2.1 First call
The user chooses the menu “PCA” and selects the dataset protein (one of the datasets with more than 3 numerical variables). The number of kept PCs is set to NULL here (which means that the maximum possible number of PCs is kept), but it can otherwise be any integer greater than 1 (if it is larger than the maximum possible number of PCs, it is automatically corrected by r_famd).
out_pca <- r_famd("proteins", pca=TRUE)
This function produces two plots, both displayed to the user:
plotly::ggplotly(out_pca$Graphical$Screegraph)
plotly::ggplotly(out_pca$Graphical$CumulativePercentagePlot)
and r_wrapp creates an object named PCAobj_1 in the working directory, which is not directly accessible by the user. The object name is stored in the entry $Table$ObjectName of the output and returned as
jsonview::json_tree_view(jsonlite::toJSON(out_pca$Table$ObjectName, pretty = TRUE),
                         height = '100px')
Note: The combination of the two previous command lines is equivalent to this simple call, which is what has to be done on the server side:
out_pca_wrapped <- r_wrapp("r_famd", datasetName = "proteins", pca=TRUE)
The percentage of explained variance per component and the cumulative percentage of explained variance are stored in the entry $Table$Inertia of the output and returned as
jsonview::json_tree_view(jsonlite::toJSON(out_pca$Table$Inertia, pretty = TRUE),
                         scroll = TRUE)
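To make the content of the inertia table concrete, here is a minimal sketch of how the two percentages can be computed. It uses base R's prcomp on the built-in USArrests dataset as a stand-in; it is not the code of r_famd, and the column names of the inertia data frame are assumptions chosen for illustration.

```r
# Illustration only: base R prcomp() on a built-in dataset, not r_famd().
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca$sdev^2

# Hypothetical layout of an inertia table: one row per component.
inertia <- data.frame(
  component  = paste0("Dim.", seq_along(eigenvalues)),
  percentage = 100 * eigenvalues / sum(eigenvalues)
)
inertia$cumulative <- cumsum(inertia$percentage)

print(inertia)
```

The last value of the cumulative column is always 100, since all components together carry the whole variance.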
9.2.2 Information on how r_famd works to perform different PCA on the same dataset
If a PCA has already been performed on a given dataset with the same number ncp, r_famd loads the previous analysis and returns only the plots (and the object name as described above) but not the object.
If a PCA has already been performed on a given dataset with a larger number ncp than the one now requested, r_famd loads the previous analysis, extracts the requested number of PCs from it and returns the plots and a truncated object (with its name).
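The reuse logic described above can be sketched as follows. This is an assumption-laden illustration, not the implementation of r_famd: the cache environment pca_cache, the function cached_pca and its status field are all hypothetical, and prcomp stands in for the actual analysis. The sketch also shows the automatic correction of a missing or too large ncp.

```r
# Hypothetical in-memory cache; the real storage used by r_famd/r_wrapp
# is not shown in this document.
pca_cache <- new.env()

cached_pca <- function(dataset_name, data, ncp = NULL) {
  max_ncp <- min(nrow(data) - 1, ncol(data))
  # NULL means "keep as many PCs as possible"; too large values are clamped.
  if (is.null(ncp) || ncp > max_ncp) ncp <- max_ncp

  previous <- pca_cache[[dataset_name]]
  if (!is.null(previous) && ncol(previous$coord) >= ncp) {
    # Same ncp: reuse as is. Smaller ncp: cut the stored coordinates.
    status <- if (ncol(previous$coord) == ncp) "reused" else "truncated"
    return(list(status = status,
                coord = previous$coord[, seq_len(ncp), drop = FALSE]))
  }

  # No usable previous analysis: compute and store a new one.
  fit <- prcomp(data, center = TRUE, scale. = TRUE)
  coord <- fit$x[, seq_len(ncp), drop = FALSE]
  colnames(coord) <- paste0("Dim.", seq_len(ncp))
  pca_cache[[dataset_name]] <- list(coord = coord)
  list(status = "computed", coord = coord)
}

first  <- cached_pca("arrests", USArrests)           # no previous analysis
second <- cached_pca("arrests", USArrests, ncp = 4)  # same ncp as before
third  <- cached_pca("arrests", USArrests, ncp = 2)  # smaller ncp
```

In this sketch a request for more PCs than were stored simply triggers a new computation; the document does not say how r_famd handles that case.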
9.3 Display individuals
The output of r_famd contains two datasets that correspond to individual and variable coordinates. The number of columns in these datasets is the number of PCs. Individuals are displayed using the function r_multivariate_dotplot, in which the first two datasets are necessarily out_pca$Object$PCAobj$ind$coord and the first two variables are of the form Dim.X, where X is a number between 1 and ncp. The simplest version will be:
out_plotind <- r_plotind("PCAobj_1")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
More complicated outputs can be obtained using at most 3 different other variables (to be passed to the function in that order):
- colour (can be numeric or factor or character; in the last case, the variable is silently changed into factor)
- shape (necessarily factor but with not too many values)
- size (necessarily numeric)
out_plotind <- r_plotind("PCAobj_1", axis1 = 1, axis2 = 3, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotind("PCAobj_1", axis1 = 1, axis2 = 2, "clinical",
                         "patient.age_at_initial_pathologic_diagnosis",
                         "clinical", "patient.gender", "clinical",
                         "patient.samples.sample.2.portions.portion.analytes.analyte.aliquots.aliquot.concentration")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
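As a rough illustration of what happens when a metadata variable is used for colour, the sketch below matches a metadata column to the individual coordinates by row name and silently converts a character variable into a factor. It is only a sketch under assumptions: prcomp on USArrests replaces PCAobj_1, the metadata data frame and its region column are made up, and base graphics are used instead of the plot returned by r_plotind.

```r
# Illustration only: prcomp() and made-up metadata stand in for
# PCAobj_1 and the clinical dataset.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
coord <- as.data.frame(pca$x)
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Hypothetical metadata sharing its row names with the main dataset.
metadata <- data.frame(
  region = ifelse(USArrests$UrbanPop > 65, "urban", "rural"),
  row.names = rownames(USArrests),
  stringsAsFactors = FALSE
)

# Align metadata with individuals by row name, then change a character
# variable into a factor, as described for the colour argument.
colour_var <- metadata[rownames(coord), "region"]
if (is.character(colour_var)) colour_var <- factor(colour_var)

# Individuals on axes 1 and 3, coloured by the metadata variable.
plot(coord$Dim.1, coord$Dim.3, col = as.integer(colour_var), pch = 19,
     xlab = "Dim.1", ylab = "Dim.3")
legend("topright", legend = levels(colour_var),
       col = seq_along(levels(colour_var)), pch = 19)
```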
9.4 Display variables
The output of r_famd contains an object in Object, named PCAobj, which can be passed directly to the function r_plotvar. On the interface, the user simply has to select the numbers of the two PCs to display, which can be changed interactively to produce a new plot, and (optionally) to select a threshold level to display only certain variables. The threshold corresponds to an overall minimum correlation of the variables with the two displayed axes.
out_plotvar <- r_plotvar("PCAobj_1")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
out_plotvar <- r_plotvar("PCAobj_1", axis1 = 1, axis2 = 3,
                         threshold.cor = 0.5)
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
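One possible reading of the correlation threshold is sketched below. The exact rule applied by r_plotvar is not given in this document, so the rule used here is an assumption: a variable is kept when its distance from the origin on the selected plane of the correlation circle reaches the threshold. As before, prcomp on USArrests is only a stand-in for the stored PCA object.

```r
# Illustration only: base R prcomp(), not the object stored as PCAobj_1.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Correlation of each original variable with each PC: these are the
# coordinates of the variables on the correlation circle.
var_cor <- cor(USArrests, pca$x)
colnames(var_cor) <- paste0("Dim.", seq_len(ncol(var_cor)))

axis1 <- 1
axis2 <- 3
threshold_cor <- 0.5

# Assumed rule: keep a variable when its distance from the origin on
# the (axis1, axis2) plane is at least the threshold.
quality <- sqrt(var_cor[, axis1]^2 + var_cor[, axis2]^2)
kept <- names(quality)[quality >= threshold_cor]

print(round(quality, 2))
print(kept)
```

With this rule, a threshold of 0 displays every variable and a threshold close to 1 keeps only the variables that are almost perfectly represented on the two axes.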
9.5 Extract a new dataset
A new dataset can be extracted from the PCA analysis either
- directly using the first PCs as new variables
out_extract <- r_extract_obj("PCAobj_1", criterion = "axes", ncp = 5)
It is contained in out_extract$Object$OBJextract and added to the workspace under the name PCAaxes_1 by r_wrapp.
- using PCs containing most of the explained variance as new variables. The number of selected PCs depends on the cumulative percentage of explained variance we want to extract
out_extract_2 <- r_extract_obj("PCAobj_1", criterion = "axes", cumVar = 50)
It is contained in out_extract_2$Object$OBJextract and added to the workspace under the name PCAaxes_1 by r_wrapp.
- or using the variables the most correlated to the first PCs
out_extract_3 <- r_extract_obj("PCAobj_1", criterion = "correlation",
                               ncp = 5, threshold.cor = 0.8)
It is contained in out_extract_3$Object$OBJextract and added to the workspace under the name PCAsel_1 by r_wrapp.
I guess that this should be handled by a button extracting the first ncp PCs into a new dataset. Also, the dataset type is automatically set to
- dataset and dataset, respectively, for the first and second cases
- the same type as the original dataset for the third case (here, dataset)
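The three extraction criteria can be sketched in base R as follows. This is not the code of r_extract_obj: prcomp on USArrests replaces the stored PCA object, the variable names are illustrative, and the selection rule of the third case (a variable is kept when its absolute correlation with at least one of the first ncp PCs reaches the threshold) is an assumption.

```r
# Illustration only: base R prcomp(), not r_extract_obj().
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
coord <- pca$x
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Case 1: the first ncp PCs become the new variables.
ncp <- 2
axes_fixed <- as.data.frame(coord[, seq_len(ncp), drop = FALSE])

# Case 2: smallest number of PCs whose cumulative percentage of
# explained variance reaches cum_var.
cum_var <- 50
cumulative <- 100 * cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_needed <- which(cumulative >= cum_var)[1]
axes_cumvar <- as.data.frame(coord[, seq_len(n_needed), drop = FALSE])

# Case 3 (assumed rule): original variables whose absolute correlation
# with at least one of the first ncp PCs reaches the threshold.
threshold_cor <- 0.8
var_cor <- cor(USArrests, coord[, seq_len(ncp), drop = FALSE])
selected <- rownames(var_cor)[apply(abs(var_cor) >= threshold_cor, 1, any)]
vars_selected <- USArrests[, selected, drop = FALSE]
```

The first two cases return PC coordinates as new variables, whereas the third returns a subset of the original variables, which is consistent with the third case keeping the type of the original dataset.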
9.6 Running an FAMD
FAMD stands for Factor Analysis of Mixed Data and is essentially an extension of PCA to categorical variables. FAMD is the default method of the function r_famd, although a PCA is performed when all variables are numerical, which is the case in the protein dataset. In FAMD, all outputs from PCA are kept and can be read and interpreted in the same way. In addition, a plot dedicated to categorical variables is output by r_plotvar. It shows the variables’ categories (levels) in projection on the PC axes, at the barycentre of the individuals who have them. Like numerical variables on the correlation circle, categories can be filtered by a correlation threshold. To illustrate this output, we will run an FAMD on the clinical dataset.
out_famd <- r_wrapp("r_famd", "clinical")
out_plotvar <- r_plotvar("PCAobj_2")
plotly::ggplotly(out_plotvar$Graphical$PlotCatVar)
out_plotvar <- r_plotvar("PCAobj_2", threshold.cor = 0.5)
plotly::ggplotly(out_plotvar$Graphical$PlotCatVar)
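The position of a category at the barycentre of its individuals can be illustrated with a short sketch. It is a simplification under stated assumptions: a plain PCA is run with prcomp on the numerical columns of the built-in iris dataset and Species is only used afterwards, whereas in an actual FAMD the categorical variables also contribute to building the axes.

```r
# Illustration only: plain PCA on the numerical columns of iris, with
# Species treated as a category to place on the first two axes.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
coord <- as.data.frame(pca$x[, 1:2])
colnames(coord) <- c("Dim.1", "Dim.2")

# Barycentre of a category: mean coordinates of the individuals
# belonging to that category.
barycentres <- aggregate(coord, by = list(Species = iris$Species), FUN = mean)

print(barycentres)
```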