Chapter 12 PLS-DA: Partial Least Squares Discriminant Analysis
12.1 Introduction
The purpose of this document is to describe a step-by-step Partial Least Squares Discriminant Analysis (PLS-DA) workflow.
12.2 Used datasets
For illustration purposes, we import one numerical dataset, proteins, and a
general dataset, clinical, that contains categorical variables (one of which
will be the target variable of the PLS-DA).
input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", sep = " ", row.names = 1,
         assign = TRUE)
## NULL
input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)
## NULL
12.3 Combination of one numeric dataset with one target
The function r_combine_target creates an object that can be used in any
combination analysis that predicts (or explains) one categorical variable from
a numeric dataset.
12.3.1 Inputs
- X: a dataset with numeric variables in columns
- Y: a dataset with at least one categorical variable
- target: name of the target variable chosen among the \(Y\)-variables
The function runs the following checks beforehand:
- it checks that the target variable is categorical
- it checks that the target variable has more than one level
- it checks if the target variable has any missing value
out_combtg <- r_wrapp("r_combine_target", X = "proteins", Y = "clinical",
                      target = "patient.vital_status")
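The three checks listed above can be pictured with a few lines of base R. This is only a minimal sketch on a toy table: the helper check_target and the column names are hypothetical, and the code does not claim to reproduce how r_combine_target implements its checks or what it does when a check fails.

```r
# Toy stand-in for a general dataset; names and values are illustrative only.
Y <- data.frame(
  vital_status = factor(c("alive", "dead", NA, "alive")),
  age = c(61, 54, 70, 48),
  row.names = c("id1", "id2", "id3", "id4")
)

# Hypothetical helper (not part of the package described in this chapter).
check_target <- function(Y, target) {
  v <- Y[[target]]
  list(
    # categorical: stored as a factor or as character strings
    is_categorical = is.factor(v) || is.character(v),
    # more than one level among the non-missing entries
    has_several_levels = length(unique(v[!is.na(v)])) > 1,
    # number of missing entries in the target
    n_missing = sum(is.na(v))
  )
}

res <- check_target(Y, "vital_status")
str(res)
```

Running the same helper on a numeric column such as age returns is_categorical = FALSE, which is the situation the first check is meant to catch.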
12.3.2 Outputs
This instruction returns the same type of output as the combination step of the PLS or MFA, i.e.:
- Table: two tables, dataInfoBefore and dataInfoAfter, with descriptive statistics of the datasets before and after the combination (the combined dataset is restricted to the individuals that are both in the numerical dataset and among the non-missing entries of the target variable),
- Graphical: an UpSet plot, UpsetPlot, and a Venn diagram, VennPlot, that visualize the number of common and specific individuals between the numerical dataset and the non-missing entries of the target variable.
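The restriction to common individuals can be sketched in base R as follows. The identifiers and values are toy data, and the code only illustrates the set logic described above (common and specific individuals); it is not the package's own implementation.

```r
# Toy numerical dataset: individuals in rows (illustrative values only).
X <- matrix(1:8, nrow = 4,
            dimnames = list(c("s1", "s2", "s3", "s4"), c("p1", "p2")))

# Toy target variable, named by individual, with one missing entry.
target <- c(s2 = "alive", s3 = NA, s4 = "dead", s5 = "alive")

# Individuals with a non-missing target value.
with_target <- names(target)[!is.na(target)]

# Counts of the kind shown by a Venn diagram or an UpSet plot.
common      <- intersect(rownames(X), with_target)
only_X      <- setdiff(rownames(X), with_target)
only_target <- setdiff(with_target, rownames(X))

# Numerical dataset restricted to the common individuals.
X_common <- X[common, , drop = FALSE]
```

Here s3 is dropped because its target value is missing, even though it is present in both inputs.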
In addition, a new object is created in the workspace, which is a list with three entries (in addition to the two entries previously described):
dataGenInfo, which provides general information on the two input datasets:
multipleDA_1$dataGenInfo
##              name nrow ncol nbnum nbcat logt normalized
## proteins proteins  379  142   142     0
## clinical clinical  989  217    94   123
Individuals, a vector with the names of the common individuals:
head(multipleDA_1$Individuals)
## [1] "A0SH" "A0SJ" "A0SK" "A0SO" "A04N" "A04P"
TargetInfo, which lists the names of the two datasets and of the target variable:
multipleDA_1$TargetInfo
## X Y target
## "proteins" "clinical" "patient.vital_status"
12.4 Perform PLS-DA
12.4.1 Inputs
First, the function r_plsda is called to run the PLS-DA by passing the
following arguments:
- datasetName: the name of the combined dataset as described in the previous section (this name is given in the entry $Table$ObjectName of the previous step),
- ncp: number of loadings kept in the final result. The default value is the minimum of the number of variables and the number of individuals in X (the numerical dataset),
- seed: seed of the random number generator, for reproducibility.
The user defines the maximum number of components to keep. However, the optimal
number of PLS-DA components to keep can be chosen using the ErrorRatePlot.
12.4.2 Outputs
Then, it returns:
- Table: a table with ObjectName,
- Graphical: three plots: ErrorRatePlot, the plot of the evolution of the error rate for a number of components varying from 1 to ncp (the 5-fold CV error rate is computed with fold generation fixed by seed), Screegraph, the scree graph, and CumulativePercentagePlot, the plot of the percentage of reproduced inertia on the PLS-DA components,
- Object: the PLS object. If called using r_wrapp, an object named PLSDAobj_ID is added to the global environment.
out_plsda <- r_plsda("multipleDA_1", ncp = 20, seed = 5)
out_plsda_wrapped <- r_wrapp("r_plsda", "multipleDA_1", ncp = 20, seed = 5)
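To see why fixing seed makes the 5-fold CV error rate reproducible, the base R sketch below generates fold labels twice with the same seed. The helper make_folds is hypothetical and the way r_plsda actually builds its folds is not documented here, so this is an assumption-laden illustration of the principle only.

```r
# Hypothetical helper: assign n individuals to k folds of near-equal size.
# Note that set.seed() resets the global random number generator.
make_folds <- function(n, k = 5, seed = 5) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

folds_a <- make_folds(23)
folds_b <- make_folds(23)

# Same seed, same folds: any error rate computed on them is reproducible.
identical(folds_a, folds_b)

# Fold sizes differ by at most one individual.
table(folds_a)
```

With a different seed the individuals are generally split differently, so the error rate curve can change slightly from one run to the next.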
12.5 Display individuals
The individual plot displays the projection of the individuals on the computed
components, which are expected to best discriminate between the different
levels of the target variable. It is obtained with the function r_plotplsind
and, by default, displays the levels of the target variable in color. Colors,
shapes and sizes can be further customized as for the 5-variable plot of
exploratory analyses.
out_plot <- r_plotplsind("PLSDAobj_1")
plotly::ggplotly(out_plot$Graphical$PlotInd)
out_plot <- r_plotplsind("PLSDAobj_1",
                         axis1 = 3, axis2 = 4,
                         datasetcolor = "clinical", varcolor = "patient.gender",
                         datasetsize = "clinical",
                         varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plot$Graphical$PlotInd)
12.6 Display variables
Variables are displayed through their correlations with the corresponding PLS-DA
components using the function r_plotplsvar. A correlation threshold,
threshold.cor, can be specified to display only the most correlated variables.
Note that the option mode cannot be used for PLS-DA (it is only valid for
PLS).
out_plot <- r_plotplsvar("PLSDAobj_1")
plotly::ggplotly(out_plot$Graphical$PlotVar)
out_plot <- r_plotplsvar("PLSDAobj_1",
                         threshold.cor = 0.6)
plotly::ggplotly(out_plot$Graphical$PlotVar)
12.7 Extract new datasets
New datasets can be extracted from this analysis and added to the user’s
workspace. The user can create either a dataset containing a set of components
or a dataset containing the variables of \(X\) that have the largest
correlations with the components (these are the variables considered the most
discriminant between the different levels of the target variable). The choice
between the two is set with criterion, which can be either "axes" or
"correlation". In both cases, the number of components considered is set with
ncp; for the option "correlation", the correlation threshold is specified with
the argument threshold.cor.
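The two criteria can be pictured with base R on simulated data. In this sketch the component scores are random stand-ins rather than real PLS-DA components, and the selection rule (keep a variable when its largest absolute correlation with one of the first ncp components reaches threshold.cor) is an assumption made for illustration; the extraction function may apply a different rule.

```r
set.seed(1)
n <- 30

# Stand-in for the scores of the individuals on two components.
scores <- matrix(rnorm(n * 2), ncol = 2,
                 dimnames = list(NULL, c("comp1", "comp2")))

# Toy numerical dataset: v1 and v2 follow one component each, v3 is noise.
X <- cbind(
  v1 = scores[, 1] + rnorm(n, sd = 0.1),
  v2 = scores[, 2] + rnorm(n, sd = 0.1),
  v3 = rnorm(n)
)

ncp <- 2
threshold.cor <- 0.6

# criterion = "axes": the new dataset contains the first ncp components.
axes_dataset <- scores[, seq_len(ncp), drop = FALSE]

# criterion = "correlation": the new dataset contains the selected variables.
cors <- cor(X, scores[, seq_len(ncp), drop = FALSE])
keep <- apply(abs(cors), 1, max) >= threshold.cor
X_selected <- X[, keep, drop = FALSE]
colnames(X_selected)
```

Lowering threshold.cor keeps more variables, while raising ncp lets variables that are correlated with later components enter the selection.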