Chapter 12 PLS-DA: Partial Least Squares Discriminant Analysis

12.1 Introduction

The purpose of this document is to describe a step-by-step Partial Least Squares Discriminant Analysis (PLS-DA) workflow.

12.2 Datasets used

For illustration purposes, we import one numerical dataset, proteins, and one general dataset containing categorical variables (one of which will be the target variable of the PLS-DA), clinical.

input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", sep = " ", row.names = 1,
         assign = TRUE)

## NULL

input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)

## NULL

12.3 Combination of one numeric dataset with one target

The function r_combine_target creates an object that can be used in any combination analysis that predicts (or explains) one categorical variable from a numeric dataset.

12.3.1 Inputs

X: a dataset with numeric variables in columns
Y: a dataset with at least one categorical variable
target: name of the target variable, chosen among the variables of $Y$

The function runs the following checks beforehand:

it checks that the target variable is categorical
it checks that the target variable has more than one level
it checks whether the target variable has any missing values

out_combtg <- r_wrapp("r_combine_target", X = "proteins", Y =  "clinical", 
                      target = "patient.vital_status")
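The three checks listed above can be sketched in base R. This is only an illustration on a toy data frame: the helper check_target and the column names are hypothetical, and the actual implementation inside r_combine_target is not shown in this chapter.

```r
# Toy stand-in for a dataset with a categorical variable (hypothetical values).
Y <- data.frame(
  status = factor(c("alive", "dead", NA, "alive")),
  age = c(54, 61, 47, 70),
  row.names = c("A", "B", "C", "D")
)

# Hypothetical helper mirroring the three checks described in the text.
check_target <- function(Y, target) {
  stopifnot(target %in% colnames(Y))
  y <- Y[[target]]
  list(
    is_categorical = is.factor(y) || is.character(y),  # categorical target?
    n_levels = length(unique(y[!is.na(y)])),           # more than one level?
    n_missing = sum(is.na(y))                          # any missing values?
  )
}

res <- check_target(Y, "status")
str(res)
```

How the real function reacts to a failed check (error, warning, or silent removal of missing entries) is not specified here, so the sketch only reports the three quantities.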

12.3.2 Outputs

This instruction returns the same type of output as the combination step of the PLS or MFA, i.e.:

Table: two tables, dataInfoBefore and dataInfoAfter, with descriptive statistics of the datasets before and after the combination (the combined dataset is restricted to individuals present in the numerical dataset and with a non-missing value of the target variable),
Graphical: an Upset plot, UpsetPlot, and a Venn diagram, VennPlot, that visualize the number of common and specific individuals between the numerical dataset and the non-missing entries of the target variable.

In addition, a new object is created in the workspace, which is a list with three entries (in addition to the two entries previously described):

dataGenInfo, which provides general information on the two input datasets:

multipleDA_1$dataGenInfo

##              name nrow ncol nbnum nbcat logt normalized
## proteins proteins  379  142   142     0                
## clinical clinical  989  217    94   123

Individuals (a vector with common individuals’ names):

head(multipleDA_1$Individuals)

## [1] "A0SH" "A0SJ" "A0SK" "A0SO" "A04N" "A04P"

TargetInfo, which lists the names of the two datasets and of the target variable:

multipleDA_1$TargetInfo

##                      X                      Y                 target 
##             "proteins"             "clinical" "patient.vital_status"
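The restriction to common individuals described in this section can be sketched in base R as follows. The toy data frames and their row names are hypothetical, and the sketch assumes that individuals are matched on row names; it does not reproduce the internals of r_combine_target.

```r
# Hypothetical toy inputs: row names are individual identifiers.
X <- data.frame(p1 = c(0.1, 0.4, 0.3, 0.8),
                row.names = c("A", "B", "C", "D"))
Y <- data.frame(status = factor(c("alive", NA, "dead", "dead")),
                row.names = c("B", "C", "D", "E"))

# Individuals of Y with a non-missing target value.
with_target <- rownames(Y)[!is.na(Y$status)]

# Individuals kept: present in X and with a known target value.
common <- intersect(rownames(X), with_target)

# Restrict both inputs to the common individuals, in the same order.
X_common <- X[common, , drop = FALSE]
y_common <- Y[common, "status"]
common
```

Here "A" is dropped because it is absent from Y, "C" because its target value is missing, and "E" because it is absent from X.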

12.4 Perform PLS-DA

12.4.1 Inputs

First, the function r_plsda is called to run the PLS-DA by passing the following arguments:

datasetName: the name of the combined dataset as described in the previous section (this name is given in the entry $Table$ObjectName of the previous step),
ncp: number of components kept in the final result. The default value is the minimum of the number of variables and the number of individuals in X (the numerical dataset),
seed: random seed, used for reproducibility.

The user defines the maximum number of components to keep. However, the optimal number of PLS-DA components to keep can be chosen using the ErrorRatePlot.

12.4.2 Outputs

The function returns:

Table: a table with ObjectName,
Graphical: three plots: ErrorRatePlot, which shows the evolution of the error rate for a number of components varying from 1 to ncp (the 5-fold CV error rate is computed with fold generation fixed by seed); Screegraph, the scree graph; and CumulativePercentagePlot, which shows the percentage of inertia reproduced by the PLS-DA components,
Object: the PLS object. If called using r_wrapp, an object named PLSDAobj_ID is added to the global environment.

out_plsda <- r_plsda("multipleDA_1", ncp = 20, seed = 5)

out_plsda_wrapped <- r_wrapp("r_plsda", "multipleDA_1", ncp = 20, seed = 5)
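To illustrate why the seed matters for the 5-fold CV error rate, the sketch below generates a reproducible fold assignment in base R and then picks the number of components with the lowest error. The helper make_folds is hypothetical, the error rates are made-up illustrative values (not results from the proteins data), and the way r_plsda actually builds its folds is an assumption.

```r
# Hypothetical helper: assign n individuals to k folds of (nearly) equal size.
make_folds <- function(n, k, seed) {
  set.seed(seed)                            # fixes the random permutation
  sample(rep(seq_len(k), length.out = n))   # shuffled fold labels
}

folds_a <- make_folds(n = 20, k = 5, seed = 5)
folds_b <- make_folds(n = 20, k = 5, seed = 5)

# Same seed, same folds: the CV error rate is therefore reproducible.
same_folds <- identical(folds_a, folds_b)
fold_sizes <- table(folds_a)

# Made-up CV error rates for 1 to 6 components (illustrative values only).
err <- c(0.42, 0.35, 0.31, 0.30, 0.31, 0.33)
best_ncp <- which.min(err)   # number of components with the lowest error
```

In practice the choice is usually made by reading ErrorRatePlot, for instance by keeping the smallest number of components after which the error rate stops decreasing noticeably, rather than by taking the strict minimum.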

12.4.2.1 Screegraph

out_plsda$Graphical$Screegraph

12.4.2.2 Classification error rate plot

out_plsda$Graphical$ErrorRatePlot

12.4.2.3 Cumulative percentage plot

out_plsda$Graphical$CumulativePercentagePlot

12.4.2.4 Object name

jsonlite::toJSON(out_plsda$Table, pretty = TRUE, auto_unbox = TRUE)

## {
##   "ObjectName": "PLSDAobj_1"
## }

12.5 Display individuals

The individual plot displays the projection of individuals on the computed components, which are meant to best discriminate between the different levels of the target variable. It is produced with the function r_plotplsind and, by default, displays the levels of the target variable in color. Colors, shapes and sizes can be further customized, as for the 5-variable plot of exploratory analyses.

out_plot <- r_plotplsind("PLSDAobj_1")
plotly::ggplotly(out_plot$Graphical$PlotInd)

out_plot <-  r_plotplsind("PLSDAobj_1",
                          axis1 = 3, axis2 = 4,
                          datasetcolor = "clinical", varcolor = "patient.gender",
                          datasetsize = "clinical", 
                          varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plot$Graphical$PlotInd)

12.6 Display variables

Variables are displayed using their correlations with the corresponding PLS-DA components, using the function r_plotplsvar. A correlation threshold, threshold.cor, can be specified to display only the most correlated variables. Note that the option mode cannot be used for PLS-DA (it is only valid for PLS).

out_plot <- r_plotplsvar("PLSDAobj_1")
plotly::ggplotly(out_plot$Graphical$PlotVar)

out_plot <- r_plotplsvar("PLSDAobj_1",
                         threshold.cor = 0.6)
plotly::ggplotly(out_plot$Graphical$PlotVar)

12.7 Extract new datasets

New datasets can be extracted from this analysis and added to the user’s workspace. The user can create either a dataset containing a set of components or a dataset containing the variables of $X$ that have the largest correlations with the components (these are the variables considered the most discriminant between the different levels of the target variable). The choice is made with the argument criterion, which can be either "axes" or "correlation". In both cases, the number of components considered is set with ncp; for the option "correlation", the correlation threshold is specified with the argument threshold.cor.

12.7.1 Axes

out_extract <- r_wrapp("r_extract_pls", "PLSDAobj_1", criterion = "axes", 
                       ncp = 3)

12.7.2 Correlation

out_extract <- r_wrapp("r_extract_pls", "PLSDAobj_1", criterion = "correlation",
                       ncp = 3, threshold.cor = 0.7)
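The difference between the two criteria can be sketched with a small base R function. The function extract_sketch, the simulated data, and the selection rule (largest absolute correlation over the first ncp components) are all hypothetical and only meant to show what kind of dataset each criterion yields; they are not the implementation of r_extract_pls.

```r
# Hypothetical sketch of the two extraction criteria.
extract_sketch <- function(X, scores, criterion = c("axes", "correlation"),
                           ncp = 2, threshold.cor = 0.7) {
  criterion <- match.arg(criterion)
  scores <- scores[, seq_len(ncp), drop = FALSE]   # first ncp components
  if (criterion == "axes") {
    # New dataset: coordinates of individuals on the components.
    return(as.data.frame(scores))
  }
  # New dataset: original variables sufficiently correlated with a component.
  max_cor <- apply(abs(cor(X, scores)), 1, max)
  as.data.frame(X[, max_cor >= threshold.cor, drop = FALSE])
}

set.seed(2)
# Simulated component scores and variables (v1 follows comp1, v2 is noise).
scores <- matrix(rnorm(300), ncol = 3,
                 dimnames = list(NULL, paste0("comp", 1:3)))
X <- cbind(v1 = scores[, 1] + rnorm(100, sd = 0.1),
           v2 = rnorm(100))

axes_data <- extract_sketch(X, scores, "axes", ncp = 3)
cor_data <- extract_sketch(X, scores, "correlation", ncp = 3,
                           threshold.cor = 0.7)
```

With "axes" the result has one column per component, whereas with "correlation" it keeps a subset of the original columns of X, so the two outputs are not interchangeable in downstream analyses.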