Chapter 11 PLS: Partial Least Squares

11.1 Introduction

The purpose of this document is to describe a step-by-step Partial Least Squares (PLS) workflow.

11.1.1 Used datasets

For illustrative purposes, we load two biological datasets, mrna and proteins. Note that this method does not apply to datasets containing categorical variables. We also load clinical to help interpret the results. Finally, we create a small dataset, qmrna, containing only the first two variables of mrna, since PLS works differently in this case.

input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)

input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", sep = " ", row.names = 1, assign = TRUE)

input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)

qmrna <- mrna[, 1:2]
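
Since PLS does not apply to categorical variables, it can be useful to check a dataset before combining it. The following is a minimal sketch in base R; the helper all_numeric and the toy data are hypothetical and are not part of the package.

```r
# Hypothetical helper: TRUE when every column of a data.frame is numeric.
all_numeric <- function(df) {
  all(vapply(df, is.numeric, logical(1)))
}

# Toy data standing in for an imported dataset (values are made up).
toy <- data.frame(gene_a = c(1.2, 0.7, 3.1),
                  gene_b = c(0.4, 2.2, 1.9),
                  row.names = c("s1", "s2", "s3"))
# Same data with an added categorical column.
toy_bad <- cbind(toy, group = c("a", "b", "a"))

all_numeric(toy)      # TRUE
all_numeric(toy_bad)  # FALSE
```

The same check could be applied to mrna, proteins and qmrna once they are loaded.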

11.1.2 Combined datasets

Prior to the PLS analysis, the datasets must be combined into single objects using the function r_combine_datasets, which is documented here.

out_comb <- r_wrapp("r_combine_datasets", list("proteins", "mrna"))
out_comb2 <- r_wrapp("r_combine_datasets", list("proteins", "qmrna"))
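
PLS relates two blocks measured on the same individuals, so both blocks must have matching rows. The internals of r_combine_datasets are not described here; the sketch below only illustrates, in base R and with made-up data, one way to align two data frames on their row names. It is an assumption, not the function's actual implementation.

```r
# Toy blocks with partially overlapping individuals (made-up values).
block_x <- data.frame(p1 = c(1, 2, 3), p2 = c(4, 5, 6),
                      row.names = c("s1", "s2", "s3"))
block_y <- data.frame(g1 = c(9, 8, 7),
                      row.names = c("s3", "s1", "s4"))

# Keep only individuals present in both blocks, in the same order.
common <- intersect(rownames(block_x), rownames(block_y))
x_aligned <- block_x[common, , drop = FALSE]
y_aligned <- block_y[common, , drop = FALSE]

# Both blocks now describe the same individuals, row by row.
stopifnot(identical(rownames(x_aligned), rownames(y_aligned)))
```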

11.1.3 Required library

The main library loaded (mixOmics) is part of the Bioconductor repository. This package was designed in order to provide “a wide range of multivariate methods for the exploration and integration of biological datasets with a particular focus on variable selection” as explained here.

11.2 Perform PLS

First, the r_pls function is called to run the PLS by passing the following argument:

  • ncp: Number of variates kept in the final result. The default value is based on the dimensions of the datasets and never exceeds 20.

Then, it returns:

  • Table: a table with the name of the output and the mode parameter from mixOmics::pls (“canonical” if both initial datasets have at least 3 variables each, else “regression”). More information on these modes can be found here.
  • Graphical: a screegraph and cumulative percentage plot.
  • Object: the PLS object.

out_pls <- r_pls("combinedDF_1", ncp = 5)
out_pls2 <- r_pls("combinedDF_2", ncp = 5)
out_pls_wrapped <- r_wrapp("r_pls", "combinedDF_1", ncp = 5)
out_pls_wrapped2 <- r_wrapp("r_pls", "combinedDF_2", ncp = 5)
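
The mode rule stated above (canonical when both datasets have at least 3 variables, regression otherwise) can be sketched as a small function. The name choose_pls_mode is hypothetical and this is not the code used inside r_pls; it only restates the rule.

```r
# Hypothetical helper restating the mode rule; not r_pls's actual code.
choose_pls_mode <- function(n_var_x, n_var_y) {
  if (n_var_x >= 3 && n_var_y >= 3) "canonical" else "regression"
}

choose_pls_mode(50, 200)  # "canonical"
choose_pls_mode(50, 2)    # "regression"
```

Under this rule, combining proteins with qmrna (two variables) gives regression mode, which is why combinedDF_2 is used to illustrate that case.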

11.2.1 Outputs

Graphical outputs include a screegraph and a cumulative percentage plot. In regression mode (as defined above), only the information on the X variates is displayed.

11.2.1.1 A screegraph

In canonical mode:

plotly::ggplotly(out_pls$Graphical$Screegraph)

In regression mode:

plotly::ggplotly(out_pls2$Graphical$Screegraph)

11.2.1.2 A cumulative percentage plot

In canonical mode:

plotly::ggplotly(out_pls$Graphical$CumulativePercentagePlot)

In regression mode:

plotly::ggplotly(out_pls2$Graphical$CumulativePercentagePlot)
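
To make the link between the two plots concrete, here is a minimal base R sketch, assuming the screegraph shows the share of variance explained by each variate and the cumulative plot shows its running total. The proportions are made-up values, not results from the datasets above.

```r
# Illustrative proportions of explained variance for 5 variates (made up).
prop_var <- c(0.40, 0.25, 0.15, 0.10, 0.05)

# Percentage per variate (bar heights of a scree graph).
pct <- 100 * prop_var

# Running total (values of a cumulative percentage plot).
cum_pct <- cumsum(pct)

data.frame(variate = seq_along(pct), percent = pct, cumulative = cum_pct)
```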

11.2.1.3 Object name

jsonlite::toJSON(out_pls$Table, pretty = TRUE, auto_unbox = TRUE)
## {
##   "Mode": "canonical",
##   "ObjectName": "PLSobj_1"
## }

11.3 Display individuals

Individuals can be displayed on the PLS variates using the function r_plotplsind, whose parameters are listed below. It outputs one graphical element.

param | short desc | class | required | default | description
--- | --- | --- | --- | --- | ---
datasetName | name of a PLS object | character | required | none |
axis1 | variate index for the x axis | integer | optional | 1 |
axis2 | variate index for the y axis | integer | optional | 2 |
datasetcolor | dataset for colour | character | optional | NULL | Name of the dataset containing the variable to use to colour the dots.
varcolor | colour variable | character | optional | NULL | Name of the variable to use to colour the dots.
datasetshape | dataset for shape | character | optional | NULL | Name of the dataset containing the variable to use to assign shapes.
varshape | shape variable | character | optional | NULL | Name of the variable to use to assign shapes.
datasetsize | dataset for size | character | optional | NULL | Name of the dataset containing the variable to use to set dot sizes.
varsize | size variable | character | optional | NULL | Name of the variable to use to set dot sizes.

In canonical mode, the plot displays, for each individual, the average of its X-space and Y-space scores on the two selected axes.

out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2,
                            datasetcolor = "clinical", 
                            varcolor = "patient.gender",
                            datasetsize = "clinical", 
                            varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
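
The averaging used in canonical mode can be illustrated with a short base R sketch. The score matrices below are made-up values; in practice they would come from the PLS object, whose internal structure is not described here.

```r
# Toy X-space and Y-space scores: 4 individuals, 2 variates (made-up values).
scores_x <- matrix(c(1.0, -0.5,  0.2, -0.7,
                     0.3,  0.8, -0.6, -0.5), ncol = 2,
                   dimnames = list(paste0("s", 1:4),
                                   c("variate1", "variate2")))
scores_y <- matrix(c(0.8, -0.3,  0.0, -0.5,
                     0.5,  0.6, -0.8, -0.3), ncol = 2,
                   dimnames = dimnames(scores_x))

# Element-wise mean of the two spaces: one plotted point per individual.
scores_avg <- (scores_x + scores_y) / 2
scores_avg
```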

In regression mode, the plot only displays X-space scores for the two selected axes. Note that by default, the dots are coloured by the first Y variable and the dot sizes are set by the second Y variable (if any).

out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind2$Graphical$PlotInd)
out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2,
                             datasetcolor = "clinical", 
                             varcolor = "patient.gender",
                             datasetsize = "clinical", 
                             varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind2$Graphical$PlotInd)

11.4 Display variables

The correlations between the input dataset features and the chosen variates can be displayed using the function r_plotplsvar, whose parameters are listed below. It outputs one graphical element.

param | short desc | class | required | default | description
--- | --- | --- | --- | --- | ---
datasetName | name of a PLS object | character | required | none |
axis1 | variate index for the x axis | integer | optional | 1 |
axis2 | variate index for the y axis | integer | optional | 2 |
threshold.cor | correlation threshold | numerical | optional | 0 | Global correlation threshold (with the two axes together) to select variables to be displayed.
mode | overlap X- and Y-representation plots | character | optional | separated | If "together", plots the variables in a single graph. If "separated", plots the variables in two individual graphs.

out_plotvar <- r_plotplsvar("PLSobj_1", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
out_plotvar <- r_plotplsvar("PLSobj_1", mode = "separated", axis1 = 1, 
                            axis2 = 2, threshold.cor = 0.6)
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
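
The exact definition of the "global" correlation threshold is not given here. One plausible reading, shown below as an assumption only, is the distance of each variable from the origin of the correlation circle, that is, the square root of the sum of its squared correlations with the two axes. The data and variates are simulated toy values.

```r
set.seed(1)
# Toy data: 20 individuals, 4 features (simulated values).
x <- matrix(rnorm(20 * 4), nrow = 20,
            dimnames = list(NULL, paste0("feat", 1:4)))
# Toy "variates" built so that feat1 and feat2 are strongly related to them.
variates <- cbind(variate1 = x[, 1] + 0.1 * rnorm(20),
                  variate2 = x[, 2] + 0.1 * rnorm(20))

# Correlation of each feature (rows) with each variate (columns).
cors <- cor(x, variates)

# Assumed "global" criterion: distance from the origin using both axes.
global_cor <- sqrt(rowSums(cors^2))
kept <- names(global_cor)[global_cor >= 0.6]
kept
```

With threshold.cor = 0 (the default), every variable passes such a filter, which is consistent with all variables being displayed by default.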


In regression mode, only the correlations in the X-space are represented.

out_plotvar <- r_plotplsvar("PLSobj_2", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)

11.5 Extract new datasets

New datasets can be extracted and added to the user’s workspace using the function r_extract_pls, whose parameters are listed below. The user can create either a dataset containing a set of variates from each space (X and Y), or a dataset containing variables from both input datasets, selected using a minimum correlation threshold.

param | short desc | class | required | default | description
--- | --- | --- | --- | --- | ---
datasetName | name of a PLS object | character | required | none |
userName | name of the object on the interface | character | optional | NULL | Cannot be a name already given to another object.
criterion | criterion to perform extraction | character | optional | “axes” | Criterion to extract a new data.frame. Either the first variates are extracted (“axes”) or the variables that are the most correlated with the first axes (“correlation”).
ncp | number of variates to consider | integer | required | none | Number of variates to consider, either for a direct extraction or to compute correlations with the original variables.
threshold.cor | correlation threshold | numerical | optional | 0 | Global correlation threshold (with several axes together) to select variables to be extracted.

The function outputs the extracted data.frame and a Table element containing the object’s name and user name. If the dataset has already been extracted, the function only modifies the user name and does not return the object.

  • Axes:

out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "axes", ncp = 3)

  • Correlation:

out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "correlation",
                       ncp = 3, threshold.cor = 0.5)

Note that in regression mode, only the X-space is taken into account when extracting variates or computing correlations. In this case, Y-space variates or variables are not extracted.
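
As an illustration of the "axes" criterion, here is a minimal base R sketch that keeps the first ncp variates of each space and omits the Y-space when it is not supplied (as in regression mode). The function extract_axes, its column names, and the toy scores are all hypothetical; this is not the code of r_extract_pls.

```r
# Hypothetical sketch of the "axes" criterion; not r_extract_pls's actual code.
extract_axes <- function(scores_x, scores_y = NULL, ncp) {
  out <- scores_x[, seq_len(ncp), drop = FALSE]
  colnames(out) <- paste0("X_variate", seq_len(ncp))
  if (!is.null(scores_y)) {
    # Canonical mode: also keep the first ncp Y-space variates.
    y_part <- scores_y[, seq_len(ncp), drop = FALSE]
    colnames(y_part) <- paste0("Y_variate", seq_len(ncp))
    out <- cbind(out, y_part)
  }
  as.data.frame(out)
}

set.seed(42)
tx <- matrix(rnorm(15), nrow = 5)  # toy X-space scores, 3 variates
ty <- matrix(rnorm(15), nrow = 5)  # toy Y-space scores, 3 variates

names(extract_axes(tx, ty, ncp = 2))  # X and Y variates (canonical mode)
names(extract_axes(tx, ncp = 2))      # X variates only (regression mode)
```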