Chapter 11 PLS: Partial Least Squares

11.1 Introduction

The purpose of this document is to describe a step-by-step Partial Least Squares (PLS) workflow.

11.1.1 Used datasets

For illustrative purposes, we import two biological datasets, mrna and proteins. Note that this method does not apply to datasets containing categorical variables. We also load clinical to help interpret the results. Finally, we create a small dataset, qmrna, containing only two variables, since PLS then runs in regression mode rather than canonical mode (see Section 11.2).

input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)

input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", sep = " ", row.names = 1, assign = TRUE)

input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)

qmrna <- mrna[, 1:2]
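
Because the method does not apply to categorical variables, it can help to check column types before combining datasets. The following is a minimal sketch in base R; the helper all_numeric and the toy data frames are hypothetical and are not part of the functions used in this chapter.

```r
# Hypothetical helper (not part of the documented workflow):
# returns TRUE only if every column of a data frame is numeric.
all_numeric <- function(df) {
  all(vapply(df, is.numeric, logical(1)))
}

# Toy data invented for illustration.
toy_ok  <- data.frame(g1 = c(1.2, 0.4, 3.1), g2 = c(0.5, 2.2, 1.9))
toy_bad <- data.frame(g1 = c(1.2, 0.4, 3.1),
                      grade = factor(c("I", "II", "I")))

all_numeric(toy_ok)   # TRUE
all_numeric(toy_bad)  # FALSE: the factor column would need to be dropped first
```

The same check could be applied to mrna, proteins and qmrna once they are loaded.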

11.1.2 Combined datasets

Prior to the PLS analysis, the datasets must be combined into single objects using the function r_combine_datasets, which is documented here.

out_comb <- r_wrapp("r_combine_datasets", list("proteins", "mrna"))
out_comb2 <- r_wrapp("r_combine_datasets", list("proteins", "qmrna"))

11.1.3 Required library

The main library used, mixOmics, is distributed through the Bioconductor repository. It was designed to provide “a wide range of multivariate methods for the exploration and integration of biological datasets with a particular focus on variable selection” as explained here.

11.2 Perform PLS

First, the r_pls function is called to run the PLS. It takes the following argument:

ncp: Number of variates kept in the final result. The default value is based on the dimensions of the datasets and never exceeds 20.

Then, it returns:

Table: a table with the name of the output and the mode parameter from mixOmics::pls (“canonical” if both initial datasets have at least 3 variables each, else “regression”). More information on these modes can be found here.
Graphical: a screegraph and cumulative percentage plot.
Object: the PLS object.

out_pls <- r_pls("combinedDF_1", ncp = 5)
out_pls2 <- r_pls("combinedDF_2", ncp = 5)

out_pls_wrapped <- r_wrapp("r_pls", "combinedDF_1", ncp = 5)
out_pls_wrapped2 <- r_wrapp("r_pls", "combinedDF_2", ncp = 5)
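
The rule that determines the mode reported in Table can be restated as a short sketch. The helper choose_pls_mode is hypothetical and only mirrors the description above (at least 3 variables in each dataset gives “canonical”, otherwise “regression”); it is not the actual code of r_pls, and the variable counts are invented.

```r
# Hypothetical helper restating the rule from the text; not the r_pls source.
choose_pls_mode <- function(n_vars_x, n_vars_y) {
  if (n_vars_x >= 3 && n_vars_y >= 3) "canonical" else "regression"
}

# Two datasets with many variables each (counts invented for illustration).
choose_pls_mode(50, 200)  # "canonical"

# One dataset reduced to two variables, as with qmrna.
choose_pls_mode(50, 2)    # "regression"
```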

11.2.1 Outputs

Graphical outputs include a screegraph and a cumulative percentage plot. In regression mode (as defined above), only the information on the X variates is displayed.
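
To make the relationship between the two plots concrete, here is a minimal base R sketch. The percentages are invented for illustration and do not come from the datasets used in this chapter; how r_pls computes the explained variance is not shown here.

```r
# Invented percentages of variance explained by 5 variates (illustration only).
percent <- c(40, 25, 15, 10, 5)
names(percent) <- paste0("variate", seq_along(percent))

# The screegraph shows `percent`; the cumulative plot shows its running sum.
cumulative_percent <- cumsum(percent)

# Smallest number of variates reaching at least 80% of explained variance.
n_needed <- which(cumulative_percent >= 80)[1]
n_needed  # variate3, i.e. 3 variates
```

Reading the cumulative plot this way is one common heuristic for choosing ncp; other criteria may be preferable depending on the analysis.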

11.2.1.1 A screegraph

In canonical mode:

plotly::ggplotly(out_pls$Graphical$Screegraph)

In regression mode:

plotly::ggplotly(out_pls2$Graphical$Screegraph)

11.2.1.2 A cumulative percentage plot

In canonical mode:

plotly::ggplotly(out_pls$Graphical$CumulativePercentagePlot)

In regression mode:

plotly::ggplotly(out_pls2$Graphical$CumulativePercentagePlot)

11.2.1.3 Object name

jsonlite::toJSON(out_pls$Table, pretty = TRUE, auto_unbox = TRUE)

## {
##   "Mode": "canonical",
##   "ObjectName": "PLSobj_1"
## }

11.2.1.4 Overview

11.3 Display individuals

Individuals can be displayed on the PLS variates using the function r_plotplsind, whose parameters are below. It outputs one graphical element.

param	short desc	class	required	default	description
datasetName	name of a PLS object	character	required	none
axis1	variate index for the x axis	integer	optional	1
axis2	variate index for the y axis	integer	optional	2
datasetcolor	Dataset for color	character	optional	NULL	Name of the dataset containing the variable to use to colour the dots
varcolor	colour variable	character	optional	NULL	Name of the variable to use to colour the dots
datasetshape	Dataset for shape	character	optional	NULL	Name of the dataset containing the variable to use to assign shapes
varshape	shape variable	character	optional	NULL	Name of the variable to use to assign shapes.
datasetsize	Dataset for size	character	optional	NULL	Name of the dataset containing the variable to use to set dot sizes
varsize	size variable	character	optional	NULL	Name of the variable to use to set dot sizes.

In canonical mode, the plot displays, for each of the two selected axes, the average of the X-space and Y-space scores.

out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind$Graphical$PlotInd)

out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2,
                            datasetcolor = "clinical", 
                            varcolor = "patient.gender",
                            datasetsize = "clinical", 
                            varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
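
The averaging used in canonical mode can be sketched in base R as follows. The score matrices are toy values invented for illustration, and the element-wise mean is an assumption about what “average” means here, not the source of r_plotplsind.

```r
# Toy scores for 3 individuals on 2 variates (values invented for illustration).
ids <- list(c("s1", "s2", "s3"), c("variate1", "variate2"))
x_scores <- matrix(c(1, -2, 0.5,  0.2, 0.4, -1), ncol = 2, dimnames = ids)
y_scores <- matrix(c(3, -1, 1.5,  0.0, 0.6, -3), ncol = 2, dimnames = ids)

# Assumed averaging: element-wise mean of the X-space and Y-space scores.
avg_scores <- (x_scores + y_scores) / 2

# Coordinates that would be plotted for axis1 = 1 and axis2 = 2.
avg_scores[, c(1, 2)]
```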

In regression mode, the plot only displays X-space scores for the two selected axes. Note that by default, the dots are coloured by the first Y variable and the dot sizes are set by the second Y variable (if any).

out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind2$Graphical$PlotInd)

out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2,
                             datasetcolor = "clinical", 
                             varcolor = "patient.gender",
                             datasetsize = "clinical", 
                             varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind2$Graphical$PlotInd)

11.4 Display variables

The correlations between the input dataset features and the chosen variates can be displayed using the function r_plotplsvar, whose parameters are listed below. It outputs one graphical element.

param	short desc	class	required	default	description
datasetName	name of a PLS object	character	required	none
axis1	variate index for the x axis	integer	optional	1
axis2	variate index for the y axis	integer	optional	2
threshold.cor	correlation threshold	numerical	optional	0	Global correlation threshold (with the two axes together) to select variables to be displayed
mode	Overlap X- and Y-representation plots	character	optional	separated	If `together`, plots variables into one single graph. If `separated`, plots variables into two individual graphs

out_plotvar <- r_plotplsvar("PLSobj_1", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)

out_plotvar <- r_plotplsvar("PLSobj_1", mode = "separated", axis1 = 1, 
                            axis2 = 2, threshold.cor = 0.6)
plotly::ggplotly(out_plotvar$Graphical$PlotVar)

In regression mode, only the correlations in the X-space are represented.

out_plotvar <- r_plotplsvar("PLSobj_2", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
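
One possible reading of the “global” threshold.cor is that a variable is kept when the norm of its correlations with the two selected variates reaches the threshold. The base R sketch below illustrates that reading on toy data; both the data and the norm-based criterion are assumptions, and r_plotplsvar may apply a different rule.

```r
# Toy variates and features for 5 individuals (values invented for illustration).
variates <- cbind(variate1 = c(1, 2, 3, 4, 5),
                  variate2 = c(1, -1, 0, -1, 1))
features <- data.frame(g1 = c(2, 4, 6, 8, 10),
                       g2 = c(3, -3, 0, -3, 3),
                       g3 = c(1, 0, 2, 1, 1))

# Correlation of each feature (rows) with each variate (columns).
cors <- cor(features, variates)

# Assumed "global" correlation: Euclidean norm over the two axes.
global_cor <- sqrt(rowSums(cors^2))

# Keep features whose global correlation reaches the threshold.
threshold_cor <- 0.6
kept <- names(global_cor)[global_cor >= threshold_cor]
kept  # "g1" "g2"
```

Here g3 is dropped because its global correlation is about 0.42, below the threshold of 0.6.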

11.5 Extract new datasets

New datasets can be extracted and added to the user’s working space using the function r_extract_pls, whose parameters are listed below. The user can either create a dataset containing a set of components from each space X and Y, or a set of variables from both input datasets, selected using a minimum correlation threshold.

param	short desc	class	required	default	description
datasetName	name of a PLS object	character	required	none
userName	Name of the object on the interface	character	optional	NULL	Cannot be a name already given to another object.
criterion	Criterion to perform extraction	character	optional	“axes”	Criterion to extract a new data.frame. Either the first variates are extracted (“axes”) or the variables that are the most correlated with the first axes (“correlation”).
ncp	Number of variates to consider	integer	required	none	Number of variates to consider either for a direct extraction or to compute correlations with the original variables.
threshold.cor	correlation threshold	numerical	optional	0	Global correlation threshold (with several axes together) to select variables to be extracted.

The function outputs the extracted data.frame and a Table element containing the object’s name and user name. If the dataset has already been extracted, the function only modifies the user name and does not return the object.

Axes

out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "axes", ncp = 3)

Correlation

out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "correlation",
                       ncp = 3, threshold.cor = 0.5)

Note that in regression mode, only the X-space is taken into account when extracting variates or computing correlations. In this case, Y-space variates or variables are not extracted.
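
The difference between the two modes for the “axes” criterion can be sketched in base R. The helper extract_axes, the column prefixes and the toy matrices are all hypothetical; this only illustrates the behaviour described above and is not the source of r_extract_pls.

```r
# Toy variate matrices for 3 individuals and 4 variates (invented values).
ids <- list(paste0("s", 1:3), paste0("comp", 1:4))
x_variates <- matrix(1:12, nrow = 3, dimnames = ids)
y_variates <- matrix(101:112, nrow = 3, dimnames = ids)

# Hypothetical helper: keep the first `ncp` variates of the X-space and,
# when Y-space variates are supplied, of the Y-space as well.
extract_axes <- function(x_variates, y_variates = NULL, ncp) {
  out <- x_variates[, seq_len(ncp), drop = FALSE]
  colnames(out) <- paste0("X_", colnames(out))
  if (!is.null(y_variates)) {
    y_part <- y_variates[, seq_len(ncp), drop = FALSE]
    colnames(y_part) <- paste0("Y_", colnames(y_part))
    out <- cbind(out, y_part)
  }
  as.data.frame(out)
}

# Canonical-like case: variates from both spaces, 6 columns.
canonical_like <- extract_axes(x_variates, y_variates, ncp = 3)

# Regression-like case: X-space only, 3 columns.
regression_like <- extract_axes(x_variates, ncp = 3)
```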