Chapter 11 PLS: Partial Least Squares
11.1 Introduction
The purpose of this document is to describe a step-by-step Partial Least Squares (PLS) workflow.
11.1.1 Used datasets
For illustrative purposes, we load two biological datasets, mrna and proteins. Note that this method does not apply to datasets containing categorical variables. We also load clinical to help interpret the results. Finally, we create a small dataset, qmrna, containing only two variables, since PLS works differently in this case.
input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)
input <- "../forge/backend/R/data/protein.csv"
r_import(input, data.name = "proteins", sep = " ", row.names = 1, assign = TRUE)
input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)
qmrna <- mrna[, 1:2]
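Since the method does not apply to datasets with categorical variables, it can be useful to check a dataset before going further. The following is a minimal sketch in base R; the helper name all_numeric_columns and the toy data frames are hypothetical and are not part of the workflow's functions.

```r
# Hypothetical helper (not part of the documented workflow):
# returns TRUE when every column of a data.frame is numeric.
all_numeric_columns <- function(df) {
  all(vapply(df, is.numeric, logical(1)))
}

# Toy data for illustration only.
toy_ok  <- data.frame(gene1 = c(0.1, 0.5), gene2 = c(1.2, 0.7))
toy_bad <- data.frame(gene1 = c(0.1, 0.5), group = c("a", "b"))

all_numeric_columns(toy_ok)   # TRUE
all_numeric_columns(toy_bad)  # FALSE
```

The same check could be applied to mrna, proteins and qmrna once they are loaded.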
11.1.2 Combined datasets
Prior to the PLS analysis, the datasets must be combined into single objects using the function r_combine_datasets, which is documented here.
out_comb <- r_wrapp("r_combine_datasets", list("proteins", "mrna"))
out_comb2 <- r_wrapp("r_combine_datasets", list("proteins", "qmrna"))
11.1.3 Required library
The main library loaded (mixOmics) is part of the Bioconductor repository. This package was designed to provide “a wide range of multivariate methods for the exploration and integration of biological datasets with a particular focus on variable selection” as explained here.
11.2 Perform PLS
First, the r_pls function is called to run the PLS by passing the following argument:
- ncp: Number of variates kept in the final result. The default value is based on the dimensions of the datasets and never exceeds 20.
Then, it returns:
- Table: a table with the name of the output and the mode parameter from mixOmics::pls (“canonical” if both initial datasets have at least 3 variables each, else “regression”). More information on these modes can be found here.
- Graphical: a screegraph and cumulative percentage plot.
- Object: the PLS object.
out_pls <- r_pls("combinedDF_1", ncp = 5)
out_pls2 <- r_pls("combinedDF_2", ncp = 5)
out_pls_wrapped <- r_wrapp("r_pls", "combinedDF_1", ncp = 5)
out_pls_wrapped2 <- r_wrapp("r_pls", "combinedDF_2", ncp = 5)
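The rule used to pick the mode, as stated in the description of the Table output, can be written in a few lines. This is only a sketch of the stated rule, not the actual code of r_pls; the helper name choose_pls_mode is hypothetical.

```r
# Hypothetical helper mirroring the rule stated in the text:
# "canonical" if both datasets have at least 3 variables, else "regression".
choose_pls_mode <- function(n_vars_x, n_vars_y) {
  if (n_vars_x >= 3 && n_vars_y >= 3) "canonical" else "regression"
}

choose_pls_mode(10, 8)  # "canonical"
choose_pls_mode(10, 2)  # "regression"
```

Under this rule, combining proteins with the two-variable qmrna dataset leads to regression mode.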
11.2.1 Outputs
Graphical outputs include a screegraph and a cumulative percentage plot. In regression mode (as defined above), only the information on the X variates is displayed.
11.2.1.1 A screegraph
In canonical mode:
plotly::ggplotly(out_pls$Graphical$Screegraph)
In regression mode:
plotly::ggplotly(out_pls2$Graphical$Screegraph)
11.2.1.2 A cumulative percentage plot
In canonical mode:
plotly::ggplotly(out_pls$Graphical$CumulativePercentagePlot)
In regression mode:
plotly::ggplotly(out_pls2$Graphical$CumulativePercentagePlot)
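To make the link between the two plots concrete, here is a minimal sketch of how a cumulative percentage is derived from per-variate values. It assumes the screegraph reports a proportion of explained variance per variate, which the text does not state explicitly; the numbers are made up for illustration and are not results from the datasets above.

```r
# Made-up proportions of explained variance for 5 variates (illustration only).
explained <- c(0.40, 0.25, 0.15, 0.10, 0.05)

# Cumulative percentage: running sum of the proportions, expressed in percent.
cumulative_pct <- 100 * cumsum(explained)
cumulative_pct
```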
11.3 Display individuals
Individuals can be displayed on the PLS variates using the function r_plotplsind, whose parameters are listed below. It outputs one graphical element.
param | short desc | class | required | default | description |
---|---|---|---|---|---|
datasetName | name of a PLS object | character | required | none | |
axis1 | variate index for the x axis | integer | optional | 1 | |
axis2 | variate index for the y axis | integer | optional | 2 | |
datasetcolor | Dataset for color | character | optional | NULL | Name of the dataset containing the variable to use to colour the dots |
varcolor | colour variable | character | optional | NULL | Name of the variable to use to colour the dots |
datasetshape | Dataset for shape | character | optional | NULL | Name of the dataset containing the variable to use to assign shapes |
varshape | shape variable | character | optional | NULL | Name of the variable to use to assign shapes. |
datasetsize | Dataset for size | character | optional | NULL | Name of the dataset containing the variable to use to set dot sizes |
varsize | size variable | character | optional | NULL | Name of the variable to use to set dot sizes. |
In canonical mode, the plot displays an average between both X-space scores and Y-space scores for the two selected axes.
out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind$Graphical$PlotInd)
out_plotind <- r_plotplsind("PLSobj_1", axis1 = 1, axis2 = 2,
                            datasetcolor = "clinical",
                            varcolor = "patient.gender",
                            datasetsize = "clinical",
                            varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind$Graphical$PlotInd)
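The averaging of X-space and Y-space scores used in canonical mode can be illustrated with base R. The sketch below is an assumption-laden illustration: it obtains two pairs of variates from a singular value decomposition of the cross-product matrix (a PLS-SVD variant), whereas mixOmics::pls uses an iterative algorithm with deflation, so the values would differ from those of the PLS object. The simulated data and all variable names are hypothetical.

```r
set.seed(1)
n <- 20
# Simulated, centred and scaled datasets (illustration only).
X <- scale(matrix(rnorm(n * 5), n, 5))
Y <- scale(matrix(rnorm(n * 4), n, 4))

# ASSUMPTION: loadings taken from the SVD of t(X) %*% Y (PLS-SVD variant),
# not from the deflation-based algorithm used by mixOmics.
sv <- svd(crossprod(X, Y), nu = 2, nv = 2)

scores_x <- X %*% sv$u   # X-space scores on two axes
scores_y <- Y %*% sv$v   # Y-space scores on two axes

# Coordinates plotted in canonical mode: average of both spaces.
scores_avg <- (scores_x + scores_y) / 2
head(scores_avg)
```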
In regression mode, the plot only displays X-space scores for the two selected axes. Note that by default, the dots are coloured by the first Y variable and the dot sizes are set by the second Y variable (if any).
out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2)
plotly::ggplotly(out_plotind2$Graphical$PlotInd)
out_plotind2 <- r_plotplsind("PLSobj_2", axis1 = 1, axis2 = 2,
                             datasetcolor = "clinical",
                             varcolor = "patient.gender",
                             datasetsize = "clinical",
                             varsize = "patient.age_at_initial_pathologic_diagnosis")
plotly::ggplotly(out_plotind2$Graphical$PlotInd)
11.4 Display variables
The correlations between the input dataset features and the chosen variates can be displayed using the function r_plotplsvar, whose parameters are listed below. It outputs one graphical element.
param | short desc | class | required | default | description |
---|---|---|---|---|---|
datasetName | name of a PLS object | character | required | none | |
axis1 | variate index for the x axis | integer | optional | 1 | |
axis2 | variate index for the y axis | integer | optional | 2 | |
threshold.cor | correlation threshold | numerical | optional | 0 | Global correlation threshold (with the two axes together) to select variables to be displayed |
mode | Overlap X- and Y-representation plots | character | optional | separated | If together, plots the variables in a single graph. If separated, plots the variables in two individual graphs |
out_plotvar <- r_plotplsvar("PLSobj_1", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
out_plotvar <- r_plotplsvar("PLSobj_1", mode = "separated", axis1 = 1,
                            axis2 = 2, threshold.cor = 0.6)
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
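The quantities behind this plot, and the effect of threshold.cor, can be sketched in base R. The exact definition of the "global" correlation is not given in the text; the sketch assumes it is the Euclidean norm of the two correlations (the distance from the origin on a correlation circle), which may differ from what r_plotplsvar actually computes. The simulated data and stand-in scores are hypothetical.

```r
set.seed(42)
n <- 30
X <- matrix(rnorm(n * 6), n, 6, dimnames = list(NULL, paste0("var", 1:6)))

# Stand-in scores for two variates; in practice these come from the PLS object.
scores <- X[, 1:2] + matrix(rnorm(n * 2, sd = 0.3), n, 2)

# Correlation of each variable with each of the two variates (6 x 2 matrix).
cors <- cor(X, scores)

# ASSUMPTION: "global" correlation = Euclidean norm over the two axes.
global_cor <- sqrt(rowSums(cors^2))

# Keep only the variables at or above the threshold.
threshold <- 0.6
kept <- names(global_cor)[global_cor >= threshold]
kept
```

With threshold set to 0, every variable is kept, which matches the default behaviour described in the table.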
In regression mode, only the correlations in the X-space are represented.
out_plotvar <- r_plotplsvar("PLSobj_2", axis1 = 1, axis2 = 2, mode = "together")
plotly::ggplotly(out_plotvar$Graphical$PlotVar)
11.5 Extract new datasets
New datasets can be extracted and added to the user’s working space using the function r_extract_pls, whose parameters are listed below. The user can either create a dataset containing a set of components from each space X and Y, or a set of variables from both input datasets, selected according to a minimum correlation threshold.
param | short desc | class | required | default | description |
---|---|---|---|---|---|
datasetName | name of a PLS object | character | required | none | |
userName | Name of the object on the interface | character | optional | NULL | Cannot be a name already given to another object. |
criterion | Criterion to perform extraction | character | optional | “axes” | Criterion to extract a new data.frame. Either the first variates are extracted (“axes”) or the variables that are the most correlated with the first axes (“correlation”). |
ncp | Number of variates to consider | integer | required | none | Number of variates to consider either for a direct extraction or to compute correlations with the original variables. |
threshold.cor | correlation threshold | numerical | optional | 0 | Global correlation threshold (with several axes together) to select variables to be extracted. |
The function outputs the extracted data.frame and a Table element containing the object’s name and user name. If the dataset has already been extracted, the function only modifies the user name and does not return the object.
- Axes
out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "axes", ncp = 3)
- Correlation
out_extract <- r_wrapp("r_extract_pls", "PLSobj_1", criterion = "correlation",
                       ncp = 3, threshold.cor = 0.5)
Note that in regression mode, only the X-space is taken into account when extracting variates or computing correlations. In this case, Y-space variates or variables are not extracted.