Chapter 3 Workspace handling

This chapter explains how objects are managed and named in the (global) R environment. It starts by running a few analyses that create the datasets used to illustrate workspace management (a data.frame describing the content of the workspace and its associated DAG).

## PCA
out_import1 <- r_wrapp("r_import", 
                       input = "../forge/backend/R/data/protein.csv",
                       data.name = "proteins",
                       nature = "protein quantification", header = TRUE,
                       row.names = 1, sep = " ", normalized = "yes")
out_import2 <- r_wrapp("r_import", 
                       input = "../forge/backend/R/data/clinical.csv",
                       data.name = "clinical", nature = "phenotypes",
                       header = TRUE, row.names = 1, sep = ",")
out_pca <- r_wrapp("r_famd", "proteins", ncp = 10, pca=TRUE)
out_extract <- r_wrapp("r_extract_obj", "PCAobj_1", userName = "myPCs", 
                       criterion = "axes", ncp = 5)
out_extract <- r_wrapp("r_extract_obj", "PCAobj_1", userName = "myselVars", 
                       criterion = "correlation", ncp = 2, threshold.cor = 0.7)

## Clustering
### HAC
out_clustering <- r_wrapp("r_clustering", "proteins", "hac")
out_cutclustering <- r_wrapp("r_cut_clustering", "HACClustering_1", k = 5)

out_extractclustering <- r_wrapp("r_extract_dataset", "HACClusters_1",
                                 "myclusters")

### k-means
out_clustering <- r_wrapp("r_clustering", "proteins", method = "kmeans",
                          kmin = 3, kmax = 6, seed = 273)
out_cutclustering <- r_wrapp("r_cut_clustering", "KmeansCut_3")

out_extractclustering <- r_wrapp("r_extract_dataset", "KmeansClusters_1",
                                 "myaltclusters")

out_clustering <- r_wrapp("r_clustering", "proteins", method = "kmeans",
                          kmin = 6, kmax = 7, seed = 273)

## Combined dataset
out_import3 <- r_wrapp("r_import", input = "../forge/backend/R/data/mrna.csv",
                       data.name = "mrna", nature = "gene expression",
                       header = TRUE, row.names = 1, logt = "yes")
out_combined <- r_wrapp("r_combine_datasets",
                        datasetNames = list("proteins", "mrna"))

## MFA
out_mfa <- r_wrapp("r_mfa", datasetName = "combinedDF_1")

3.1 The why and how of workspace management

Because of constraints on the communication between R and the server that manages the web interface, functions can only take integers, character strings or simple lists as inputs. Hence, data.frames cannot be passed directly. To handle this problem, we have decided to:

pass all objects by name instead of passing the objects themselves (the function then retrieves the object, stored in the global R environment, using get);
return two kinds of outputs in a list made of three elements (Object, Graphical and Table): Graphical and Table elements are never stored in the global R environment but are converted to JSON and sent to the interface, whereas Object elements are never sent to the interface but are stored in the global environment and documented in a dedicated data.frame called object_db;
make the R objects available in the global environment usable by the interface through two policies. The first, used when one function takes the output of another as input, is to export the names of the elements of Object as stored in the global R environment. These names are given as an entry ObjectName in Table. When relevant, a UserName, defined by the user, is also returned in the same way. The second policy is to regularly send information about the different objects to the interface in the form of a DAG (Directed Acyclic Graph) in JSON format.

These technical constraints are detailed below.
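The pass-by-name convention and the three-element output can be illustrated with a minimal sketch. It assumes nothing about the real ASTERICS functions: `workspace`, `summarise_by_name` and the name `means_1` are hypothetical, and a private environment stands in for the global R environment.

```r
# Hypothetical sketch: none of these names belong to the ASTERICS code base.
# A private environment stands in for the global R environment.
workspace <- new.env()
assign("proteins", data.frame(p1 = c(1.2, 3.4), p2 = c(0.5, 0.7)),
       envir = workspace)

# The function receives the *name* of the dataset (a character string),
# retrieves the object with get(), and returns the three-part list.
summarise_by_name <- function(dataset_name, envir = workspace) {
  df <- get(dataset_name, envir = envir)
  means <- colMeans(df)
  list(
    Object    = list(means = means),  # kept on the R side, never sent
    Graphical = list(),               # would be converted to JSON
    Table     = list(Means = as.list(means),
                     ObjectName = "means_1")  # illustrative name only
  )
}

out <- summarise_by_name("proteins")
names(out)
```

Only character strings cross the boundary in this sketch: the caller sends `"proteins"` and gets back, in `Table$ObjectName`, the name under which the result could be retrieved later.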

3.2 Object management

All objects (datasets or outputs of the analyses that are worth saving) are managed using the same process illustrated below:

Overview of workspace management

More precisely, all functions are called by the interface through the central function r_wrapp. Each function returns a list composed of up to three elements (none of which is mandatory): Table and Graphical (converted to JSON and returned by r_wrapp to the interface) and Object (never returned by r_wrapp). Every element of the Object list is an R object meant to be stored for future use. To do so, r_wrapp runs the function r_create_object on each element of the Object list, which performs the following operations:

the object is created in the global environment
the information about the object is stored in a global data.frame called object_db
the corresponding DAG is extracted and stored in a global variable called graph_db
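The three operations above can be sketched as follows. This is only an illustration under simplifying assumptions, not the actual r_create_object: `ws` stands in for the global environment, each object is assumed to have a single parent, and the graph is reduced to a parent/child table.

```r
# Hypothetical sketch; 'ws' stands in for the global environment.
ws <- new.env()
assign("object_db", list(), envir = ws)
assign("cur_id", 1L, envir = ws)

create_object_sketch <- function(object, envir = ws) {
  object_name <- attr(object, "object_name")
  parent <- attr(object, "parent_edge")   # assumed to be a single string
  # 1. create the object in the environment
  assign(object_name, object, envir = envir)
  # 2. record its description in object_db and increment the identifier
  db <- get("object_db", envir = envir)
  id <- get("cur_id", envir = envir)
  db[[object_name]] <- list(id = id, object_name = object_name,
                            parent_edge = parent)
  assign("object_db", db, envir = envir)
  assign("cur_id", id + 1L, envir = envir)
  # 3. rebuild the graph from object_db (here: a parent/child table)
  parents <- vapply(db, function(entry) entry$parent_edge, character(1))
  keep <- parents != "none"
  graph <- data.frame(parent = unname(parents[keep]),
                      child = names(db)[keep],
                      stringsAsFactors = FALSE)
  assign("graph_db", graph, envir = envir)
  invisible(object_name)
}

proteins <- structure(data.frame(x = 1:3),
                      object_name = "proteins", parent_edge = "none")
pca <- structure(list(eig = c(2, 1)),
                 object_name = "PCAobj_1", parent_edge = "proteins")
create_object_sketch(proteins)
create_object_sketch(pca)
get("graph_db", envir = ws)
```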

These different steps are described in the next sections: Section “Object creation” describes the naming convention for the objects and how they are created in the global environment, Section “Object information” describes the way the objects are described in object_db and Section “DAG generation” describes how the DAG is extracted from this information.

Retro-compatibility of objects is ensured using the following workflow:

* each time a new object is created, all objects listed in object_db are saved in a file rsession.Rdata, which also contains the global variables object_db, cur_id, and (if it exists) num_obj;
* each time a session is started, rsession.Rdata is loaded and the workspace is scanned and corrected with the function r_upgrade_session.
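The save and reload cycle can be sketched with base R `save()` and `load()`. The helper `save_session` and the use of a temporary directory are assumptions made for the example; the correction step performed by r_upgrade_session is not reproduced here.

```r
# Hypothetical sketch; file location and object contents are illustrative.
ws <- new.env()
assign("proteins", data.frame(x = 1:3), envir = ws)
assign("object_db",
       list(proteins = list(id = 1L, object_name = "proteins")),
       envir = ws)
assign("cur_id", 2L, envir = ws)

session_file <- file.path(tempdir(), "rsession.Rdata")

# Save every object listed in object_db plus the bookkeeping variables.
save_session <- function(envir, file) {
  db <- get("object_db", envir = envir)
  to_save <- c(names(db), "object_db", "cur_id")
  if (exists("num_obj", envir = envir, inherits = FALSE)) {
    to_save <- c(to_save, "num_obj")
  }
  save(list = to_save, envir = envir, file = file)
}

save_session(ws, session_file)

# Starting a new session: load the file into a fresh environment.
new_ws <- new.env()
loaded <- load(session_file, envir = new_ws)
sort(loaded)
```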

3.3 Object creation

Objects are all named using an identical scheme, which is handled by the function r_setattr_objects, called with

r_setattr_objects(object, object.name, user.name = NULL, nature = NULL, 
                  type = c("analysis", "dataset", "hidden", "savedPlots"), func.args,
                  origin.dataset, parent.edge = NULL, groups = NULL, 
                  logt = NULL, normalized = NULL)

Within any analysis performed with a function r_function, the function r_setattr_objects is called to add attributes to all elements included in the Object list. The set attributes are:

object_name: except for objects imported by the user (which are named according to the user’s choice), objects are named base_NUM where base is a keyword indicating the type of object (similar or identical to nature) and NUM is a unique number. These names can be created with a call to the function r_make_name(base, add = 0), which relies on the global variable num_obj that stores the number of already existing objects with the same base:

num_obj

##            PCAobj           PCAaxes            PCAsel     HACClustering 
##                 1                 1                 1                 1 
##    HACCutClusters       HACClusters  KmeansClustering         KmeansCut 
##                 1                 1                 2                 5 
## KmeansCutClusters    KmeansClusters        combinedDF            MFAobj 
##                 1                 1                 1                 1

The call to this function

r_make_name("combinedDF")

## [1] "combinedDF_2"

creates a new name but does not update num_obj (which is only updated once this object is created by the function r_create_object).
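The naming rule can be mimicked with a few lines of R. This is a hypothetical re-implementation, not the actual r_make_name: the counters are passed explicitly instead of being read from the global num_obj, and `add` is assumed to be an offset added to the number.

```r
# Hypothetical re-implementation for illustration; the real r_make_name may
# differ. 'counters' plays the role of the global variable num_obj.
make_name_sketch <- function(base, counters, add = 0) {
  existing <- if (base %in% names(counters)) counters[[base]] else 0
  paste0(base, "_", existing + 1 + add)
}

num_obj_example <- c(PCAobj = 1, KmeansCut = 5, combinedDF = 1)
make_name_sketch("combinedDF", num_obj_example)
make_name_sketch("MFAobj", num_obj_example)
```

As in the text above, the counters are left untouched by the call: generating a name and registering the object are two separate steps.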

user_name: an optional name that corresponds to the name given to the object by the user. If not provided, it is automatically set to object.name. A check is systematically performed to ensure that this user.name is unique.

attr(proteins, "object_name")

## [1] "proteins"

attr(proteins, "user_name")

## [1] "proteins"

attr(PCAaxes_1, "object_name")

## [1] "PCAaxes_1"

attr(PCAaxes_1, "user_name")

## [1] "myPCs"

nature: nature can either (for datasets) describe the nature of the dataset, as specified by the user, or (mostly for analyses) indicate the nature of the R object. It is different from the object class (which is automatically stored in object_db: see Section “Object information”) and has to be thought of as a keyword defined by developers to identify the type of output and the analysis that created it. A function can produce objects of only one nature (e.g., r_pca outputs PCAobj objects) or of multiple natures depending on one of its arguments (e.g., r_extract_obj applied to a PCAobj object returns either an object of nature PCAaxes or a dataset with the same nature as the original dataset used as input of the PCA; r_cut_clustering simultaneously returns objects of nature KmeansClusters, a data.frame with the clusters, and of nature KmeansCutClusters, containing graphical explorations of these clusters).

attr(proteins, "nature")

## [1] "protein quantification"

attr(PCAaxes_1, "nature")

## [1] "PCAaxes"

attr(PCAsel_1, "nature")

## [1] "protein quantification"

If nature is not provided to the function, it is set to none, but this behavior is mostly meant for testing purposes and not for production.

type gives the object type, either "dataset", "analysis" (default value), "hidden" (for objects that are not meant to be displayed in the DAG) or "savedPlots" (for objects that are not meant to be displayed in the DAG but will appear in the analysis table).

attr(proteins, "type")

## [1] "dataset"

attr(PCAobj_1, "type")

## [1] "analysis"

attr(PCAsel_1, "type")

## [1] "dataset"

attr(KmeansClustering_1, "type")

## [1] "analysis"

attr(KmeansClusters_1, "type")

## [1] "dataset"

func_args gives the named list of input arguments passed to the function that created the object. This list is not necessarily the exact list passed by the user to the function: it is a canonical form (specific to each function) of the inputs, obtained after default or wrong values have been corrected by the function. There must be a one-to-one correspondence between objects and argument lists, so that an analysis that already exists is not replayed.
origin_dataset (formerly, parent.dataset) describes the imported or created dataset(s) on which the analysis is based. It is used to create titles in plots, for instance. If the analysis is an integrated analysis, based on multiple datasets, those are stored in a vector. If the object has been imported by the user, the attribute is set to none (which is also the default value when this argument is not passed to r_setattr_objects).

attr(proteins, "origin_dataset")

## [1] "none"

attr(PCAobj_1, "origin_dataset")

## [1] "proteins"

attr(PCAsel_1, "origin_dataset")

## [1] "proteins"

attr(KmeansClustering_1, "origin_dataset")

## [1] "proteins"

parent_edge is the attribute used to define the edges in the DAG extracted from the workspace. By default, it is set to the same value as origin_dataset, but it can differ from this value when something else is more intuitive from the user’s point of view. This argument can be either a vector (when multiple parents are simultaneously required, as in integrative analyses) or a list (when the object can derive from one parent or the other, as in $k$-means clustering).

attr(proteins, "parent_edge")

## [1] "none"

attr(PCAobj_1, "parent_edge")

## [1] "proteins"

attr(PCAsel_1, "parent_edge")

## [1] "PCAobj_1"

attr(KmeansCut_4, "parent_edge")

## [[1]]
## [1] "KmeansClustering_1"
## 
## [[2]]
## [1] "KmeansClustering_2"

attr(combinedDF_1, "parent_edge")

## [1] "mrna"     "proteins"
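The difference between a vector and a list of parents can be made explicit when turning parent_edge into edges. The helper below is a hypothetical sketch: the function name and the `kind` column (with values "required" and "alternative") are assumptions introduced for illustration and are not part of the ASTERICS code.

```r
# Hypothetical helper: expands a parent_edge attribute into parent/child
# pairs. A list means "derives from one parent or the other", a vector
# means "all parents are required together".
edges_from_parent <- function(child, parent_edge) {
  kind <- if (is.list(parent_edge)) "alternative" else "required"
  parents <- as.character(unlist(parent_edge))
  parents <- parents[parents != "none"]   # imported datasets have no parent
  data.frame(parent = parents,
             child = rep(child, length(parents)),
             kind = rep(kind, length(parents)),
             stringsAsFactors = FALSE)
}

edges_from_parent("KmeansCut_4",
                  list("KmeansClustering_1", "KmeansClustering_2"))
edges_from_parent("combinedDF_1", c("mrna", "proteins"))
edges_from_parent("proteins", "none")
```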

groups (only used for MFA objects) contains the names of the original datasets used in this analysis (used for plots):

attr(MFAobj_1, "groups")

## [1] "mrna"     "proteins"

attr(proteins, "groups")

## NULL

normalized and logt are attributes indicating whether the dataset is already normalized and/or log-transformed. For certain analyses, this information is inherited (extraction of selected variables from a PCA, for instance) and for others, the attribute is set to "nc" (not concerned):

attr(proteins, "normalized")

## [1] "yes"

attr(mrna, "normalized")

## [1] "no"

attr(MFAobj_1, "normalized")

## [1] "nc"

attr(proteins, "logt")

## [1] "no"

attr(mrna, "logt")

## [1] "yes"

attr(MFAobj_1, "logt")

## [1] "nc"

For objects created by a given function r_function, the attributes nature, type, func_args and origin_dataset (and sometimes parent_edge) are hard coded in this function. user_name can, in addition, be set if the object is to be used as a dataset later by the user.

Note that r_setattr_objects first checks whether an object with the same object_name already exists, in which case it stops the creation with an error. Updates of object attributes are managed directly in the analysis function (when necessary) with attr(object, ...) <- ... before returning the object (in which case r_create_object finds that the object already exists and updates object_db with its new attributes).
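The attribute-setting step described in this section can be sketched in base R. The function below is a simplified stand-in, not the actual r_setattr_objects: it handles only a subset of the attributes, takes the target environment as an explicit (hypothetical) `envir` argument, and its error message is illustrative.

```r
# Hypothetical sketch of attribute setting; not the actual r_setattr_objects.
setattr_sketch <- function(object, object.name, user.name = NULL,
                           nature = NULL,
                           type = c("analysis", "dataset", "hidden",
                                    "savedPlots"),
                           origin.dataset = "none", parent.edge = NULL,
                           envir = globalenv()) {
  type <- match.arg(type)
  # Refuse to create an object whose name is already taken.
  if (exists(object.name, envir = envir, inherits = FALSE)) {
    stop("an object named '", object.name, "' already exists")
  }
  attr(object, "object_name") <- object.name
  attr(object, "user_name") <- if (is.null(user.name)) object.name else user.name
  attr(object, "nature") <- if (is.null(nature)) "none" else nature
  attr(object, "type") <- type
  attr(object, "origin_dataset") <- origin.dataset
  # parent_edge defaults to origin_dataset when not given.
  attr(object, "parent_edge") <-
    if (is.null(parent.edge)) origin.dataset else parent.edge
  object
}

ws <- new.env()
axes <- setattr_sketch(data.frame(PC1 = c(0.1, -0.3)), "PCAaxes_1",
                       user.name = "myPCs", nature = "PCAaxes",
                       type = "dataset", origin.dataset = "proteins",
                       parent.edge = "PCAobj_1", envir = ws)
assign("PCAaxes_1", axes, envir = ws)

# A second creation under the same name is rejected.
second <- tryCatch(setattr_sketch(data.frame(), "PCAaxes_1", envir = ws),
                   error = function(e) e)
inherits(second, "error")
```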

3.4 Re-using an already performed analysis

If an analysis is called twice with the same arguments, the function does not recompute the output but simply reloads the stored results. This is performed by

first checking, within every analysis function, that the analysis has been (or not) performed already. This part is managed with the function r_call_datadb that searches for matching entries in object_db from the attributes passed in inputs (usually function name and argument list);
if the analysis already exists, the corresponding object is loaded within the function (with get on its object_name or on the name of its corresponding entry in object_db). Note that this object exactly corresponds to one of the elements of Object as exported from the first call of the same analysis;
results of the analysis as needed by the interface are thus reconstructed from this object. This means that the loaded Object must contain the information required to retrieve the Graphical and Table entries. The output of the function is thus created as a list containing only the objects that need to be converted to JSON (elements of Graphical and Table; no entry Object is returned in this case). If needed, Table$ObjectName is also created.

This means that the elements of Graphical and Table produced by the first run of the analysis are also copied into one of the elements of Object. Two approaches are used to perform that operation:
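The lookup step can be sketched as a search over the entries of object_db. The helper `find_existing` is hypothetical (it is not the actual r_call_datadb), and the argument names used in the example entries, such as `datasetName`, are assumptions.

```r
# Hypothetical sketch; find_existing() is not the actual r_call_datadb.
# It returns the object_name of a matching entry, or NULL when none exists.
find_existing <- function(db, func_name, func_args) {
  for (entry in db) {
    if (identical(entry$func_name, func_name) &&
        identical(entry$func_args, func_args)) {
      return(entry$object_name)
    }
  }
  NULL
}

db_example <- list(
  list(object_name = "KmeansClustering_1", func_name = "r_clustering",
       func_args = list(datasetName = "proteins", method = "kmeans",
                        kmin = 3, kmax = 6, seed = 273)),
  list(object_name = "KmeansClustering_2", func_name = "r_clustering",
       func_args = list(datasetName = "proteins", method = "kmeans",
                        kmin = 6, kmax = 7, seed = 273))
)

same_call <- list(datasetName = "proteins", method = "kmeans",
                  kmin = 6, kmax = 7, seed = 273)
find_existing(db_example, "r_clustering", same_call)
```

Because `identical()` is sensitive to the order and type of the list elements, a comparison of this kind only works if the argument list is stored in a canonical form, which is the role given to func_args in Section “Object creation”.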

in most cases, they are added as sub-objects $Graphical and $Table. For instance, for r_pca, Object contains only one element, $PCAObj (the results of the function FactoMineR::PCA), which is augmented with $PCAObj$Graphical (no Table is exported by this function). This object is then assigned to the global R environment by r_wrapp and contains enough information to rebuild the $Graphical and $Table outputs of the function:

names(PCAobj_1)

## [1] "eig"       "var"       "ind"       "svd"       "call"      "Graphical"
## [7] "Table"

when the Object element is a data.frame, the same approach would break the structure of the initial object (in ASTERICS, data.frames are lists of vectors). In this case, the data.frame is set to "hidden" until it is used for extraction (at which point it becomes a "dataset"), and the analysis object contains only the Graphical and Table elements. This is the case, for instance, for the output of the function r_cut_clustering:

attr(KmeansCutClusters_1, "type")

## [1] "analysis"

names(KmeansCutClusters_1)

## [1] "Table"     "Graphical"

attr(KmeansClusters_1, "type")

## [1] "dataset"

head(KmeansClusters_1)

##      KmeansClusters
## A0SH              1
## A0SJ              2
## A0SK              5
## A0SO              5
## A04N              4
## A04P              5

Finally, note that the elements $Table$ObjectName and $Table$UserName are the only ones that are not saved within one of the elements of Object.

3.5 Object information

Information about objects, in addition to attributes, is kept in a single list, also managed by r_create_object, which is updated each time an object is created (or removed by r_delete_object). This list is the object object_db from the global environment. It is updated together with the global integer cur_id, which is equal to the number of already created objects (if any) plus 1, and with the global vector num_obj described above.

The stored information is:

jsonview::json_tree_view(object_db[1:2], scroll = TRUE)

names(object_db[[1]])

##  [1] "id"                    "object_name"           "user_name"            
##  [4] "nature"                "origin_dataset"        "parent_edge"          
##  [7] "type"                  "dataset_info"          "func_name"            
## [10] "func_args"             "groups"                "logt"                 
## [13] "normalized"            "asterics_code_version"

where each element of the list has entries identical to the attributes of the object, except for id, dataset_info and func_name, which are new. id is a unique identifier, kept as is even when objects are deleted. dataset_info contains an entry class that gives the R classes of the object, separated by the sign / when an object has multiple classes; when the object is a data.frame, it also has an entry meta containing 7 values automatically set by r_create_datadesc (nrow, ncol, nbmissing, nbnum, nbcat, nblogic, nbothers). func_name contains the name of the function that was used to create the object.
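The content of dataset_info can be approximated with the sketch below. It is not the actual r_create_datadesc: the exact definitions of the seven counts are assumptions (for instance, nbmissing is taken here as the total number of missing cells, and character columns are counted as categorical).

```r
# Hypothetical sketch; the definitions of the counts are assumptions.
describe_dataset_sketch <- function(object) {
  info <- list(class = paste(class(object), collapse = "/"))
  if (is.data.frame(object)) {
    is_num   <- vapply(object, is.numeric, logical(1))
    is_cat   <- vapply(object,
                       function(col) is.factor(col) || is.character(col),
                       logical(1))
    is_logic <- vapply(object, is.logical, logical(1))
    info$meta <- list(
      nrow = nrow(object), ncol = ncol(object),
      nbmissing = sum(is.na(object)),
      nbnum = sum(is_num), nbcat = sum(is_cat), nblogic = sum(is_logic),
      nbothers = ncol(object) - sum(is_num) - sum(is_cat) - sum(is_logic)
    )
  }
  info
}

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"),
                 c = c(TRUE, FALSE, NA), stringsAsFactors = FALSE)
str(describe_dataset_sketch(df))

# An object with several classes and no meta entry:
describe_dataset_sketch(structure(list(), class = c("PCA", "list")))$class
```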

3.6 Saved plots (or analysis)

Analyses can be exported to HTML using the “Export report” button on the workspace screen. The user can add customisable plots to the HTML report: under each plot of this type, there is an “Add to report” button.

Nb. This is also the case for analyses that do not appear in the DAG:
- univariate analysis
- bivariate analysis
- multivariate analysis (up to 5 variables)
- heatmaps

In this case, there is a specific “Save for report” button.

When the user saves a plot by clicking “Save for report” or “Add to report”, the function r_save_for_report is called.

It adds the saved plot (function_name and func_args) to an R list, outplots_db. The names of outplots_db are the object names in object_db:

t <- r_wrapp("r_save_for_report", object_name = "PCAobj_1", 
                  function_name = "r_plotind", 
                  datasetName = "PCAobj_1", 
                  axis1 = 1, 
                  axis2 = 3)
t <- r_wrapp("r_save_for_report", object_name = NULL, 
                  function_name = "r_heatmap", 
                  dataset = "proteins",
                  dendoRows = FALSE,
                  dendoCols = FALSE)

jsonview::json_tree_view(outplots_db, scroll = TRUE)

For analyses without a node in the DAG, the function r_save_for_report creates an object in object_db.

These objects won’t appear in the DAG but will appear in the table of analyses.
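The bookkeeping of saved plots can be sketched as a list indexed by object name, each entry holding the plots saved for that object. The helper `save_plot_sketch` is hypothetical and covers only the case where the object already has an entry in object_db; the creation of a new entry for analyses without a DAG node is not reproduced.

```r
# Hypothetical sketch of the bookkeeping; not the actual r_save_for_report.
# Each saved plot is stored as its function name plus its arguments.
save_plot_sketch <- function(plots_db, object_name, function_name, ...) {
  entry <- list(function_name = function_name, func_args = list(...))
  plots_db[[object_name]] <- c(plots_db[[object_name]], list(entry))
  plots_db
}

plots <- list()
plots <- save_plot_sketch(plots, "PCAobj_1", "r_plotind",
                          datasetName = "PCAobj_1", axis1 = 1, axis2 = 3)
plots <- save_plot_sketch(plots, "PCAobj_1", "r_plotind",
                          datasetName = "PCAobj_1", axis1 = 1, axis2 = 2)
str(plots, max.level = 2)
```

Storing the function name and its arguments, rather than the rendered plot, is enough to replay the plot when the report is generated.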

3.7 DAG generation

From the object_db object described previously, a DAG is extracted that presents the same information in a graph format. The main function creating the graph is rutils/r_extract_wgraph. It is automatically called in r_wrapp to perform three operations:

after the object information dataset has been updated, it extracts the new version of the graph from it,
then, this object is used to update the global environment variable graph_db,
and it is also added to the output of the called function (as a new Table named Graph; the name Graph is thus forbidden for all other analyses) and thus passed to the server as a JavaScript dataset.

An illustration of the use of this function with the previous object_db dataset is:

out_dag <- r_extract_wgraph()

which is converted to JSON as a list of type Workflow

jsonview::json_tree_view(out_dag, scroll = TRUE)

This list includes a data field with nodes and edges. The edge field gives each edge in the form [parent, child], where parent and child refer to the field id of the node data. The node field is more complex and provides exhaustive information about the objects created by the different analyses. More precisely, each node contains:

generic and standardized information, including:
- id (the node unique number, corresponding to the entry id in object_db)
- object_name (the name of the object in the workspace, corresponding to the entry object_name in object_db)
- user_name (the name as known by the user, corresponding to the entry user_name in object_db)
- parent_dataset (the name(s) of the parent dataset(s), as given by the entry origin_dataset, formerly parent.dataset)
- type (the type of the node, which can be "analysis", "dataset" or "savedPlots", as given by the entry type)
custom information, depending on type and included in a field called meta
a (possibly empty) list, called savedPlots, containing the parameters of the additional plots saved by the user for this node. This information contains the function_name and func_args of all the plots saved in outplots_db.
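The extraction of nodes and edges from object_db can be sketched as follows. This is a simplified illustration, not the actual r_extract_wgraph: the exact layout of the returned list is an assumption, only a few node fields are kept, and "hidden" objects are simply dropped.

```r
# Hypothetical sketch; the real r_extract_wgraph output may be laid out
# differently. Edges are [parent id, child id] pairs.
extract_graph_sketch <- function(db) {
  visible <- Filter(function(e) e$type != "hidden", db)
  ids <- vapply(visible, function(e) e$id, integer(1))
  names(ids) <- vapply(visible, function(e) e$object_name, character(1))
  nodes <- lapply(unname(visible), function(e) {
    e[c("id", "object_name", "user_name", "type")]
  })
  edges <- list()
  for (e in visible) {
    for (p in unlist(e$parent_edge)) {
      # parents that are not visible nodes (e.g. "none") produce no edge
      if (p %in% names(ids)) {
        edges[[length(edges) + 1]] <- c(ids[[p]], e$id)
      }
    }
  }
  list(type = "Workflow", data = list(nodes = nodes, edges = edges))
}

db_example <- list(
  list(id = 1L, object_name = "proteins", user_name = "proteins",
       type = "dataset", parent_edge = "none"),
  list(id = 2L, object_name = "PCAobj_1", user_name = "PCAobj_1",
       type = "analysis", parent_edge = "proteins"),
  list(id = 3L, object_name = "HACClusters_1", user_name = "HACClusters_1",
       type = "hidden", parent_edge = "proteins")
)
g <- extract_graph_sketch(db_example)
str(g$data$edges)
```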

3.7.1 DAG information for datasets

For dataset nodes, the field meta is a list with the following information:

nature that corresponds to the entry nature of object_db
class that corresponds to the entry dataset_info$class of object_db
nrow, ncol, nbmissing, nbnum, nbcat, nblogic, and nbothers that all correspond to the entries in dataset_info$meta of the same name in object_db. This information is missing when the corresponding node is not of the class data.frame

3.7.2 DAG information for analyses

For analysis nodes, the field meta contains information that can be used to replay the exact same analysis. More precisely, it contains three fields:

nature that corresponds to the entry nature of object_db
func_name that corresponds to the entry func_name of object_db
func_args that corresponds to the entry func_args of object_db
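The two flavours of meta (for datasets and for analyses) can be sketched with a single helper that switches on the node type. The function `node_meta_sketch` and the example entries are hypothetical; only the field names come from the description above.

```r
# Hypothetical sketch: builds the 'meta' field of a node from an object_db
# entry, depending on its type.
node_meta_sketch <- function(entry) {
  if (entry$type == "dataset") {
    # dataset_info$meta is NULL for objects that are not data.frames, in
    # which case the seven counts are simply absent from the result.
    c(list(nature = entry$nature, class = entry$dataset_info$class),
      entry$dataset_info$meta)
  } else {
    entry[c("nature", "func_name", "func_args")]
  }
}

dataset_entry <- list(
  type = "dataset", nature = "protein quantification",
  dataset_info = list(class = "data.frame",
                      meta = list(nrow = 10L, ncol = 4L, nbmissing = 0L,
                                  nbnum = 4L, nbcat = 0L, nblogic = 0L,
                                  nbothers = 0L))
)
analysis_entry <- list(
  type = "analysis", nature = "PCAobj", func_name = "r_famd",
  func_args = list(ncp = 10), dataset_info = list(class = "PCA/list")
)

names(node_meta_sketch(dataset_entry))
names(node_meta_sketch(analysis_entry))
```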