Chapter 3 Workspace handling
This chapter explains how objects are managed and named in the (global) R
environment. It starts by performing a few analyses that create the
datasets associated with workspace management (a data.frame describing the
content of the workspace and its associated DAG).
## PCA
out_import1 <- r_wrapp("r_import",
                       input = "../forge/backend/R/data/protein.csv",
                       data.name = "proteins",
                       nature = "protein quantification", header = TRUE,
                       row.names = 1, sep = " ", normalized = "yes")
out_import2 <- r_wrapp("r_import",
                       input = "../forge/backend/R/data/clinical.csv",
                       data.name = "clinical", nature = "phenotypes",
                       header = TRUE, row.names = 1, sep = ",")
out_pca <- r_wrapp("r_famd", "proteins", ncp = 10, pca = TRUE)
out_extract <- r_wrapp("r_extract_obj", "PCAobj_1", userName = "myPCs",
                       criterion = "axes", ncp = 5)
out_extract <- r_wrapp("r_extract_obj", "PCAobj_1", userName = "myselVars",
                       criterion = "correlation", ncp = 2, threshold.cor = 0.7)
## Clustering
### HAC
out_clustering <- r_wrapp("r_clustering", "proteins", "hac")
out_cutclustering <- r_wrapp("r_cut_clustering", "HACClustering_1", k = 5)
out_extractclustering <- r_wrapp("r_extract_dataset", "HACClusters_1",
                                 "myclusters")
### k-means
out_clustering <- r_wrapp("r_clustering", "proteins", method = "kmeans",
                          kmin = 3, kmax = 6, seed = 273)
out_cutclustering <- r_wrapp("r_cut_clustering", "KmeansCut_3")
out_extractclustering <- r_wrapp("r_extract_dataset", "KmeansClusters_1",
                                 "myaltclusters")
out_clustering <- r_wrapp("r_clustering", "proteins", method = "kmeans",
                          kmin = 6, kmax = 7, seed = 273)
## Combined dataset
out_import3 <- r_wrapp("r_import", input = "../forge/backend/R/data/mrna.csv",
                       data.name = "mrna", nature = "gene expression",
                       header = TRUE, row.names = 1, logt = "yes")
out_combined <- r_wrapp("r_combine_datasets",
                        datasetNames = list("proteins", "mrna"))
## MFA
out_mfa <- r_wrapp("r_mfa", datasetName = "combinedDF_1")
3.1 The why and how of workspace management
Due to constraints on the communication between R and the server that
manages the web interface, only integers, characters or simple
lists can be used as inputs of our functions. Hence, a data.frame cannot be
passed directly. To handle that problem, we have decided to:
* pass all objects using their names instead of the objects themselves (the function then retrieves the object, stored in the global R environment, using get);
* output 2 types of variables in a complex list made of 3 elements (Object, Graphical and Table): Graphical and Table elements are never stored in the global R environment but are converted to JSON format and sent to the interface. Object elements are never sent to the interface but are stored in the global environment and documented in a dedicated data.frame called object_db;
* for the interface to be able to use R objects available in the global environment, two policies are used. The first, when one function uses the output of another as input, is to export the names of the elements of Object as stored in the global R environment. These names are given as an entry ObjectName in Table. When relevant, a UserName, defined by the user, is also returned in the same way. The second policy is to regularly send information about the different objects to the interface in the form of a DAG (Directed Acyclic Graph) in JSON format.
These technical constraints are detailed below.
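The name-based convention described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the ASTERICS implementation: the environment workspace (a stand-in for the global environment) and the function toy_summary are hypothetical names introduced for this example.

```r
# Hypothetical stand-in for the global R environment.
workspace <- new.env()
assign("proteins", data.frame(a = c(1, 2, 3), b = c(4, 5, 6)),
       envir = workspace)

# Hypothetical analysis: it receives the *name* of a dataset, never the
# dataset itself, and returns the three-element list described above.
toy_summary <- function(dataset_name, envir = workspace) {
  dataset <- get(dataset_name, envir = envir)  # retrieve the object by name
  means <- colMeans(dataset)
  list(
    Object    = list(toyMeans = means),         # would stay on the R side
    Graphical = list(barplot = unname(means)),  # would be converted to JSON
    Table     = list(means = as.list(means))    # would be converted to JSON
  )
}

out <- toy_summary("proteins")
names(out)
```

Only character strings cross the function boundary here, which is the point of the design: the caller never needs to serialize a data.frame.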
3.2 Object management
All objects (datasets or outputs of the analyses that are worth saving) are managed using the same process illustrated below:
More precisely, all functions are called by the interface using the central
function r_wrapp. The functions return a list composed of three elements (none
of which is mandatory): Table and Graphical (further converted to JSON and
returned by r_wrapp to the interface) and Object (never returned by
r_wrapp). Any element of the Object list is an R object meant to be
stored for future use. To do so, r_wrapp runs the function r_create_object
on each element of the Object list, which performs the following
operations:
* the object is created in the global environment;
* the information about the object is stored in a global data.frame called object_db;
* the corresponding DAG is extracted and stored in a global variable called graph_db.
These different steps are described in the next sections: Section “Object
creation” describes the naming convention for the objects and how they are
created in the global environment, Section “Object information” describes the
way the objects are described in object_db and Section “DAG generation”
describes how the DAG is extracted from this information.
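The first two operations of this process can be sketched as follows. This is a simplified illustration, not the code of r_wrapp or r_create_object: toy_wrapp, toy_create_object, toy_analysis and the environment workspace are hypothetical names, and the DAG extraction step is left out.

```r
# Hypothetical stand-in for the global environment and its bookkeeping.
workspace <- new.env()
workspace$object_db <- list()
workspace$cur_id <- 1L

# Store one object and record it (DAG extraction is not reproduced here).
toy_create_object <- function(object, name, envir = workspace) {
  assign(name, object, envir = envir)
  envir$object_db[[name]] <- list(
    id = envir$cur_id,
    object_name = name,
    class = paste(class(object), collapse = "/")
  )
  envir$cur_id <- envir$cur_id + 1L
  invisible(name)
}

# Call an analysis, store every element of Object, return only the rest.
toy_wrapp <- function(fun, ..., envir = workspace) {
  out <- fun(...)
  for (name in names(out$Object)) {
    toy_create_object(out$Object[[name]], name, envir = envir)
  }
  out[intersect(c("Table", "Graphical"), names(out))]
}

# Hypothetical analysis returning an Object and a Table, but no Graphical.
toy_analysis <- function(n) {
  list(Object = list(toy_1 = seq_len(n)),
       Table  = list(ObjectName = "toy_1"))
}

result <- toy_wrapp(toy_analysis, 3)
names(result)
```

The Object element never leaves the wrapper: the caller only sees what would be converted to JSON.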
Retro-compatibility of objects is ensured using the following workflow:
* each time a new object is created, all objects listed in object_db are saved
in a file rsession.Rdata, which also contains the global variables
object_db, cur_id, and (if it exists) num_obj;
* each time a session is started, the file rsession.Rdata is loaded and the
workspace is scanned and corrected with the function r_upgrade_session.
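The save and reload cycle can be sketched with the base R functions save and load. This is an illustration only: the environments workspace and restored are hypothetical stand-ins for the global environment of two successive sessions, the file is written to a temporary directory, and the correction step performed by r_upgrade_session is not reproduced.

```r
session_file <- file.path(tempdir(), "rsession.Rdata")

# Hypothetical workspace of a first session.
workspace <- new.env()
assign("proteins", data.frame(x = 1:3), envir = workspace)
assign("object_db",
       list(proteins = list(id = 1L, object_name = "proteins")),
       envir = workspace)
assign("cur_id", 2L, envir = workspace)

# Save every object listed in object_db plus the bookkeeping variables;
# num_obj is only saved if it exists in the workspace.
to_save <- intersect(
  c(names(workspace$object_db), "object_db", "cur_id", "num_obj"),
  ls(workspace)
)
save(list = to_save, envir = workspace, file = session_file)

# A later session starts by loading the file into its own workspace.
restored <- new.env()
loaded <- load(session_file, envir = restored)
sort(loaded)
```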
3.3 Object creation
Objects are all named using an identical scheme, which is handled by the
function r_setattr_objects, called with
r_setattr_objects(object, object.name, user.name = NULL, nature = NULL,
                  type = c("analysis", "dataset", "hidden", "savedPlots"), func.args,
                  parent.edge = NULL, groups = NULL,
                  origin.dataset, logt = NULL, normalized = NULL)
Within any analysis performed with a function r_function, the function
r_setattr_objects is called to add attributes to
all elements included in the Object list. The attributes set are:
* object_name: except for objects imported by the user (which are named according to the user’s choice), objects are named base_NUM where base is a keyword indicating the type of object (similar or identical to nature) and NUM is a unique number. These names can be created with a call to the function r_make_name(base, add = 0), which relies on the global variable num_obj that stores the number of already existing objects with the same base:
num_obj
##            PCAobj           PCAaxes            PCAsel     HACClustering
##                 1                 1                 1                 1
##    HACCutClusters       HACClusters  KmeansClustering         KmeansCut
##                 1                 1                 2                 5
## KmeansCutClusters    KmeansClusters        combinedDF            MFAobj
##                 1                 1                 1                 1
The call to this function
r_make_name("combinedDF")
## [1] "combinedDF_2"
creates a new name but does not update num_obj (which is only updated
once this object is created by the function r_create_object).
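The naming scheme can be sketched as follows. This is a simplified, hypothetical version (toy_make_name is not the ASTERICS function): it takes the counts as an explicit argument instead of reading a global variable, and it does not reproduce the add argument of r_make_name, whose exact behavior is not described here.

```r
# Counts of existing objects per base keyword (subset of the values above).
num_obj <- c(PCAobj = 1, combinedDF = 1, KmeansCut = 5)

# Hypothetical helper: builds the next name for a base keyword without
# modifying the counts.
toy_make_name <- function(base, counts) {
  existing <- if (base %in% names(counts)) counts[[base]] else 0
  paste0(base, "_", existing + 1)
}

toy_make_name("combinedDF", num_obj)  # "combinedDF_2"
toy_make_name("MFAobj", num_obj)      # "MFAobj_1" (base not yet used)
```

Because the helper never writes to the counts, calling it twice in a row returns the same name, which mirrors the fact that num_obj is only updated when the object is actually created.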
* user_name: an optional name that corresponds to the name given to the object by the user. If not provided, it is automatically set to object.name. A check is systematically performed to ensure that this user.name is unique.
attr(proteins, "object_name")
## [1] "proteins"
attr(proteins, "user_name")
## [1] "proteins"
attr(PCAaxes_1, "object_name")
## [1] "PCAaxes_1"
attr(PCAaxes_1, "user_name")
## [1] "myPCs"
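How these two naming attributes could be set is sketched below. This is an assumption-laden illustration, not r_setattr_objects itself: toy_setattr and its argument existing_user_names are hypothetical, and the real uniqueness check presumably looks into object_db rather than into a vector passed by the caller.

```r
# Hypothetical helper setting object_name and user_name on an object.
toy_setattr <- function(object, object.name, user.name = NULL,
                        existing_user_names = character(0)) {
  # Default: the user name falls back to the object name.
  if (is.null(user.name)) user.name <- object.name
  # Uniqueness check on the user name.
  if (user.name %in% existing_user_names) {
    stop("user name already in use: ", user.name)
  }
  attr(object, "object_name") <- object.name
  attr(object, "user_name") <- user.name
  object
}

axes <- toy_setattr(matrix(0, 2, 2), "PCAaxes_1", user.name = "myPCs")
attr(axes, "user_name")    # "myPCs"

imported <- toy_setattr(data.frame(x = 1), "proteins")
attr(imported, "user_name")  # "proteins"
```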
* nature: nature can either (for datasets) describe the nature of the dataset, as specified by the user, or (mostly for analyses) indicate the nature of the R object. It is different from the object class (which is automatically stored in object_db: see Section “Object information”) and has to be thought of as a keyword defined by developers to identify the type of output and the analysis that has created it. A function can produce objects of only one nature (e.g., r_pca outputs PCAobj objects) or of multiple natures depending on one of its arguments (e.g., r_extract_obj applied to a PCAobj object returns either an object of nature PCAaxes or a dataset with the same nature as the original dataset used as input of the PCA; r_cut_clustering simultaneously returns objects of nature KmeansClusters, a data.frame with the clusters, and of nature KmeansAddClusters, containing graphical explorations of these clusters).
attr(proteins, "nature")
## [1] "protein quantification"
attr(PCAaxes_1, "nature")
## [1] "PCAaxes"
attr(PCAsel_1, "nature")
## [1] "protein quantification"
If nature is not provided to the function, it is set to none, but this
behavior is mostly meant for testing purposes and not for production.
* type gives the object type, either "dataset", "analysis" (default value), "hidden" (for objects that are not meant to be displayed in the DAG) or "savedPlots" (for objects that are not meant to be displayed in the DAG but will appear in the analysis table).
attr(proteins, "type")
## [1] "dataset"
attr(PCAobj_1, "type")
## [1] "analysis"
attr(PCAsel_1, "type")
## [1] "dataset"
attr(KmeansClustering_1, "type")
## [1] "analysis"
attr(KmeansClusters_1, "type")
## [1] "dataset"
* func_args gives the named list of input arguments passed to the function that has created the object. This list is not necessarily the exact list that has been passed by the user to the function but is a unique way (depending on the function) to represent inputs after default or wrong values have been corrected by the function. There must be a bijection between objects and argument lists, so that analyses that already exist are not replayed.
* origin_dataset (formerly, parent.dataset) describes the imported or created dataset(s) on which the analysis is based. It is used to create titles in plots, for instance. If the analysis is an integrated analysis, based on multiple datasets, those are stored in a vector. If the object has been imported by the user, the attribute is set to none (which is also the default value when this argument is not passed to r_setattr_objects).
attr(proteins, "origin_dataset")
## [1] "none"
attr(PCAobj_1, "origin_dataset")
## [1] "proteins"
attr(PCAsel_1, "origin_dataset")
## [1] "proteins"
attr(KmeansClustering_1, "origin_dataset")
## [1] "proteins"
* parent_edge is the attribute used to define the edge in the DAG extracted from the workspace. By default, it is set to the same value as origin_dataset but can sometimes differ from this value when it is more intuitive to use something else from the user’s point of view. This argument can be either a vector (when multiple parents are simultaneously required as in integrative analyses) or a list (when the object can derive from one parent or the other as in \(k\)-means clustering).
attr(proteins, "parent_edge")
## [1] "none"
attr(PCAobj_1, "parent_edge")
## [1] "proteins"
attr(PCAsel_1, "parent_edge")
## [1] "PCAobj_1"
attr(KmeansCut_4, "parent_edge")
## [[1]]
## [1] "KmeansClustering_1"
##
## [[2]]
## [1] "KmeansClustering_2"
attr(combinedDF_1, "parent_edge")
## [1] "mrna" "proteins"
* groups (only used for MFA objects) contains the names of the original datasets used in this analysis (used for plots):
attr(MFAobj_1, "groups")
## [1] "mrna" "proteins"
attr(proteins, "groups")
## NULL
* normalized and logt are attributes indicating whether the dataset is already normalized and/or log-transformed. For certain analyses, this information is inherited (extraction of selected variables from a PCA, for instance) and for others, the attribute is set to "nc" (not concerned):
attr(proteins, "normalized")
## [1] "yes"
attr(mrna, "normalized")
## [1] "no"
attr(MFAobj_1, "normalized")
## [1] "nc"
attr(proteins, "logt")
## [1] "no"
attr(mrna, "logt")
## [1] "yes"
attr(MFAobj_1, "logt")
## [1] "nc"
For objects created by a given function r_function, the attributes nature,
type, func_args, origin_dataset (and sometimes parent_edge) are hard coded in
this function. user_name can, in addition, be set if
the object is to be used as a dataset later by the user.
Note that, when an object is passed to r_set_attr, the function first checks
whether an object with the same object_name already exists, in which case it
stops the creation with an error. Updates of object attributes are managed
directly in the analysis function (when necessary) with
attr(object, ...) <- ... and by returning the object (in
which case, r_create_object finds that the object already exists and updates
object_db with its new attributes).
3.4 Re-using an already performed analysis
If an analysis is called twice with the same arguments, the function does not recompute the output but simply reloads the stored results. This is performed by:
- first checking, within every analysis function, whether the analysis has already been performed. This part is managed by the function r_call_datadb, which searches for matching entries in object_db from the attributes passed as inputs (usually function name and argument list);
- if the analysis already exists, loading the corresponding object within the function (with get on its object_name or on the name of its corresponding entry in object_db). Note that this object exactly corresponds to one of the elements of Object as exported from the first call of the same analysis;
- re-constructing the results of the analysis needed by the interface from this object. This means that the loaded Object must contain the information needed to retrieve the Graphical and Table entries. The output of the function is thus created as a complex list with only the objects that need to be converted to JSON (elements of Graphical and Table; no Object entry is returned in this case). If needed, Table$ObjectName is also created.
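The lookup step can be sketched as follows. This is a hypothetical illustration and not the code of r_call_datadb: toy_find_analysis is an invented name, the entry shown in object_db uses made-up argument names, and matching is assumed to rely on an exact comparison of the function name and of the argument list (sorted by argument name).

```r
# Toy object_db with one recorded analysis (argument names are illustrative).
object_db <- list(
  list(object_name = "PCAobj_1", func_name = "r_pca",
       func_args = list(datasetName = "proteins", ncp = 10))
)

# Hypothetical lookup: returns the name of the stored object if the same
# function was already run with the same arguments, NULL otherwise.
toy_find_analysis <- function(db, func_name, func_args) {
  sort_args <- function(args) args[order(names(args))]
  for (entry in db) {
    same_fun <- identical(entry$func_name, func_name)
    same_args <- identical(sort_args(entry$func_args), sort_args(func_args))
    if (same_fun && same_args) return(entry$object_name)
  }
  NULL
}

# Same analysis, arguments given in a different order: found.
toy_find_analysis(object_db, "r_pca",
                  list(ncp = 10, datasetName = "proteins"))
# Different number of components: not found, the analysis must be run.
toy_find_analysis(object_db, "r_pca",
                  list(datasetName = "proteins", ncp = 5))
```

This kind of exact comparison only works if argument lists are normalized before being stored, which is why func_args is described as a unique representation of the inputs.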
This means that the elements in Graphical and Table in the first run of the
analysis are also copied into one of the elements of Object. Two approaches are
used to perform that operation:
- in most cases, they are added as sub-objects $Graphical and $Table. For instance, for r_pca, Object contains only one element, $PCAObj (results of the function FactoMineR::PCA), which is augmented with $PCAObj$Graphical (no Table is exported by this function). This object is further assigned to the global R environment by r_wrapp and indeed contains enough information to rebuild the $Graphical and $Table outputs of the analysis:
names(PCAobj_1)
## [1] "eig" "var" "ind" "svd" "call" "Graphical"
## [7] "Table"
- when the Object element is a data.frame, the same approach would break the structure of the initial object (in ASTERICS, data.frame objects are lists of vectors). In this case, this object is set to "hidden" until it is used for extraction (where it becomes a "dataset") and the analysis object contains only Graphical and Table elements. This is the case, for instance, of the output of the function r_cut_clustering:
attr(KmeansCutClusters_1, "type")
## [1] "analysis"
names(KmeansCutClusters_1)
## [1] "Table" "Graphical"
attr(KmeansClusters_1, "type")
## [1] "dataset"
head(KmeansClusters_1)
## KmeansClusters
## A0SH 1
## A0SJ 2
## A0SK 5
## A0SO 5
## A04N 4
## A04P 5
Finally, note that the elements $Table$ObjectName and $Table$UserName are
the only ones that are not saved within one of the elements of Object.
3.5 Object information
Information about objects, in addition to attributes, is kept in a single
list, also managed by r_create_object, that is updated each time an object is
created (or removed by r_delete_object). This list is the object object_db
from the global environment and it is updated together with the global integer
cur_id, which is equal to the number of already created objects (if any) plus 1,
and with the global vector num_obj, already described above.
The information kept is:
jsonview::json_tree_view(object_db[1-2], scroll = TRUE)
names(object_db[[1]])
## [1] "id" "object_name" "user_name"
## [4] "nature" "origin_dataset" "parent_edge"
## [7] "type" "dataset_info" "func_name"
## [10] "func_args" "groups" "logt"
## [13] "normalized" "asterics_code_version"
where each element of the list has entries identical to the attributes of the
object, except for id, dataset_info and func_name, which are new entries.
The first is a unique identifier, kept as is, even when objects are deleted.
dataset_info contains an entry class that gives the R classes of the
object, separated with the sign / when an object has multiple classes, and,
when the object is a data.frame, an entry meta containing 7 values
automatically set by r_create_datadesc (nrow, ncol, nbmissing, nbnum,
nbcat, nblogic, nbothers). The last entry (func_name) contains the name
of the function that has been used to create the object.
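The content of dataset_info can be sketched as follows. This is a hypothetical reconstruction, not the code of r_create_datadesc: toy_datadesc is an invented name, and the definitions used here are assumptions (nbmissing counts missing cells, nbcat counts factor and character columns, nbothers counts the remaining columns).

```r
# Hypothetical description of an object in the spirit of dataset_info.
toy_datadesc <- function(object) {
  info <- list(class = paste(class(object), collapse = "/"))
  if (is.data.frame(object)) {
    is_num <- vapply(object, is.numeric, logical(1))
    is_cat <- vapply(object,
                     function(col) is.factor(col) || is.character(col),
                     logical(1))
    is_log <- vapply(object, is.logical, logical(1))
    info$meta <- list(
      nrow = nrow(object),
      ncol = ncol(object),
      nbmissing = sum(is.na(object)),  # assumption: number of missing cells
      nbnum = sum(is_num),
      nbcat = sum(is_cat),
      nblogic = sum(is_log),
      nbothers = ncol(object) - sum(is_num) - sum(is_cat) - sum(is_log)
    )
  }
  info
}

toy <- data.frame(x = c(1.5, NA, 3),
                  g = c("a", "b", "a"),
                  ok = c(TRUE, FALSE, NA))
desc <- toy_datadesc(toy)
str(desc)

# Objects with several classes get them pasted together with "/".
toy_datadesc(structure(list(), class = c("first", "second")))$class
```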
3.6 Saved plots (or analysis)
The analyses can be exported to HTML using the “Export report” button on the workspace screen. The user can add customisable plots to the HTML reports: under this type of plot, there is an “Add to report” button.
Nb. This is also the case for analyses that do not appear in the DAG:
- univariate analysis
- bivariate analysis
- multivariate analysis (up to 5 variables)
- heatmaps
In this case, there is a specific “Save for report” button.
When the user saves a plot by clicking “Save for report” or “Add to report”,
the function r_save_for_report is called.
It adds the saved plot (function_name and func_args) to an R list, outplots_db.
The names of outplots_db are the object names in object_db:
t <- r_wrapp("r_save_for_report", object_name = "PCAobj_1",
             function_name = "r_plotind",
             datasetName = "PCAobj_1",
             axis1 = 1,
             axis2 = 3)
t <- r_wrapp("r_save_for_report", object_name = NULL,
             function_name = "r_heatmap",
             dataset = "proteins",
             dendoRows = FALSE,
             dendoCols = FALSE)
jsonview::json_tree_view(outplots_db, scroll = TRUE)
For analyses without a node in the DAG, the function r_save_for_report
creates an object in object_db.
These objects won’t appear in the DAG but will appear in the table of analyses.
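The structure of such a list of saved plots can be sketched as follows. This is only an illustration of a list indexed by object name, under the assumption that several plots can be saved for the same object: toy_save_for_report is a hypothetical helper, it returns the updated list instead of modifying a global variable, and it does not handle the case object_name = NULL (analyses without a node in the DAG).

```r
outplots_db <- list()

# Hypothetical helper: appends one saved plot (function name and arguments)
# to the entry of outplots_db named after the object.
toy_save_for_report <- function(db, object_name, function_name, ...) {
  plot_entry <- list(function_name = function_name, func_args = list(...))
  db[[object_name]] <- c(db[[object_name]], list(plot_entry))
  db
}

outplots_db <- toy_save_for_report(outplots_db, "PCAobj_1", "r_plotind",
                                   datasetName = "PCAobj_1",
                                   axis1 = 1, axis2 = 3)
outplots_db <- toy_save_for_report(outplots_db, "PCAobj_1", "r_plotind",
                                   datasetName = "PCAobj_1",
                                   axis1 = 2, axis2 = 3)

names(outplots_db)                 # one entry per object
length(outplots_db[["PCAobj_1"]])  # two plots saved for this object
```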
3.7 DAG generation
From the previously described object_db data frame, a DAG is extracted that
presents the exact same information in a graph format. The main function
creating the graph is rutils/r_extract_wgraph. It is automatically called
in r_wrapp to perform three operations:
* after the object information dataset has been updated, it extracts the new version of the graph from it;
* then, this object is used to update the global environment variable graph_db;
* finally, it is also added to the output of the called function (as a new Table named Graph; the name Graph is thus forbidden for all the other analyses) and thus passed to the server as a JavaScript dataset.
An illustration of the use of this function with the previous object_db
dataset is:
out_dag <- r_extract_wgraph()
which is converted to JSON as a list of type Workflow
jsonview::json_tree_view(out_dag, scroll = TRUE)
This list includes a data field with nodes and edges. The edges field gives
each edge in the form [parent, child], where parent and child refer to
the field id of the node data. The nodes field is more complex and provides
exhaustive information about the objects created by the different analyses. More
precisely, each node contains:
* generic and standardized information, including:
  * id (the node unique number, corresponding to the entry id in object_db)
  * object_name (the name of the object in the workspace, corresponding to the entry object_name in object_db)
  * user_name (the name as known by the user, corresponding to the entry user_name in object_db)
  * parent_dataset (the name(s) of the parent dataset(s), as given by the entry parent_dataset)
  * type (the type of the node, that can be "analysis", "dataset" or "savedPlots", as given by the entry type)
* custom information, depending on type and included in a field called meta;
* a (possibly empty) list, called savedPlots, containing the parameters of the additional plots saved by the user for this node. This information contains the function_name and func_args for all the plots saved in outplots_db.
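The way edges of the form [parent, child] can be derived from the parent_edge entries is sketched below. This is a hypothetical illustration, not the code of r_extract_wgraph: toy_extract_edges is an invented name, the toy object_db only keeps three entries per object, and the parent_edge value given for MFAobj_1 is an assumption made for the example.

```r
# Toy object_db restricted to the entries needed to build edges.
object_db <- list(
  list(id = 1, object_name = "proteins", parent_edge = "none"),
  list(id = 2, object_name = "mrna", parent_edge = "none"),
  list(id = 3, object_name = "combinedDF_1",
       parent_edge = c("mrna", "proteins")),
  list(id = 4, object_name = "MFAobj_1",
       parent_edge = "combinedDF_1")  # assumed value
)

# Hypothetical extraction: one c(parent_id, child_id) pair per parent.
toy_extract_edges <- function(db) {
  ids <- vapply(db, function(entry) entry$id, numeric(1))
  names(ids) <- vapply(db, function(entry) entry$object_name, character(1))
  edges <- list()
  for (entry in db) {
    # unlist() handles both vectors and lists of parents;
    # "none" marks imported objects, which have no parent.
    parents <- setdiff(unlist(entry$parent_edge), "none")
    for (parent in parents) {
      edges[[length(edges) + 1]] <- c(ids[[parent]], entry$id)
    }
  }
  edges
}

edges <- toy_extract_edges(object_db)
length(edges)  # 3 edges: mrna and proteins to combinedDF_1, then to MFAobj_1
```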
3.7.1 DAG information for datasets
For dataset nodes, the field meta is a list with the following information:
* nature, which corresponds to the entry nature of object_db
* class, which corresponds to the entry dataset_info$class of object_db
* nrow, ncol, nbmissing, nbnum, nbcat, nblogic, and nbothers, which all correspond to the entries of the same name in dataset_info$meta of object_db. This information is missing when the corresponding node is not of class data.frame
3.7.2 DAG information for analyses
For analysis nodes, the field meta contains information that can be used to
replay the exact same analysis. More precisely, it contains three fields:
* nature, which corresponds to the entry nature of object_db
* func_name, which corresponds to the entry func_name of object_db
* func_args, which corresponds to the entry func_args of object_db
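Replaying an analysis from these fields can be sketched with the base R function do.call. This is an illustration of the principle only: toy_scale and the content of node_meta are hypothetical, and the real interface goes through r_wrapp rather than calling the function directly.

```r
# Hypothetical analysis function.
toy_scale <- function(x, factor = 1) x * factor

# Hypothetical meta field of an analysis node.
node_meta <- list(
  nature = "toyResult",
  func_name = "toy_scale",
  func_args = list(x = c(1, 2, 3), factor = 2)
)

# The function name and the named argument list are enough to replay the call.
replayed <- do.call(node_meta$func_name, node_meta$func_args)
replayed
```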