Chapter 4 Dataset edition

The main function is r_edit_dataset and it provides several ways to modify a dataset:

  • transpose (action: "transpose")
  • change dataset nature (action: "set_dataset_nature" + dataset_nature)
  • subset rows (action: "subset" + row_names)
  • subset columns (action: "subset" + column_names)
  • change column types (action: "set_columns_type" + columns_type)
  • change rownames (action: "set_rownames" + row_names)
  • rename categories of a factor variable (action: "recode_categories" + new_categories)
  • reorder categories of a factor variable (action: "reorder_categories" + new_categories)
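
For orientation, most of the actions above have a direct counterpart in base R. The sketch below shows those counterparts on a made-up data frame (toy and all the object names are illustrative). It is not the implementation of r_edit_dataset, only a reminder of what each kind of modification does to the data.

```r
# Base R counterparts of the listed actions, on a toy data frame.
# Illustration only: this is not how r_edit_dataset is implemented.
toy <- data.frame(
  x = c(1.5, 2.5, 3.5),
  y = c("10", "20", "30"),
  row.names = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)

# transpose: rows become columns (the result is a matrix)
transposed <- t(toy)

# subset rows, then columns, by name
row_subset <- toy[c("s1", "s3"), , drop = FALSE]
col_subset <- toy[, "x", drop = FALSE]

# change a column type: character to numeric
typed <- toy
typed$y <- as.numeric(typed$y)

# change row names
renamed <- toy
rownames(renamed) <- c("a1", "a2", "a3")

# recode categories: one new label per existing level, in level order;
# repeated labels merge levels, and NA turns a level into missing values
f <- factor(c("w", "x", "y", "z"))
recoded <- f
levels(recoded) <- c("a", "a", "b", NA)

# reorder categories: same labels, different level order
reordered <- factor(f, levels = c("z", "y", "x", "w"))
```

The recoding step has the same shape as the new_categories example given in the parameter table, list(‘a’, ‘a’, ‘b’, NA): one new label per existing category.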

4.1 Parameters

The function has 8 arguments (the valid combinations of action and parameters are listed above).

| param | short desc | class | required | default | description |
|---|---|---|---|---|---|
| datasetName | Dataset name | character | required | NULL | The name of a dataset to edit. |
| action | action to perform | character | optional | NULL | An action to perform among “transpose”, “subset”, “set_dataset_nature”, “set_columns_type”, “set_rownames”, “recode_categories” and “reorder_categories”. |
| dataset_nature | new dataset nature in attributes | character | optional | NULL | A dataset nature (e.g. “generic”, “SNP data”), identical to the options offered when importing a new dataset. Only used if action is set to “set_dataset_nature”. |
| columns_type | new columns class | character | optional | NULL | A data type among “numerical” and “character”. Only used if action is set to “set_columns_type”. |
| new_categories | simple list of categories | list | optional | NULL | e.g. list(‘a’, ‘a’, ‘b’, NA) |
| row_names | vector of row names | character | optional | NULL | Vector of row names of the dataset, used e.g. when action is set to “subset”. |
| column_names | vector of column names | character | optional | NULL | Vector of column names of the dataset, used e.g. when action is set to “subset”, “set_columns_type” or “set_rownames”. |
| retrieve | name of an edited object to retrieve | character | optional | NULL | When the name of an edited dataset is given, the function outputs the corresponding objects and tables, in their current state. |

Other parameters are ignored. When retrieve is ‘current’, the objects and tables related to the current analysis on datasetName are output. This is useful on the interface when going from one dataset to the other.
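
The pairing between each action and its companion parameter can be written down as a small lookup. The sketch below is a hypothetical helper (companions and check_edit_args are not part of the package) that only checks whether a companion parameter has been supplied for a given action. It assumes that “subset” accepts either row_names or column_names, and it does not attempt to model the retrieve parameter.

```r
# Hypothetical lookup (not part of the package): companion parameter(s)
# documented for each action.
companions <- list(
  transpose          = character(0),
  set_dataset_nature = "dataset_nature",
  subset             = c("row_names", "column_names"),  # either one
  set_columns_type   = "columns_type",
  set_rownames       = "row_names",
  recode_categories  = "new_categories",
  reorder_categories = "new_categories"
)

# Returns TRUE when the action needs no companion parameter, or when at
# least one of its companion parameters is among the supplied arguments.
check_edit_args <- function(action, ...) {
  supplied <- names(list(...))
  if (!action %in% names(companions)) {
    stop("unknown action: ", action)
  }
  needed <- companions[[action]]
  length(needed) == 0 || any(needed %in% supplied)
}

ok_transpose <- check_edit_args("transpose")
ok_subset    <- check_edit_args("subset", column_names = c("a", "b"))
ok_missing   <- check_edit_args("set_columns_type")
```

A check of this kind would typically run before any modification is attempted, so that an incomplete call fails early with a clear message.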

The r_edit_dataset function returns a list of objects containing the modified data and possibly an analysis, as well as a list of tables containing a dataview, an edition history, and the object name of the modified data.

# import protein dataset
input <- "../forge/backend/R/data/protein.csv"
r_wrapp("r_import", input = input, data.name = "proteins", header = TRUE,
        sep = " ", quote = "\"", dec = ".")
## {"Messages":{"type":"notification","data":[{"type":"warning","text":"a 'row.names' column has been found, setting it to row labels."}]}}

4.2 Example: subset first 5 columns

out_edition <- r_edit_dataset(datasetName = "proteins", 
                              action = "subset", 
                              column_names = colnames(proteins)[1:5])

Two outputs are produced: Object and Table.

names(out_edition)
## [1] "Object" "Table"

The Object component contains two objects:

names(out_edition$Object)
## [1] "editor" "edited"
  • editor: the internal edition object, which contains the instructions and the history of the edition.
  • edited: the new dataset.

The Table component contains five objects:

names(out_edition$Table)
## [1] "Colnames"       "Rownames"       "ObjectName"     "HistoryEdition"
## [5] "CatVarNames"

The DataView and HistoryEdition objects will be printed on the interface.

out_edition$Table$ObjectName is the name of the dataset that has been created (as in the R environment, when the function is called with r_wrapp).

4.3 Perform several actions

We use r_wrapp to reproduce the interface behaviour:

out_edition <- r_wrapp(funcName = "r_edit_dataset", 
                       datasetName = "proteins", 
                       action = "subset", 
                       column_names = colnames(proteins)[1:5])

Two new objects are created in the environment:

  • the analysis (editor_1)
  • the edited dataset (edited_1), hidden for now.

The editor analysis contains the list of successive actions (here only 1) performed on the dataset.

jsonview::json_tree_view(editor_1, scroll = TRUE)

The edited dataset contains the resulting dataset:

knitr::kable(head(edited_1))
|      | 14.3.3_epsilon | 4E.BP1 | 4E.BP1_pS65 | 4E.BP1_pT37 | 4E.BP1_pT70 |
|---|---|---|---|---|---|
| A0SH | -0.2146377 | -0.0348712 | -0.1492393 | -0.1141046 | 0.1883281 |
| A0SJ | 0.1343556 | 0.2398837 | -0.1730644 | -0.8782864 | -0.1662046 |
| A0SK | 0.2186520 | 2.3489380 | 0.4798254 | 1.2803679 | 0.8254676 |
| A0SO | -0.1124943 | 0.2786066 | 0.1776554 | 0.6728072 | 0.4581671 |
| A04N | 0.0021150 | 0.7086549 | -0.2063496 | 0.6498476 | 0.1838435 |
| A04P | 0.0717803 | -0.1426144 | 0.2463977 | 0.6771736 | 0.1320518 |

Now we perform a second action on the same dataset:

out_edition <- r_wrapp(funcName = "r_edit_dataset", 
                       datasetName = "proteins", 
                       action = "subset", 
                       row_names = rownames(proteins)[1:5])
jsonview::json_tree_view(editor_1, scroll = TRUE)

The editor analysis contains the list of successive actions (now 2) performed on the dataset.

The edited dataset has been replaced with the newly created dataset.

knitr::kable(edited_1)
|      | 14.3.3_epsilon | 4E.BP1 | 4E.BP1_pS65 | 4E.BP1_pT37 | 4E.BP1_pT70 |
|---|---|---|---|---|---|
| A0SH | -0.2146377 | -0.0348712 | -0.1492393 | -0.1141046 | 0.1883281 |
| A0SJ | 0.1343556 | 0.2398837 | -0.1730644 | -0.8782864 | -0.1662046 |
| A0SK | 0.2186520 | 2.3489380 | 0.4798254 | 1.2803679 | 0.8254676 |
| A0SO | -0.1124943 | 0.2786066 | 0.1776554 | 0.6728072 | 0.4581671 |
| A04N | 0.0021150 | 0.7086549 | -0.2063496 | 0.6498476 | 0.1838435 |
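
The way successive actions accumulate can be pictured with a small base R sketch. It assumes an editor that simply stores the ordered list of actions and replays them on the original data; how the package actually stores and applies its history is not shown in this chapter, so add_action, replay and the toy data below are hypothetical.

```r
# Toy stand-in for the "proteins" dataset (values are made up).
toy <- as.data.frame(matrix(
  1:48, nrow = 8,
  dimnames = list(paste0("r", 1:8), paste0("c", 1:6))
))

# Hypothetical editor: just the ordered list of actions performed so far.
editor <- list(history = list())

# Record one action and return the updated editor.
add_action <- function(editor, action, ...) {
  step <- list(action = action, args = list(...))
  editor$history[[length(editor$history) + 1]] <- step
  editor
}

# Replay the whole history on the original data to get the edited dataset.
replay <- function(data, editor) {
  for (step in editor$history) {
    if (step$action == "subset") {
      rows <- step$args$row_names
      cols <- step$args$column_names
      if (is.null(rows)) rows <- rownames(data)
      if (is.null(cols)) cols <- colnames(data)
      data <- data[rows, cols, drop = FALSE]
    } else if (step$action == "transpose") {
      data <- as.data.frame(t(data))
    }
  }
  data
}

# Same sequence as in the text: subset columns, then subset rows.
editor <- add_action(editor, "subset", column_names = colnames(toy)[1:5])
editor <- add_action(editor, "subset", row_names = rownames(toy)[1:5])
edited <- replay(toy, editor)

n_actions  <- length(editor$history)
edited_dim <- dim(edited)
```

As in the example above, the second subset is applied on top of the first, so the result keeps 5 rows and 5 columns while the history holds two entries.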

4.4 Extract the dataset

When the edition is done, the resulting dataset can be saved by calling the r_extract_dataset function.

res <- r_wrapp("r_extract_dataset", datasetName = "edited_1", userName = "test")

4.5 Retrieve the current actions performed on a dataset

Selecting a dataset on the edition screen calls r_edit_dataset with the retrieve parameter. If there is an analysis on this dataset that is not “finished” (i.e. not extracted yet), it is returned with all the actions already performed.

# Not run
out_retrieve <- r_edit_dataset(
    datasetName = "proteins", 
    retrieve = "current")

It has the same structure as any r_edit_dataset output: a list of length 2 (Object and Table).

  • The Object component is the edited dataset (or, if no edition has been performed yet, the current state of the dataset).
  • The Table component contains a DataView, the ObjectName (the name of the resulting dataset) and the history.

When an edition object is retrieved from the workspace screen (by clicking on “more”), the function will retrieve the current edition analysis by calling the name of the produced dataset:

# Not run
out_retrieve <- r_edit_dataset(
    datasetName = "proteins", 
    retrieve = "edited_1")

4.6 Edit a retrieved dataset

It is possible to edit an already extracted dataset or, more precisely, to start a new edition based on an extracted dataset. Immediately after retrieval, any action creates a new editor object, associated with a hidden edited object. If an edition is in progress on another dataset that has not been extracted when the action is performed, that edition is deleted. This ensures that there is at most one edition going on at any time for a given original dataset.

To achieve this, an attribute called edit_head is added to the original dataset at the time of retrieval. It points to the dataset on which the next action should be performed. The attribute is updated at every call of r_edit_dataset, so that it points to an already extracted dataset only immediately after retrieval. After any other action, it is set to NULL (that is, removed), so that the regular workflow is preserved.
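
The attribute mechanism can be illustrated with base R's attr(). In the sketch below, only the attribute name edit_head comes from the text. The helper next_edit_source, the toy dataset and the fallback to the original dataset name when the attribute is absent are assumptions made for the example.

```r
# Toy original dataset (made up); "edited_1" stands for a previously
# extracted dataset, referred to by name.
proteins_toy <- data.frame(a = 1:3, b = 4:6)

# On retrieval: point edit_head at the extracted dataset.
attr(proteins_toy, "edit_head") <- "edited_1"

# Hypothetical helper: name of the dataset the next action should start
# from. Falls back to default_name when edit_head is absent (assumption).
next_edit_source <- function(data, default_name) {
  head_name <- attr(data, "edit_head", exact = TRUE)
  if (is.null(head_name)) default_name else head_name
}

source_after_retrieval <- next_edit_source(proteins_toy, "proteins")

# After any other action: assigning NULL removes the attribute.
attr(proteins_toy, "edit_head") <- NULL
source_after_action <- next_edit_source(proteins_toy, "proteins")
```

Storing the pointer on the original dataset keeps the information attached to the object it concerns, and removing it restores the dataset to its usual form.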