Chapter 4 Dataset edition

The main function is r_edit_dataset and it provides several ways to modify a dataset:

transpose (action: "transpose")
change dataset nature (action: "set_dataset_nature" + dataset_nature)
subset rows (action: "subset" + row_names)
subset columns (action: "subset" + column_names)
change column types (action: "set_columns_type" + columns_type)
change rownames (action: "set_rownames" + row_names)
rename categories of a factor variable (action: "recode_categories" + new_categories)
reorder categories of a factor variable (action: "reorder_categories" + new_categories)
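
To give a feel for what these actions amount to, here is a minimal base R sketch on a toy data frame. It does not call r_edit_dataset and makes no claim about how the package implements each action; the data, the object names, and the assumption that new_categories is given in the order of the existing factor levels are all illustrative.

```r
# Toy data frame standing in for an imported dataset (made-up values).
df <- data.frame(
  x = c(1.5, 2.5, 3.5),
  y = c("10", "20", "30"),
  g = factor(c("low", "mid", "high")),
  row.names = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)

# "transpose": t() returns a matrix, so convert back to a data frame.
transposed <- as.data.frame(t(df["x"]))

# "subset" with row_names, then with column_names.
sub_rows <- df[c("s1", "s3"), , drop = FALSE]
sub_cols <- df[, c("x", "g"), drop = FALSE]

# "set_columns_type": turn a character column into a numeric one.
typed <- df
typed$y <- as.numeric(typed$y)

# "set_rownames": replace the row labels.
renamed <- df
rownames(renamed) <- c("a", "b", "c")

# "recode_categories": one new label per existing level
# (levels are sorted alphabetically here: "high", "low", "mid").
# Repeating a label merges the corresponding levels.
recoded <- df$g
levels(recoded) <- unlist(list("top", "bottom", "bottom"))

# "reorder_categories": same labels, different level order.
reordered <- factor(df$g, levels = c("low", "mid", "high"))
```

Note that the list of new labels is flattened with unlist() before being assigned: base R treats a list passed to levels<- as a named mapping rather than a positional vector.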

4.1 Parameters

The function has 8 arguments. The valid combinations of action and companion parameters are listed above.

param	short desc	class	required	default	description
datasetName	Dataset name	character	required	NULL	The name of a dataset to edit.
action	action to perform	character	optional	NULL	An action to perform among “transpose”, “subset”, “set_dataset_nature”, “set_columns_type”, “set_rownames”, “recode_categories” and “reorder_categories”
dataset_nature	new dataset nature in attributes	character	optional	NULL	A dataset nature (e.g. “generic”, “SNP data”), identical to the options offered when importing a new dataset. Only used if action is set to “set_dataset_nature”.
columns_type	new columns class	character	optional	NULL	A data type among “numerical” and “character”. Only used if action is set to “set_columns_type”.
new_categories	simple list of categories	list	optional	NULL	e.g. list(‘a’, ‘a’, ‘b’, NA)
row_names	vector of row names	character	optional	NULL	Vector of row names of the dataset, used e.g. when action is set to “subset”.
column_names	vector of column names	character	optional	NULL	Vector of column names of the dataset, used e.g. when action is set to “subset”, “set_columns_type” or “set_rownames”.
retrieve	name of an edited object to retrieve	character	optional	NULL	When the name of an edited dataset is given, the function outputs the corresponding objects and tables, in their current state.

Other parameters are ignored. When retrieve is ‘current’, the objects and tables related to the current analysis on datasetName are output. This is useful on the interface when going from one dataset to the other.
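
The pairing between an action and its companion parameters can be written down as a small lookup. The sketch below is a hypothetical helper, not part of the package: the names companions and has_companion are invented for illustration, and the mapping is transcribed from the action list at the top of this chapter only (it does not encode the additional uses of column_names mentioned in the table).

```r
# Hypothetical lookup: action -> companion parameters, as listed at the
# top of this chapter. This is an illustration, not package code.
companions <- list(
  transpose          = character(0),
  set_dataset_nature = "dataset_nature",
  subset             = c("row_names", "column_names"),  # at least one of them
  set_columns_type   = "columns_type",
  set_rownames       = "row_names",
  recode_categories  = "new_categories",
  reorder_categories = "new_categories"
)

# Returns TRUE when `args` (a named list) supplies at least one companion
# parameter for `action`, or when the action needs none.
has_companion <- function(action, args) {
  if (!action %in% names(companions)) stop("unknown action: ", action)
  needed <- companions[[action]]
  length(needed) == 0 || any(needed %in% names(args))
}

has_companion("subset", list(column_names = c("a", "b")))   # TRUE
has_companion("recode_categories", list(row_names = "a"))   # FALSE
```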

The r_edit_dataset function returns two lists: a list of objects containing the modified data and possibly an analysis, and a list of tables containing a dataview, an edition history, and the object name of the modified data.

# import protein dataset
input <- "../forge/backend/R/data/protein.csv"
r_wrapp("r_import", input = input, data.name = "proteins", header = TRUE,
        sep = " ", quote = "\"", dec = ".")

## {"Messages":{"type":"notification","data":[{"type":"warning","text":"a 'row.names' column has been found, setting it to row labels."}]}}

4.2 Example: subset first 5 columns

out_edition <- r_edit_dataset(datasetName = "proteins", 
                              action = "subset", 
                              column_names = colnames(proteins)[1:5])

Two outputs are produced: Object and Table.

names(out_edition)

## [1] "Object" "Table"

The Object component contains two objects:

names(out_edition$Object)

## [1] "editor" "edited"

editor: the internal edition object, that contains the instructions and the history of the edition.
edited: the new dataset.

The Table component contains five objects:

names(out_edition$Table)

## [1] "Colnames"       "Rownames"       "ObjectName"     "HistoryEdition"
## [5] "CatVarNames"

The DataView and HistoryEdition objects will be printed on the interface.

out_edition$Table$ObjectName is the name of the dataset that has been created (as in the R environment, when the function is called with r_wrapp).
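
Because ObjectName is a name rather than the dataset itself, the dataset has to be looked up by that name. The sketch below mocks the documented structure with placeholder contents and a private environment; it assumes ObjectName is a plain character string, which the real output may not guarantee.

```r
# Private environment standing in for the R session used by r_wrapp.
env <- new.env()
assign("edited_1", data.frame(a = 1:3, b = 4:6), envir = env)

# Mock of the documented return structure (contents are placeholders).
out_mock <- list(
  Object = list(editor = list(), edited = get("edited_1", envir = env)),
  Table  = list(ObjectName = "edited_1")
)

# Look the dataset up by the name reported in Table$ObjectName.
obj_name <- out_mock$Table$ObjectName
edited   <- get(obj_name, envir = env)
```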

4.3 Perform several actions

We use r_wrapp to reproduce the interface behaviour:

out_edition <- r_wrapp(funcName = "r_edit_dataset", 
                       datasetName = "proteins", 
                       action = "subset", 
                       column_names = colnames(proteins)[1:5])

Two new objects are created in the environment:

the analysis (editor_1)
the edited dataset (edited_1), hidden for now.

The editor analysis contains the list of successive actions (here only 1) performed on the dataset.

jsonview::json_tree_view(editor_1, scroll = TRUE)

The edited dataset contains the resulting dataset:

knitr::kable(head(edited_1))

	14.3.3_epsilon	4E.BP1	4E.BP1_pS65	4E.BP1_pT37	4E.BP1_pT70
A0SH	-0.2146377	-0.0348712	-0.1492393	-0.1141046	0.1883281
A0SJ	0.1343556	0.2398837	-0.1730644	-0.8782864	-0.1662046
A0SK	0.2186520	2.3489380	0.4798254	1.2803679	0.8254676
A0SO	-0.1124943	0.2786066	0.1776554	0.6728072	0.4581671
A04N	0.0021150	0.7086549	-0.2063496	0.6498476	0.1838435
A04P	0.0717803	-0.1426144	0.2463977	0.6771736	0.1320518

Now we perform a second action on the same dataset:

out_edition <- r_wrapp(funcName = "r_edit_dataset", 
                       datasetName = "proteins", 
                       action = "subset", 
                       row_names = rownames(proteins)[1:5])

jsonview::json_tree_view(editor_1, scroll = TRUE)

The editor analysis contains the list of successive actions (now 2) performed on the dataset.

The edited dataset has been replaced with the newly created dataset.

knitr::kable(edited_1)

	14.3.3_epsilon	4E.BP1	4E.BP1_pS65	4E.BP1_pT37	4E.BP1_pT70
A0SH	-0.2146377	-0.0348712	-0.1492393	-0.1141046	0.1883281
A0SJ	0.1343556	0.2398837	-0.1730644	-0.8782864	-0.1662046
A0SK	0.2186520	2.3489380	0.4798254	1.2803679	0.8254676
A0SO	-0.1124943	0.2786066	0.1776554	0.6728072	0.4581671
A04N	0.0021150	0.7086549	-0.2063496	0.6498476	0.1838435
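
The editor/edited pair behaves like an action log plus the current result: each call is applied to the result of the previous one, which is why the second subset above still has only five columns. The toy sketch below reproduces that idea in base R. The functions new_editor and apply_subset and the structure of the history are hypothetical and do not reflect the package's internal objects.

```r
# Toy editor: remembers the source dataset name and the recorded actions.
new_editor <- function(source_name) {
  list(source = source_name, history = list())
}

# Apply one subset action to `data` and append it to the editor history.
apply_subset <- function(editor, data, row_names = NULL, column_names = NULL) {
  if (!is.null(row_names))    data <- data[row_names, , drop = FALSE]
  if (!is.null(column_names)) data <- data[, column_names, drop = FALSE]
  step <- list(action = "subset", row_names = row_names,
               column_names = column_names)
  editor$history[[length(editor$history) + 1]] <- step
  list(editor = editor, edited = data)
}

toy <- data.frame(a = 1:4, b = 5:8, c = 9:12,
                  row.names = c("r1", "r2", "r3", "r4"))

# First action: keep two columns. Second action: keep two rows of the result.
state <- apply_subset(new_editor("toy"), toy, column_names = c("a", "b"))
state <- apply_subset(state$editor, state$edited, row_names = c("r1", "r2"))

length(state$editor$history)  # 2 recorded actions
dim(state$edited)             # 2 rows, 2 columns
```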

4.4 Extract the dataset

When the edition is done, the resulting dataset can be saved by calling the r_extract_dataset function.

res <- r_wrapp("r_extract_dataset", datasetName = "edited_1", userName = "test")

4.5 Retrieve the current actions performed on a dataset

Selecting a dataset on the edition screen will call the action “retrieve” to get any analysis not “finished” (i.e. not extracted yet) if there is one, with all the already performed actions.

# Not run
out_retrieve <- r_edit_dataset(
    datasetName = "proteins", 
    retrieve = "current")

It has the same structure as any r_edit_dataset output: a list of length 2 (Object and Table).

The Object component is the edited dataset (or, if no edition has been performed yet, the current state of the dataset).
The Table component contains a DataView, the ObjectName (the name of the resulting dataset) and the history.

When an edition object is retrieved from the workspace screen (by clicking on “more”), the function will retrieve the current edition analysis by calling the name of the produced dataset:

# Not run
out_retrieve <- r_edit_dataset(
    datasetName = "proteins", 
    retrieve = "edited_1")

4.6 Edit a retrieved dataset

It is possible to edit an already extracted dataset or, more precisely, to start a new edition based on an extracted dataset. Immediately after retrieval, any action creates a new editor object, associated with a hidden edited object. If an edition on another dataset that has not been extracted yet is in progress when the action is performed, that edition is deleted. This ensures that there is at most one edition in progress for a given original dataset.

To achieve this, an attribute called edit_head is added to the original dataset at the time of retrieval. It points to the dataset on which the next action should be performed. This attribute is updated at every call of r_edit_dataset, so that it points to an already extracted dataset only immediately after retrieval. After any other action, it is set to NULL (that is, removed), so that the regular workflow is preserved.
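
The attribute mechanism can be illustrated with base R's attr(). This is a sketch of the idea only: the helper target_name and the fallback to the original dataset name are assumptions made for illustration, not the package's code.

```r
original <- data.frame(a = 1:3)

# Hypothetical helper: which dataset should the next action be applied to?
# Falls back to `default` when no edit_head attribute is set.
target_name <- function(x, default) {
  ptr <- attr(x, "edit_head")
  if (is.null(ptr)) default else ptr
}

# At retrieval: point the original at the extracted dataset to edit next.
attr(original, "edit_head") <- "edited_1"
after_retrieval <- target_name(original, default = "proteins")

# After any other action: assigning NULL removes the attribute in R.
attr(original, "edit_head") <- NULL
after_action <- target_name(original, default = "proteins")
```

In R, assigning NULL to an attribute deletes it, so "set to NULL" and "removed" describe the same state.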