Chapter 8 Clustering

8.1 First step: choose the number of clusters

This document describes the clustering workflow as implemented in ASTERICS.

The r_clustering function, introduced in this document, is the first step of the clustering analysis: it helps the user choose the number of clusters. It takes the following arguments as inputs:

datasetName: a numerical dataset
method: the clustering method ("hac": HAC or "kmeans": K-means)
kmin: the minimum number of clusters considered in the analysis. Used only for k-means and forced to 1 for HAC. When not provided for k-means, it is silently set to 1.
kmax: the maximum number of clusters considered in the analysis. Used only for k-means and forced to $\frac{\text{Number of observations}}{5}$ for HAC. When not provided for k-means, it is silently set to this value.
seed: the random seed used to ensure reproducibility of the k-means analysis. This argument is intended for advanced users only; by default (seed = NULL), it is automatically set to a value that depends on the time.

The returned outputs depend on the chosen method.

HAC
- Graphical: within-group variation plot, silhouette plot and dendrogram
- Object: output of the function hclust (class is hclust)
- Table: chosen broken stick value (passed directly in BS) and name of the output object (ObjectName)
k-means
- Graphical: within-group variation plot and silhouette plot
- Object: a global object with slots Graphical and Table (only when kmin != kmax) and the list of k-means objects that had not been computed previously
- Table: chosen broken stick value (passed directly in BS) and name of the output k-means objects (ObjectName)

Note that when an analysis has already been performed, it is simply retrieved from the workspace. In this case, the output does not contain the entry Object. For k-means, seed is used as follows: when the seed is not provided by the user, the last clustering with the same criteria is retrieved. When it is provided, the clustering is retrieved if the corresponding analysis with the same seed already exists, and performed otherwise.
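The role of the seed can be illustrated outside ASTERICS with the base function kmeans. This is a minimal sketch, not the code of r_clustering: the iris toy data and the formula used to derive a time-dependent seed are assumptions made for the example.

```r
# Toy data: the four numeric columns of the built-in iris dataset
# (a stand-in for a real ASTERICS dataset).
x <- scale(iris[, 1:4])

# Same seed, same random starting centers, hence the same partition.
set.seed(12)
fit_a <- kmeans(x, centers = 3)
set.seed(12)
fit_b <- kmeans(x, centers = 3)
identical(fit_a$cluster, fit_b$cluster)

# A time-dependent seed, similar in spirit to the documented default
# (assumption: the exact rule used by r_clustering is not shown here).
seed_from_time <- as.integer(Sys.time()) %% 10000L
seed_from_time
```

Storing the seed alongside the result, as the func_args.seed attribute does in the outputs of this chapter, is what makes it possible to retrieve or rerun exactly the same clustering later.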

We will use the protein dataset for illustration:

input <- "../forge/backend/R/data/protein.csv"
r_wrapp("r_import", input = input, data.name = "proteins", nature = "proteins", 
        row.names = 1, sep = " ")

## NULL

8.1.1 Choose a clustering method

The available clustering methods are Hierarchical Agglomerative Clustering (HAC) and k-means.

Hierarchical Agglomerative Clustering (HAC) is a technique that builds a hierarchy of nested partitions: starting from $n$ individuals, each in its own cluster, it performs a series of successive mergers, so that the data can be split into a varying number of groups. ASTERICS forces the use of the Euclidean distance combined with Ward’s method. At each step, Ward’s method merges the pair of clusters that minimizes the increase in within-cluster sum of squares.

Note to the user: This clustering algorithm can cope with a high dimension but does not scale very well with the number of observations.

hac_proteins <- r_clustering("proteins", method = "hac")
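For readers who want to reproduce the idea outside ASTERICS, the same kind of object can be obtained directly with hclust. This is only a sketch: the document states that the returned object has class hclust, but the use of method = "ward.D2" on unsquared Euclidean distances, the scaling step and the USArrests toy data are assumptions.

```r
# Toy data: built-in USArrests, scaled so that all variables are comparable
# (scaling is a choice made for this example only).
x <- scale(USArrests)

# Euclidean distances between observations, then Ward's criterion.
d <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

class(hc)            # "hclust"
length(hc$height)    # one height per merger, i.e. n - 1 values
head(hc$merge)       # which clusters are merged at the first steps
```

The full distance matrix has $n(n-1)/2$ entries, which is the reason why this approach does not scale well with the number of observations.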

k-means clustering assigns every observation to the cluster whose center of gravity is the closest. It thus tries to minimize the within-group sum of squares for a given number of clusters, which has to be chosen in advance. To begin with, centers are randomly drawn and each point is assigned to its closest center; the centers are then recomputed as the centers of gravity of their clusters, and both steps are repeated until the assignment stabilizes. By selecting method = "kmeans", r_clustering enables the user to run this algorithm over a chosen range of numbers of clusters, from kmin to kmax.

Note to the user: As opposed to HAC, this method is more appropriate for a large number of observations but it does not cope well with a high dimension.
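The iterations of k-means can be sketched in a few lines of base R. The function lloyd_kmeans below is a hypothetical helper written for illustration (a plain Lloyd-type iteration); it is not the algorithm variant used by kmeans by default, nor the code of r_clustering.

```r
# Minimal Lloyd-type k-means: random initial centers, then alternate
# between assignment to the closest center and update of the centers.
lloyd_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # n x k matrix of squared Euclidean distances to each center
    d2 <- vapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    }, numeric(nrow(x)))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break   # assignment is stable
    cluster <- new_cluster
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Two well-separated groups of 10 points each
set.seed(1)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2))
fit <- lloyd_kmeans(toy, k = 2)
table(fit$cluster, rep(c("A", "B"), each = 10))
```

Because the starting centers are random, different runs can end in different local optima on less clear-cut data, which is why the seed matters for this method.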

Here, we run the method for a number of clusters ranging from 3 to 20:

kmeans_proteins <- r_clustering("proteins", method = "kmeans", kmin = 3, 
                                kmax = 20)

Note: The computational time of the method can be large if the range contains too many values.
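Running k-means over a range of values amounts to one fit per number of clusters, which explains the cost of a wide range. The sketch below uses the base function kmeans on the iris toy data; the values of kmin, kmax, nstart and iter.max are assumptions chosen for the example, not the settings of r_clustering.

```r
x <- scale(iris[, 1:4])
kmin <- 3
kmax <- 8

# One k-means fit per candidate number of clusters
set.seed(233)
fits <- lapply(kmin:kmax, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)
})
names(fits) <- paste0("k_", kmin:kmax)

# Total within-group sum of squares of each fit
within_ss <- vapply(fits, function(f) f$tot.withinss, numeric(1))
within_ss
```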

8.1.2 Evaluate cluster qualities

The result of the function consists of two plots, displaying the evolution of the within-group variance and of the silhouette (a measure of how distinct the clusters of a clustering are from each other) versus the number of clusters. In addition, the optimal number of clusters suggested by the broken stick heuristic is returned (restricted to values between kmin and kmax). A red dashed vertical line shows this value in the plots.

8.1.2.1 Within-clusters evaluation

The goal is to identify a number of clusters that produces a small total within-cluster variation. The user can visually identify a trade-off between the number of clusters and the inertia.

HAC

hac_proteins$Graphical$Variation

k-means

kmeans_proteins$Graphical$Variation
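The quantity displayed in these plots can be recomputed by hand. The sketch below does so for a HAC on the USArrests toy data; within_ss is a hypothetical helper and the range 1 to 10 is an arbitrary choice, so this only approximates what the Variation plot shows.

```r
# Total within-cluster sum of squares for a partition cl of the rows of x
within_ss <- function(x, cl) {
  groups <- split(as.data.frame(x), cl)
  sum(vapply(groups, function(g) {
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
}

x <- scale(USArrests)
hc <- hclust(dist(x), method = "ward.D2")

# Within-cluster variation for 1 to 10 clusters
ks <- 1:10
wss <- vapply(ks, function(k) within_ss(x, cutree(hc, k = k)), numeric(1))
data.frame(k = ks, within_ss = round(wss, 2))
```

With a hierarchical clustering the partitions are nested, so the curve can only decrease when the number of clusters increases; what matters is where the decrease starts to level off.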

8.1.2.2 Silhouette evaluation

Similarly, the plot should help identify a number of clusters that is a good tradeoff between separability of the clusters and number of clusters.

HAC

hac_proteins$Graphical$Silhouette

k-means

kmeans_proteins$Graphical$Silhouette
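For reference, the average silhouette width can be computed from a distance matrix and a vector of cluster labels. The function mean_silhouette below is a hypothetical base R helper following the usual definition (the function silhouette of the cluster package provides the same measure); it is not the code used to draw the plots above.

```r
# Average silhouette width of a partition cl (at least two clusters),
# given a distance object d.
mean_silhouette <- function(d, cl) {
  dm <- as.matrix(d)
  labels <- unique(cl)
  s <- numeric(nrow(dm))
  for (i in seq_len(nrow(dm))) {
    own <- cl == cl[i]
    if (sum(own) == 1) next                  # singleton: silhouette set to 0
    a <- sum(dm[i, own]) / (sum(own) - 1)    # mean distance within own cluster
    b <- min(vapply(setdiff(labels, cl[i]), function(l) {
      mean(dm[i, cl == l])                   # mean distance to another cluster
    }, numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

x <- scale(USArrests)
d <- dist(x)
hc <- hclust(d, method = "ward.D2")
sil <- vapply(2:6, function(k) mean_silhouette(d, cutree(hc, k = k)), numeric(1))
data.frame(k = 2:6, silhouette = round(sil, 3))
```

Values close to 1 indicate compact, well-separated clusters, while values close to 0 or negative indicate overlapping clusters.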

8.1.2.3 Broken stick

HAC

hac_proteins$Table$BS

## [1] 10

k-means

kmeans_proteins$Table$BS

## [1] 9
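The way the broken stick heuristic is applied in r_clustering is not detailed in this document. The sketch below shows one common way to use it, under loud assumptions: the dendrogram heights are taken as the observed quantities, compared with the expected lengths of the pieces of a stick broken at random, and the helpers broken_stick and suggest_k are hypothetical.

```r
# Expected proportions of the p pieces of a randomly broken stick,
# sorted from the longest to the shortest piece.
broken_stick <- function(p) {
  vapply(seq_len(p), function(j) sum(1 / (j:p)) / p, numeric(1))
}

# Assumed rule: count the leading mergers whose share of the total height
# exceeds the broken stick expectation, then add one to get a number of clusters.
suggest_k <- function(heights) {
  h <- sort(heights, decreasing = TRUE)
  observed <- h / sum(h)
  expected <- broken_stick(length(h))
  above <- observed > expected
  n_above <- if (all(above)) length(h) else which(!above)[1] - 1L
  n_above + 1L
}

x <- scale(USArrests)
hc <- hclust(dist(x), method = "ward.D2")
round(broken_stick(4), 3)
suggest_k(hc$height)
```

The value returned by this sketch is only a suggestion in the same spirit as $Table$BS; it should be read together with the variation and silhouette plots.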

8.1.2.4 Dendrogram

r_clustering returns one more output for HAC: a dendrogram. It illustrates the clusters produced at each step. The height represents the difference in within-group inertia before and after the merger. A horizontal cut of the tree at a chosen height defines a clustering. Hovering over the tree provides the user with information about the clusterings. Finally, the number of clusters chosen by broken stick is also displayed as a horizontal line.

hac_proteins$Graphical$Dendrogram
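The link between a horizontal cut and a clustering can be checked with the base function cutree. This is a sketch on the USArrests toy data with an arbitrary k = 4; it does not reproduce the interactive dendrogram.

```r
x <- scale(USArrests)
hc <- hclust(dist(x), method = "ward.D2")
n <- nrow(x)
k <- 4

# Cut defined by a number of clusters
by_k <- cutree(hc, k = k)

# Equivalent cut defined by a height: any value between the heights of the
# (n - k)-th and (n - k + 1)-th mergers leaves exactly k clusters.
h <- mean(hc$height[c(n - k, n - k + 1)])
by_h <- cutree(hc, h = h)

all(by_k == by_h)
table(by_k)
```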

8.1.2.5 Object name

In $Table, an entry $ObjectName is included that gives the name of the global output object (HAC) or of all the output k-means objects (k-means), as recorded in the object_db file. The last integer in the name is the number of existing objects of the same nature + 1 (it is thus a unique identifier).

out <- jsonlite::toJSON(hac_proteins$Table, pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, height='150px')

out <- jsonlite::toJSON(kmeans_proteins$Table, pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
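The naming rule described above can be summarized by a small function. next_object_name is a hypothetical helper written for this illustration; the way ASTERICS actually reads object_db is not shown in this document.

```r
# Name of a new object: its nature followed by the number of existing
# objects of the same nature + 1.
next_object_name <- function(nature, existing_natures) {
  paste0(nature, "_", sum(existing_natures == nature) + 1L)
}

# Empty workspace: first HAC clustering
next_object_name("HACClustering", character(0))
# → "HACClustering_1"

# Two k-means cuts and one HAC clustering already recorded
recorded <- c("KmeansCut", "KmeansCut", "HACClustering")
next_object_name("KmeansCut", recorded)
# → "KmeansCut_3"
```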

8.1.2.6 Additional objects

In addition to the multiple outputs already described, r_clustering returns objects from hclust and kmeans that are assigned to the global environment by r_wrapp and can be used (invisibly) by other functions.

names(hac_proteins$Object)

## [1] "HAC"

names(kmeans_proteins$Object)

##  [1] "Kmeans"       "KmeansCut_1"  "KmeansCut_2"  "KmeansCut_3"  "KmeansCut_4" 
##  [6] "KmeansCut_5"  "KmeansCut_6"  "KmeansCut_7"  "KmeansCut_8"  "KmeansCut_9" 
## [11] "KmeansCut_10" "KmeansCut_11" "KmeansCut_12" "KmeansCut_13" "KmeansCut_14"
## [16] "KmeansCut_15" "KmeansCut_16" "KmeansCut_17" "KmeansCut_18"

added to R workspace as:

list_attr <- c("class", "object_name", "nature", "type", "func_args", 
               "origin_dataset", "logt", "normalized", "parent_edge")
unlist(attributes(hac_proteins$Object$HAC)[list_attr])

##                 class           object_name                nature 
##              "hclust"     "HACClustering_1"       "HACClustering" 
##                  type func_args.datasetName      func_args.method 
##            "analysis"            "proteins"                 "hac" 
##        func_args.kmin        func_args.kmax        func_args.seed 
##                   "1"                  "76"                   "1" 
##        origin_dataset                  logt            normalized 
##            "proteins"                  "nc"                  "nc" 
##           parent_edge 
##            "proteins"

unlist(attributes(kmeans_proteins$Object$Kmeans)[list_attr])

##           object_name                nature                  type 
##  "KmeansClustering_1"    "KmeansClustering"            "analysis" 
## func_args.datasetName      func_args.method        func_args.kmin 
##            "proteins"              "kmeans"                   "3" 
##        func_args.kmax        func_args.seed        origin_dataset 
##                  "20"                 "233"            "proteins" 
##                  logt            normalized           parent_edge 
##                  "nc"                  "nc"            "proteins"

unlist(attributes(kmeans_proteins$Object$KmeansCut_10)[list_attr])

##                 class           object_name                nature 
##              "kmeans"        "KmeansCut_10"           "KmeansCut" 
##                  type func_args.datasetName      func_args.method 
##              "hidden"            "proteins"              "kmeans" 
##        func_args.kmin        func_args.kmax        func_args.seed 
##                  "12"                  "12"                 "233" 
##        origin_dataset                  logt            normalized 
##            "proteins"                  "nc"                  "nc" 
##           parent_edge 
##  "KmeansClustering_1"

Note that, when they have never been explored with the function r_cut_clustering, all objects of type KmeansCut are hidden in the DAG (attribute type is "hidden").

8.2 Second step: extract and explore clusters

These analyses give the user a first idea of the number of clusters to choose. The function r_cut_clustering works on the output object of r_clustering and provides an exploratory analysis of the clustering for a chosen number of clusters.

Inputs:

datasetName: for HAC, the name of the output object of the function r_clustering as described above (also given in the output $Table$ObjectName); for k-means, the name of the output object whose number of clusters corresponds to the input k
k: number of clusters chosen by the user

Outputs:

HAC
- Graphical: a dotplot, a heatmap and a dendrogram
- Object: a data.frame with the clustering in the first column. The hope is that this object (which will be added invisibly to the workspace) can be displayed using the function r_create_dataview. This object also has an attribute "Graphical" that contains the above-mentioned plots
- Table: a DataView that displays the clusters (for every individual), a SummaryTable that contains the frequency table of the clusters and an entry ObjectName that contains the name of the output object as referenced in object_db
K-means
- Graphical: a dotplot and a heatmap
- Object: a data.frame with the clustering in the first column. The hope is that this object (which will be added invisibly to the workspace) can be displayed using the function r_create_dataview. This object also has an attribute "Graphical" that contains the above-mentioned plots
- Table: a DataView that displays the clusters (for every individual), a SummaryTable that contains the frequency table of the clusters and an entry ObjectName that contains the name of the output object as referenced in object_db

Note that when an analysis has already been performed it is simply retrieved from the workspace. In this case, the output does not contain the entry Object.
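The shape of the Object and of the frequency table can be mimicked with base R. This sketch assumes a HAC on the USArrests toy data cut into 4 clusters; the column name HACClusters follows the outputs shown in this chapter, but the code is not the one of r_cut_clustering.

```r
x <- scale(USArrests)
hc <- hclust(dist(x), method = "ward.D2")

# data.frame with the clustering in the first column; row names are the
# names of the individuals.
clusters <- data.frame(HACClusters = cutree(hc, k = 4))
head(clusters)

# Frequency table of the clusters
freq <- as.data.frame(table(cluster = clusters$HACClusters))
freq
```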

8.2.1 HAC

out_hac <- r_wrapp("r_clustering", datasetName = "proteins", method = "hac")
cut_hac <- r_cut_clustering("HACClustering_1", k = 12)

8.2.1.1 Dendrogram

cut_hac$Graphical$Dendrogram

8.2.1.2 Heatmap

cut_hac$Graphical$Heatmap

8.2.1.3 Dotplot

cut_hac$Graphical$Dotplot

8.2.1.4 Cluster view

out <- jsonlite::toJSON(cut_hac$Table$ClusterView, pretty = TRUE, 
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.1.5 Cluster summary (frequency table)

out <- jsonlite::toJSON(cut_hac$Table$ClusterSummary, pretty = TRUE, 
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.1.6 Clusters

head(cut_hac$Object$Clusters)

##      HACClusters
## A0SH           1
## A0SJ           2
## A0SK           3
## A0SO           4
## A04N           1
## A04P           4

that will be added to the workspace as

unlist(attributes(cut_hac$Object$Clusters)[list_attr])

##                 class           object_name                nature 
##          "data.frame"       "HACClusters_1"         "HACClusters" 
##                  type func_args.datasetName           func_args.k 
##              "hidden"     "HACClustering_1"                  "12" 
##        origin_dataset                  logt            normalized 
##            "proteins"                  "nc"                  "nc" 
##           parent_edge 
##    "HACCutClusters_1"

and can be used in combination with r_create_dataview

out <- jsonlite::toJSON(r_create_dataview(cut_hac$Object$Clusters, 1, 5, 1, 1),
                        pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.1.7 ObjectName

The object name is given as:

jsonview::json_tree_view(jsonlite::toJSON(cut_hac$Table$ObjectName, 
                                          pretty = TRUE, auto_unbox = TRUE), 
                         height='100px')

8.2.2 k-means

The number of clusters chosen by the user must lie between the values kmin and kmax passed to the function r_clustering.

out_kmeans <- r_wrapp("r_clustering", datasetName = "proteins", 
                      method = "kmeans", kmin = 2, kmax = 8, seed = 12)
cut_kmeans <- r_cut_clustering("KmeansCut_1")
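The constraint on the number of clusters, and the link between this number and the name of the k-means object, can be sketched as follows. kmeans_cut_name is a hypothetical helper; the numbering rule (index = k - kmin + 1) is inferred from the outputs shown in this chapter and assumes a workspace that contains no previous KmeansCut object.

```r
# Name of the KmeansCut object holding the clustering with k clusters,
# for a run of r_clustering over kmin:kmax in a fresh workspace (assumption).
kmeans_cut_name <- function(k, kmin, kmax) {
  if (k != round(k) || k < kmin || k > kmax) {
    stop("k must be an integer between kmin (", kmin, ") and kmax (", kmax, ")")
  }
  paste0("KmeansCut_", k - kmin + 1)
}

kmeans_cut_name(2, kmin = 2, kmax = 8)
# → "KmeansCut_1"
kmeans_cut_name(8, kmin = 2, kmax = 8)
# → "KmeansCut_7"
```

In practice the names should be read from $Table$ObjectName rather than recomputed, since the numbering depends on the objects already present in the workspace.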

8.2.2.1 Heatmap

cut_kmeans$Graphical$Heatmap

8.2.2.2 Dotplot

cut_kmeans$Graphical$Dotplot

8.2.2.3 Cluster view

out <- jsonlite::toJSON(cut_kmeans$Table$ClusterView, pretty = TRUE, 
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.2.4 Cluster summary (frequency table)

out <- jsonlite::toJSON(cut_kmeans$Table$ClusterSummary, pretty = TRUE,
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.2.5 Clusters

head(cut_kmeans$Object$Clusters)

##      KmeansClusters
## A0SH              2
## A0SJ              2
## A0SK              1
## A0SO              1
## A04N              2
## A04P              1

that will be added to the workspace as

unlist(attributes(cut_kmeans$Object$Clusters)[list_attr])

##                 class           object_name                nature 
##          "data.frame"    "KmeansClusters_1"      "KmeansClusters" 
##                  type func_args.datasetName        origin_dataset 
##              "hidden"         "KmeansCut_1"            "proteins" 
##                  logt            normalized           parent_edge 
##                  "nc"                  "nc" "KmeansCutClusters_1"

and can be used in combination with r_create_dataview

out <- jsonlite::toJSON(r_create_dataview(cut_kmeans$Object$Clusters, 
                                          1, 5, 1, 1),
                        pretty = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)

8.2.2.6 ObjectName

The object name is given as:

jsonview::json_tree_view(jsonlite::toJSON(cut_kmeans$Table$ObjectName, 
                                          pretty = TRUE, auto_unbox = TRUE), 
                         height='100px')

8.3 Extract a new dataset

The user can create a new dataset from the clusters found in HAC or k-means. This dataset is the one already described in “Identify clusters/Additional objects” and its attribute visible is simply updated to TRUE.

8.3.1 HAC

out_cuthac <- r_wrapp("r_cut_clustering", datasetName = "HACClustering_1", 
                      k = 4)
out_extract_hac <- r_extract_dataset("HACClusters_1")
names(out_extract_hac$Object)

## [1] "dataset"

added to R workspace as:

unlist(attributes(out_extract_hac$Object$dataset)[list_attr])

##                 class           object_name                nature 
##          "data.frame"       "HACClusters_1"         "HACClusters" 
##                  type func_args.datasetName           func_args.k 
##             "dataset"     "HACClustering_1"                   "4" 
##        origin_dataset                  logt            normalized 
##            "proteins"                  "nc"                  "nc" 
##           parent_edge 
##    "HACCutClusters_1"

8.3.2 k-means

out_cutkmeans <- r_wrapp("r_cut_clustering", datasetName = "KmeansCut_7")
out_extract_kmeans <- r_extract_dataset("KmeansClusters_1")
names(out_extract_kmeans$Object)

## [1] "dataset"

added to R workspace as:

unlist(attributes(out_extract_kmeans$Object$dataset)[list_attr])

##                 class           object_name                nature 
##          "data.frame"    "KmeansClusters_1"      "KmeansClusters" 
##                  type func_args.datasetName        origin_dataset 
##             "dataset"         "KmeansCut_7"            "proteins" 
##                  logt            normalized           parent_edge 
##                  "nc"                  "nc" "KmeansCutClusters_1"

Also note that the corresponding KmeansCut object now has its attribute type set to "analysis":

unlist(attributes(KmeansCut_7)[list_attr])

##                 class           object_name                nature 
##              "kmeans"         "KmeansCut_7"           "KmeansCut" 
##                  type func_args.datasetName      func_args.method 
##              "hidden"            "proteins"              "kmeans" 
##        func_args.kmin        func_args.kmax        func_args.seed 
##                   "8"                   "8"                  "12" 
##        origin_dataset                  logt            normalized 
##            "proteins"                  "nc"                  "nc" 
##           parent_edge 
##  "KmeansClustering_1"

8.4 User experience

The diagram below summarizes the entire workflow. It is important to consider the case where kmin is equal to kmax (only for $k$-means). In this case, both r_clustering and r_cut_clustering must be run at once and the user is shown only the results from r_cut_clustering.