Chapter 8 Clustering
8.1 First step: choose the number of clusters
This document describes the clustering workflow as implemented in ASTERICS.
The r_clustering function, introduced in this document, is used as a first
step of the clustering analysis to help the user choose the number of
clusters. It takes the following arguments as inputs:
- datasetName: a numerical dataset
- method: the clustering method ("hac": HAC or "kmeans": K-means)
- kmin: the minimum number of clusters considered in the analysis. Used only for k-means and forced to 1 for HAC. When not provided for k-means, it is also silently set to 1.
- kmax: the maximum number of clusters considered in the analysis. Used only for k-means and forced to \(\frac{\text{Number of observations}}{5}\) for HAC. When not provided for k-means, it is also silently set to this value.
- seed: the random seed used to ensure reproducibility of the K-means analysis. This argument is intended for advanced users only: by default (seed is NULL), it is automatically set to a value that depends on the time.
The returned outputs differ according to the chosen method.
- HAC
  - Graphical: within-group variation plot, silhouette plot and dendrogram
  - Object: output of the function hclust (class is hclust)
  - Table: chosen broken stick value (passed directly in DirectBS) and name of the output object (ObjectName)
- k-means
  - Graphical: within-group variation plot and silhouette plot
  - Object: a global object with slots Graphical and Table (only when kmin != kmax) and the list of kmeans objects that have not been previously computed
  - Table: chosen broken stick value (passed directly in BS) and name of the output k-means objects (ObjectName)
Note that:
- when an analysis has already been performed, it is simply retrieved from the workspace. In this case, the output does not contain the entry Object.
- for K-means, seed is used as follows: when the seed is not provided by the user, the last clustering with the same criteria is retrieved. When it is provided, the clustering is retrieved or performed, depending on whether the corresponding analysis with the same seed already exists.
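The retrieval rule for the seed can be illustrated with a small cache. This is only a sketch in base R: the helper get_or_run_kmeans, the cache environment and the key format are hypothetical and are not part of ASTERICS, which stores its analyses in its own workspace.

```r
# Hypothetical stand-in for the workspace that stores past analyses
cache <- new.env()

# Hypothetical helper: retrieve a stored k-means run or perform a new one
get_or_run_kmeans <- function(x, name, k, seed = NULL) {
  prefix <- paste(name, k, sep = "|")
  last_key <- paste0(prefix, "|last")
  if (is.null(seed)) {
    # No seed supplied: reuse the most recent run with the same criteria
    last <- cache[[last_key]]
    if (!is.null(last)) return(list(fit = cache[[last]], retrieved = TRUE))
    seed <- as.integer(Sys.time()) %% 100000L  # time-dependent default
  }
  key <- paste(prefix, seed, sep = "|")
  retrieved <- !is.null(cache[[key]])
  if (!retrieved) {
    set.seed(seed)                      # makes the run reproducible
    cache[[key]] <- kmeans(x, centers = k)
  }
  cache[[last_key]] <- key              # remember the latest matching run
  list(fit = cache[[key]], retrieved = retrieved)
}

x <- scale(USArrests)  # illustrative built-in dataset, not the protein data
first  <- get_or_run_kmeans(x, "arrests", k = 3, seed = 12)  # performed
second <- get_or_run_kmeans(x, "arrests", k = 3, seed = 12)  # retrieved
third  <- get_or_run_kmeans(x, "arrests", k = 3)             # last run retrieved
```

In this sketch, the field retrieved plays the role of the missing Object entry: it tells the caller that nothing new was computed.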
We will use the protein dataset for illustration:
input <- "../forge/backend/R/data/protein.csv"
r_wrapp("r_import", input = input, data.name = "proteins", nature = "proteins",
        row.names = 1, sep = " ")
## NULL
8.1.1 Choose a clustering method
The available clustering methods are Hierarchical Agglomerative Clustering (HAC) and k-means.
- Hierarchical Agglomerative Clustering (HAC) is a technique that builds a nested series of partitions by successively merging clusters, starting from \(n\) individuals that each form their own cluster. ASTERICS forces the use of the Euclidean distance combined with Ward’s method, which at each step merges the pair of clusters that minimizes the increase in the within-cluster sum of squares.
Note to the user: This clustering algorithm can cope with a high dimension but does not scale very well with the number of observations.
hac_proteins <- r_clustering("proteins", method = "hac")
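For readers who want to see the underlying computation, here is a minimal sketch of Ward's HAC on Euclidean distances using only base R. It is not the ASTERICS implementation: the built-in USArrests dataset and the standardization step are illustrative assumptions.

```r
# Illustrative numeric dataset (assumption: standardized before clustering)
x <- scale(USArrests)

# Pairwise Euclidean distances between observations
d <- dist(x, method = "euclidean")

# Ward's criterion applied to (unsquared) Euclidean distances
tree <- hclust(d, method = "ward.D2")

class(tree)           # the object has class "hclust"
length(tree$height)   # one height per merger, i.e. n - 1 values
```

The tree stores the full sequence of mergers, so a single run is enough to explore every number of clusters afterwards.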
- k-means clustering assigns every observation to the cluster whose center of
gravity is the closest. It thus tries to minimize the within-group sum of
squares for a given number of clusters, which has to be chosen in advance. To
begin with, centers are randomly drawn and each point is assigned to its
closest center. The centers are then updated and the assignment is repeated until it stabilizes. By selecting method = "kmeans", r_clustering enables the user to run this algorithm over a chosen range of numbers of clusters, from kmin to kmax.
Note to the user: As opposed to HAC, this method is more appropriate for a large number of observations but it does not cope well with a high dimension.
Here, we run the method for a number of clusters ranging from 3 to 20:
kmeans_proteins <- r_clustering("proteins", method = "kmeans", kmin = 3,
                                kmax = 20)
Note: The computational time of the method can be large if the range contains too many values.
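Running k-means over a range of numbers of clusters can be sketched in base R as follows. The dataset, the range, the seed value and the use of nstart are illustrative assumptions and do not reflect the internal settings of r_clustering.

```r
x <- scale(USArrests)   # illustrative dataset, not the protein data
kmin <- 3
kmax <- 8

set.seed(12)            # fixed seed so that the runs are reproducible

# One k-means fit per candidate number of clusters
fits <- lapply(kmin:kmax, function(k) kmeans(x, centers = k, nstart = 10))
names(fits) <- paste0("k", kmin:kmax)

# Total within-cluster sum of squares of each fit
within_ss <- vapply(fits, function(f) f$tot.withinss, numeric(1))
within_ss
```

Each value in the range requires a separate fit, which is why the computational time grows with the width of the range.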
8.1.2 Evaluate cluster qualities
The result of the function consists of two plots, displaying the evolution of
the within-group variance and of the silhouette (a measure of how distinct the
clusters of a clustering are from each other) versus the number of clusters. In
addition, the optimal number of clusters suggested by the broken stick heuristic
is returned (restricted to values between kmin and kmax). A red dashed
vertical line shows this value in the plots.
8.1.2.1 Within-clusters evaluation
The goal is to identify a number of clusters that produces a small total within-cluster variation. The user can visually identify a trade-off between the number of clusters and the inertia.
HAC
hac_proteins$Graphical$Variation
k-means
kmeans_proteins$Graphical$Variation
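The quantity behind such a plot can be computed by hand. The sketch below, in base R, derives the total within-cluster sum of squares for each cut of a Ward tree. The helper within_ss and the dataset are illustrative assumptions, not ASTERICS code.

```r
x <- scale(USArrests)   # illustrative dataset
tree <- hclust(dist(x), method = "ward.D2")

# Hypothetical helper: total within-cluster sum of squares of a partition
within_ss <- function(data, cl) {
  groups <- split(as.data.frame(data), cl)
  sum(vapply(groups, function(g) {
    # squared deviations of each group from its own center of gravity
    sum(scale(g, center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
}

ks <- 1:10
wss <- vapply(ks, function(k) within_ss(x, cutree(tree, k = k)), numeric(1))
names(wss) <- ks
wss
```

Because the partitions of a tree are nested, the values can only decrease as the number of clusters grows. The useful information is therefore where the decrease levels off.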
8.1.2.2 Silhouette evaluation
Similarly, the plot should help identify a number of clusters that is a good tradeoff between separability of the clusters and number of clusters.
HAC
hac_proteins$Graphical$Silhouette
k-means
kmeans_proteins$Graphical$Silhouette
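As a reminder of what the silhouette measures, here is a base R sketch of the average silhouette width for several cuts of a Ward tree. The helper avg_silhouette is hypothetical and written for illustration. Singleton clusters receive a width of 0, which is a common convention, and ASTERICS may compute the criterion differently.

```r
# Hypothetical helper: average silhouette width of a partition (>= 2 clusters)
avg_silhouette <- function(d, cl) {
  dm <- as.matrix(d)
  groups <- unique(cl)
  widths <- vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)             # convention for singletons
    # a: mean distance to the other members of the same cluster
    a <- sum(dm[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of another cluster
    b <- min(vapply(setdiff(groups, cl[i]),
                    function(g) mean(dm[i, cl == g]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

x <- scale(USArrests)   # illustrative dataset
d <- dist(x)
tree <- hclust(d, method = "ward.D2")

ks <- 2:8
avg_sil <- vapply(ks, function(k) avg_silhouette(d, cutree(tree, k = k)),
                  numeric(1))
names(avg_sil) <- ks
avg_sil
```

Values close to 1 indicate well-separated clusters, while values near 0 or below indicate overlapping clusters.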
8.1.2.3 Broken stick
HAC
hac_proteins$Table$BS
## [1] 10
k-means
kmeans_proteins$Table$BS
## [1] 9
8.1.2.4 Dendrogram
r_clustering returns one more output for HAC: a dendrogram. It illustrates the
clusters produced at each step. The height represents the difference of
within-group inertia before and after the merger. A horizontal cut of the tree
at a chosen height defines a clustering. Hovering over the tree provides the
user with information about the clusterings. Finally, the number of clusters
chosen by broken stick is also displayed as a horizontal line.
hac_proteins$Graphical$Dendrogram
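The link between a horizontal cut and a number of clusters can be checked with base R. This sketch uses the height scale of hclust, which is not necessarily the one displayed by ASTERICS. The dataset and the value of k are illustrative.

```r
x <- scale(USArrests)   # illustrative dataset
tree <- hclust(dist(x), method = "ward.D2")

n <- nrow(x)
k <- 4

# Heights are stored in merge order: a cut placed between the (n - k)-th
# and the (n - k + 1)-th merger leaves exactly k clusters
h <- mean(tree$height[c(n - k, n - k + 1)])

by_k <- cutree(tree, k = k)   # clustering defined by a number of clusters
by_h <- cutree(tree, h = h)   # clustering defined by a cut height

all(by_k == by_h)             # both cuts define the same partition
```

Choosing a number of clusters and choosing a cut height are thus two ways of describing the same operation on the tree.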
8.1.2.5 Object name
In $Table, an entry $ObjectName is included that gives the name of the
global output object (HAC) or of all the output k-means objects (k-means), as
recorded in the object_db file. The last integer in the name is
the number of existing objects of the same nature + 1 (it is thus a unique
identifier).
out <- jsonlite::toJSON(hac_proteins$Table, pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, height='150px')
out <- jsonlite::toJSON(kmeans_proteins$Table, pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
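The naming rule (nature followed by the number of existing objects of that nature, plus one) can be sketched as follows. The helper next_object_name and its arguments are hypothetical: the way ASTERICS reads object_db is not shown here.

```r
# Hypothetical helper: build the name of a new object from the natures of
# the objects that already exist
next_object_name <- function(nature, existing_natures) {
  paste0(nature, "_", sum(existing_natures == nature) + 1L)
}

# No HAC clustering recorded yet: the first one gets index 1
next_object_name("HACClustering", character(0))
# → "HACClustering_1"

# Two KmeansCut objects already recorded: the next one gets index 3
next_object_name("KmeansCut", c("KmeansCut", "KmeansCut", "HACClustering"))
# → "KmeansCut_3"
```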
8.1.2.6 Additional objects
In addition to the multiple outputs already described, r_clustering returns
objects from hclust and kmeans that are assigned to the global environment
by r_wrapp and can be used (invisibly) by other functions.
names(hac_proteins$Object)
## [1] "HAC"
names(kmeans_proteins$Object)
## [1] "Kmeans" "KmeansCut_1" "KmeansCut_2" "KmeansCut_3" "KmeansCut_4"
## [6] "KmeansCut_5" "KmeansCut_6" "KmeansCut_7" "KmeansCut_8" "KmeansCut_9"
## [11] "KmeansCut_10" "KmeansCut_11" "KmeansCut_12" "KmeansCut_13" "KmeansCut_14"
## [16] "KmeansCut_15" "KmeansCut_16" "KmeansCut_17" "KmeansCut_18"
added to R workspace as:
list_attr <- c("class", "object_name", "nature", "type", "func_args",
               "origin_dataset", "logt", "normalized", "parent_edge")
unlist(attributes(hac_proteins$Object$HAC)[list_attr])
## class object_name nature
## "hclust" "HACClustering_1" "HACClustering"
## type func_args.datasetName func_args.method
## "analysis" "proteins" "hac"
## func_args.kmin func_args.kmax func_args.seed
## "1" "76" "1"
## origin_dataset logt normalized
## "proteins" "nc" "nc"
## parent_edge
## "proteins"
unlist(attributes(kmeans_proteins$Object$Kmeans)[list_attr])
## object_name nature type
## "KmeansClustering_1" "KmeansClustering" "analysis"
## func_args.datasetName func_args.method func_args.kmin
## "proteins" "kmeans" "3"
## func_args.kmax func_args.seed origin_dataset
## "20" "233" "proteins"
## logt normalized parent_edge
## "nc" "nc" "proteins"
unlist(attributes(kmeans_proteins$Object$KmeansCut_10)[list_attr])
## class object_name nature
## "kmeans" "KmeansCut_10" "KmeansCut"
## type func_args.datasetName func_args.method
## "hidden" "proteins" "kmeans"
## func_args.kmin func_args.kmax func_args.seed
## "12" "12" "233"
## origin_dataset logt normalized
## "proteins" "nc" "nc"
## parent_edge
## "KmeansClustering_1"
Note that, when they have never been explored with the function
r_cut_clustering, all objects of type KmeansCut are hidden in the DAG
(attribute type is "hidden").
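The mechanism of storing metadata as attributes of a fitted object can be reproduced with base R. In this sketch the attribute values are set by hand for illustration. In ASTERICS they are managed by the package functions.

```r
set.seed(12)
fit <- kmeans(scale(USArrests), centers = 3)   # illustrative k-means fit

# Attach metadata to the object (values chosen by hand for this sketch)
attr(fit, "object_name") <- "KmeansCut_1"
attr(fit, "nature") <- "KmeansCut"
attr(fit, "type") <- "hidden"

# Read back a chosen subset of attributes, as done in the chunks above
shown <- c("class", "object_name", "nature", "type")
unlist(attributes(fit)[shown])

# Updating one attribute does not alter the clustering itself
attr(fit, "type") <- "analysis"
```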
8.2 Second step: extract and explore clusters
Through these different analyses, the user can get an idea
beforehand of the number of clusters to choose. The function
r_cut_clustering works on the output object of r_clustering and provides
an exploratory analysis of the clustering for a chosen number of clusters.
Inputs:
- datasetName: for HAC, the name of the output object of the function r_clustering as described above (also given in the output $Table$ObjectName). For k-means, the name of the output object whose number of clusters corresponds to the chosen k
- k: number of clusters chosen by the user
Outputs:
- HAC
  - Graphical: a dotplot, a heatmap and a dendrogram
  - Object: a data.frame with the clustering in the first column. This object (which is added invisibly to the workspace) is intended to be displayed using the function r_create_dataview. It also has an attribute "Graphical" that contains the above-mentioned plots
  - Table: a DataView that displays the clusters (for every individual), a SummaryTable that contains the frequency table of the clusters and an entry ObjectName that contains the name of the output object as referenced in object_db
- K-means
  - Graphical: a dotplot and a heatmap
  - Object: a data.frame with the clustering in the first column. This object (which is added invisibly to the workspace) is intended to be displayed using the function r_create_dataview. It also has an attribute "Graphical" that contains the above-mentioned plots
  - Table: a DataView that displays the clusters (for every individual), a SummaryTable that contains the frequency table of the clusters and an entry ObjectName that contains the name of the output object as referenced in object_db
Note that when an analysis has already been performed, it is simply
retrieved from the workspace. In this case, the output does not contain the
entry Object.
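The two tabular outputs described above (one cluster label per individual and a frequency table of the clusters) can be sketched in base R. The column name HACClusters mirrors the output shown in this chapter. The dataset and the value of k are illustrative, and the actual structure returned by ASTERICS may differ.

```r
x <- scale(USArrests)   # illustrative dataset
tree <- hclust(dist(x), method = "ward.D2")

# data.frame with the clustering in its first (and only) column
clusters <- data.frame(HACClusters = cutree(tree, k = 4),
                       row.names = rownames(x))
head(clusters)

# Frequency table of the clusters
summary_table <- as.data.frame(table(cluster = clusters$HACClusters))
summary_table
```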
8.2.1 HAC
out_hac <- r_wrapp("r_clustering", datasetName = "proteins", method = "hac")
cut_hac <- r_cut_clustering("HACClustering_1", k = 12)
8.2.1.4 Cluster view
out <- jsonlite::toJSON(cut_hac$Table$ClusterView, pretty = TRUE,
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.2.1.5 Cluster summary (frequency table)
out <- jsonlite::toJSON(cut_hac$Table$ClusterSummary, pretty = TRUE,
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.2.1.6 Clusters
head(cut_hac$Object$Clusters)
## HACClusters
## A0SH 1
## A0SJ 2
## A0SK 3
## A0SO 4
## A04N 1
## A04P 4
that will be added to the workspace as
unlist(attributes(cut_hac$Object$Clusters)[list_attr])
## class object_name nature
## "data.frame" "HACClusters_1" "HACClusters"
## type func_args.datasetName func_args.k
## "hidden" "HACClustering_1" "12"
## origin_dataset logt normalized
## "proteins" "nc" "nc"
## parent_edge
## "HACCutClusters_1"
and can be used in combination with r_create_dataview
out <- jsonlite::toJSON(r_create_dataview(cut_hac$Object$Clusters, 1, 5, 1, 1),
                        pretty = TRUE, auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.2.2 k-means
The number of clusters chosen by the user must lie between the values
kmin and kmax passed to the function r_clustering.
out_kmeans <- r_wrapp("r_clustering", datasetName = "proteins",
                      method = "kmeans", kmin = 2, kmax = 8, seed = 12)
cut_kmeans <- r_cut_clustering("KmeansCut_1")
8.2.2.3 Cluster view
out <- jsonlite::toJSON(cut_kmeans$Table$ClusterView, pretty = TRUE,
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.2.2.4 Cluster summary (frequency table)
out <- jsonlite::toJSON(cut_kmeans$Table$ClusterSummary, pretty = TRUE,
                        auto_unbox = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.2.2.5 Clusters
head(cut_kmeans$Object$Clusters)
## KmeansClusters
## A0SH 2
## A0SJ 2
## A0SK 1
## A0SO 1
## A04N 2
## A04P 1
that will be added to the workspace as
unlist(attributes(cut_kmeans$Object$Clusters)[list_attr])
## class object_name nature
## "data.frame" "KmeansClusters_1" "KmeansClusters"
## type func_args.datasetName origin_dataset
## "hidden" "KmeansCut_1" "proteins"
## logt normalized parent_edge
## "nc" "nc" "KmeansCutClusters_1"
and can be used in combination with r_create_dataview
out <- jsonlite::toJSON(r_create_dataview(cut_kmeans$Object$Clusters,
                                          1, 5, 1, 1),
                        pretty = TRUE)
jsonview::json_tree_view(out, scroll = TRUE)
8.3 Extract a new dataset
The user can create a new dataset from the clusters found in HAC or k-means.
This dataset is the one already described in “Identify clusters/Additional
objects” and its attribute visible is simply updated to TRUE.
8.3.1 HAC
out_cuthac <- r_wrapp("r_cut_clustering", datasetName = "HACClustering_1",
                      k = 4)
out_extract_hac <- r_extract_dataset("HACClusters_1")
names(out_extract_hac$Object)
## [1] "dataset"
added to R workspace as:
unlist(attributes(out_extract_hac$Object$dataset)[list_attr])
## class object_name nature
## "data.frame" "HACClusters_1" "HACClusters"
## type func_args.datasetName func_args.k
## "dataset" "HACClustering_1" "4"
## origin_dataset logt normalized
## "proteins" "nc" "nc"
## parent_edge
## "HACCutClusters_1"
8.3.2 k-means
out_cutkmeans <- r_wrapp("r_cut_clustering", datasetName = "KmeansCut_7")
out_extract_kmeans <- r_extract_dataset("KmeansClusters_1")
names(out_extract_kmeans$Object)
## [1] "dataset"
added to R workspace as:
unlist(attributes(out_extract_kmeans$Object$dataset)[list_attr])
## class object_name nature
## "data.frame" "KmeansClusters_1" "KmeansClusters"
## type func_args.datasetName origin_dataset
## "dataset" "KmeansCut_7" "proteins"
## logt normalized parent_edge
## "nc" "nc" "KmeansCutClusters_1"
Also note that the corresponding KmeansCut object now has an attribute
type set to "analysis":
unlist(attributes(KmeansCut_7)[list_attr])
## class object_name nature
## "kmeans" "KmeansCut_7" "KmeansCut"
## type func_args.datasetName func_args.method
## "hidden" "proteins" "kmeans"
## func_args.kmin func_args.kmax func_args.seed
## "8" "8" "12"
## origin_dataset logt normalized
## "proteins" "nc" "nc"
## parent_edge
## "KmeansClustering_1"
8.4 User experience
The diagram below summarizes the entire workflow. It is important to consider
the case where kmin is equal to kmax (only for \(k\)-means). In
this case, both r_clustering and r_cut_clustering must be run at once and
only the results from r_cut_clustering are displayed to the user.