Chapter 10 SOM: Self-Organizing Map
Two clustering methods (HAC and K-means) have already been implemented in Asterics. In line with this, this document presents a workflow for using Self-Organizing Maps (SOM) as a clustering and visualization method. The aim of a SOM is to reduce the dimension of the input space while preserving the neighborhoods of the individuals rather than the distances between them. This method is useful when the number of variables is large, because the data are then sparse in the input space.
10.1 Used datasets
For illustration purposes, we will focus on two datasets: mrna and clinical. The dataset that will be approximated by the SOM, mrnas, is extracted from mrna (its first 100 rows). Features in the clinical dataset will be used as supplementary variables.
input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)
## NULL
input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)
## NULL
mrnas <- mrna[1:100, ]
10.1.1 Required library
The main library used is SOMbrero, which implements the SOM algorithm for numeric and non-numeric data in R. The package description is available here.
10.2 Perform SOM
To begin, the r_som function is called to train the SOM by setting the following arguments:
topo: topology ("square" or "hexagonal"). Default value is "hexagonal".
mapLength: length of the map. Default value is \(\max(5,\min(10,\lceil{\sqrt{n/10}}\rceil))\), with n the number of observations.
mapWidth: width of the map. Default value is \(\max(5,\min(10,\lceil{\sqrt{n/10}}\rceil))\), with n the number of observations.
seed: random number which allows reproducible results. If seed is not provided, it is set to a value that depends on the time.
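The default map size can be checked by hand. The sketch below only reproduces the formula stated above in base R; default_map_side is a hypothetical helper introduced for illustration and is not part of Asterics or SOMbrero.

```r
# Hypothetical helper: reproduces the documented default for mapLength / mapWidth.
default_map_side <- function(n) {
  max(5, min(10, ceiling(sqrt(n / 10))))
}

default_map_side(100)   # mrnas has 100 observations: 5
default_map_side(5000)  # large datasets are capped at 10
```

For small datasets the lower bound of 5 applies, which is why the example in this chapter ends up with a \(5\times5\) map.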
The function returns a nested list including:
Table: the quality criteria table, the correlation ratio table, the SOM cluster view, the SOM cluster frequency table and the name of the output object.
Graphical: a hitmap to visualize the trained SOM.
Object: the SOM objects.
When called by the r_wrapp function, two distinct entries are added to the global environment: SOMobj_id and SOMClusters_id. The first one contains all the elements of Table, Graphical and the SOM object, whilst the second provides only a data frame of the clusters.
As an example, we leave the map size at its default values. Thus, we will train a \(5\times5\) hexagonal SOM on the mrnas dataset.
out_som <- r_som(datasetName = "mrnas", seed = 5)
10.2.1 Hitmap
The first visualization shows the frequency of cluster memberships on the map. It represents a descriptive plot of the number of observations in every neuron of the map. The size of each neuron is proportional to the number of observations classified in it.
out_som$Graphical$Hitmap
10.2.2 Topographic and quantization errors
The quality measures introduced are topographic error and quantization error. Both help to assess the quality of the map. The quantization error assesses the quality of the clustering while the topographic error measures the topology preservation of input observations on the map.
jsonview::json_tree_view(out_som$Table$Quality, scroll = TRUE)
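To make the two criteria concrete, here is a minimal base R sketch on made-up data. It assumes one common definition of each measure: the quantization error is the mean squared distance between observations and the prototype of their best matching unit (BMU), and the topographic error is the share of observations whose first and second BMUs are not adjacent on the map. The toy 3 x 3 square grid, the adjacency rule and all variable names are assumptions for illustration; the exact computation used by r_som may differ.

```r
set.seed(1)
# Toy data: 50 observations, 3 variables (made-up values, not the mrnas data).
x <- matrix(rnorm(150), ncol = 3)
# Hypothetical 3 x 3 square map: grid coordinates and one prototype per neuron.
grid <- as.matrix(expand.grid(row = 1:3, col = 1:3))
protos <- x[sample(nrow(x), nrow(grid)), ]

# Squared Euclidean distance between every observation and every prototype.
d2 <- outer(rowSums(x^2), rowSums(protos^2), "+") - 2 * x %*% t(protos)
d2 <- pmax(d2, 0)  # guard against tiny negative values due to rounding

# First and second best matching units (BMU) of each observation.
ranked <- t(apply(d2, 1, order))
bmu1 <- ranked[, 1]
bmu2 <- ranked[, 2]

# Quantization error: mean squared distance to the BMU prototype.
quantization_error <- mean(d2[cbind(seq_len(nrow(x)), bmu1)])

# Topographic error: share of observations whose two BMUs are not adjacent
# (assumption: adjacent means both grid coordinates differ by at most 1).
grid_gap <- apply(abs(grid[bmu1, , drop = FALSE] - grid[bmu2, , drop = FALSE]), 1, max)
topographic_error <- mean(grid_gap > 1)

c(quantization = quantization_error, topographic = topographic_error)
```

A low quantization error means observations lie close to their prototypes; a low topographic error means neurons that are close in the input space are also close on the map.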
10.2.3 Correlation ratio
In addition, the 20 highest correlation ratios between input variables and clusters are shown. A correlation ratio measures the part of the variance of a variable that is explained by the clusters. It lies between 0 and 1, a value of 1 indicating that all the variability of the corresponding variable is explained by the clusters.
jsonview::json_tree_view(out_som$Table$CorRatio, scroll = TRUE)
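As an illustration, the sketch below computes a correlation ratio in base R, assuming the usual definition (between-cluster sum of squares divided by total sum of squares). The function correlation_ratio and the toy vectors are hypothetical and are not taken from Asterics.

```r
# Correlation ratio of a numeric variable with respect to a clustering:
# between-cluster sum of squares divided by total sum of squares.
correlation_ratio <- function(values, clusters) {
  clusters <- factor(clusters)
  grand_mean <- mean(values)
  cluster_means <- tapply(values, clusters, mean)
  cluster_sizes <- tapply(values, clusters, length)
  between_ss <- sum(cluster_sizes * (cluster_means - grand_mean)^2)
  total_ss <- sum((values - grand_mean)^2)
  between_ss / total_ss
}

clusters <- c(1, 1, 2, 2, 3, 3)
correlation_ratio(c(1, 1, 5, 5, 9, 9), clusters)  # clusters explain everything: 1
correlation_ratio(c(1, 9, 1, 9, 1, 9), clusters)  # clusters explain nothing: 0
```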
10.2.4 Cluster view
The table ClusterView shows the clusters produced by the SOM. This table can be extracted by the user as a new dataset, as explained in Section 10.2.7.
jsonview::json_tree_view(out_som$Table$ClusterView, scroll = TRUE)
10.2.5 Cluster summary
The table ClusterSummary shows the number of observations within each cluster.
jsonview::json_tree_view(out_som$Table$ClusterSummary, scroll = TRUE)
10.2.6 Object name
The entry ObjectName gives the names of both returned elements: the SOM object and the clusters.
out_som$Table$ObjectName
## $som
## [1] "SOMobj_1"
##
## $clusters
## [1] "SOMClusters_1"
10.2.7 Extract a new data frame
A new data frame of the clusters produced by the SOM can be retrieved by the user.
The second entry, SOMClusters_id (here SOMClusters_1), created when r_wrapp is called, allows this extraction.
out_som_wrapped <- r_wrapp("r_som", datasetName = "mrnas", seed = 5)
out_extract <- r_extract_dataset("SOMClusters_1")
names(out_extract$Object)
## [1] "dataset"
Once the SOM has been trained and its characteristics computed, visual explorations help the user interpret the results, starting with the observations.
10.3 Explore observations
SOM is used for data clustering as well as data visualization. This section is dedicated to graphics that characterize the observations.
All plots with a variable argument can use either the dataset that was used to train the SOM or another dataset. In all cases, the argument datasetNew is required, for an easier user experience.
10.3.1 Color
The color plot takes a single numeric variable.
It assigns to each neuron a color that represents the mean of the chosen variable over the observations classified in that neuron.
Each neuron is drawn as a rectangle or a hexagon, depending on the selected topology.
out_plot <- r_plotsomind("SOMobj_1", datasetNew = "mrnas", type = "color", variable = "APOD")
out_plot$Graphical$Plot
10.3.2 Boxplot
The boxplot applies to one or several numeric variables.
It displays the distributions of these variables for observations classified in each neuron.
selected_vars <- c("APOD", "NTN4", "FADS2", "C4A", "MGP")
out_plot <- r_plotsomind("SOMobj_1", datasetNew = "mrnas", variable = selected_vars, type = "boxplot")
out_plot$Graphical$Plot
10.3.3 Names
This plot randomly displays the observation names in their corresponding cluster. It is not available when the number of samples is higher than 200.
out_plot <- r_plotsomind("SOMobj_1", type = "names")
knitr::include_graphics(out_plot$Graphical$Plot$path)
10.3.4 Pie
For example, the distribution of patient gender among the observations classified in each neuron of the map can be represented with a pie chart. This graphic is exported as a static image.
out_plot <- r_plotsomind("SOMobj_1", datasetNew = "clinical",
                         variable = "patient.gender", type = "pie")
jsonview::json_tree_view(out_plot$Graphical$Plot, scroll = TRUE)
knitr::include_graphics(out_plot$Graphical$Plot$path)
10.4 Explore prototypes
This section describes SOM visualizations that represent the prototypes instead of the average of the observations.
10.4.1 Color
out_plot <- r_plotsomproto("SOMobj_1", variable = "APOD", type = "color")
out_plot$Graphical$Plot
10.4.2 Multidimensional scaling
Multidimensional scaling (MDS) plots a two-dimensional projection of the prototypes. It allows the user to check the distortion between the map and the organization of the prototypes in the input space.
out_plot <- r_plotsomproto("SOMobj_1", type = "mds")
out_plot$Graphical$Plot
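The idea behind this projection can be sketched in base R with classical MDS (stats::cmdscale) applied to the distances between prototypes. The prototypes below are made-up values, and the use of classical MDS is an assumption: the method actually applied by r_plotsomproto is not stated here.

```r
set.seed(2)
# Hypothetical prototypes of a 3 x 3 map: 9 neurons, 4 variables (made-up values).
protos <- matrix(rnorm(36), nrow = 9,
                 dimnames = list(paste0("neuron_", 1:9), NULL))

# Classical multidimensional scaling of the distances between prototypes.
proto_dist <- dist(protos)
coords <- cmdscale(proto_dist, k = 2)

# Each neuron gets a 2-D position; neurons close on the plot have similar prototypes.
plot(coords, type = "n", xlab = "MDS axis 1", ylab = "MDS axis 2")
text(coords, labels = rownames(coords))
```

If the map preserves the topology well, neurons that are neighbors on the grid should also appear close to each other in this projection.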
10.4.3 Polydist
This graphic shows a way to explore the distances between each prototype and its neighboring prototypes in the input space. Polygons are used to display this information and, in addition, they are colored by observation frequency in the corresponding clusters. Distant polygons on the map correspond to distant prototypes for the corresponding clusters in the input space.
out_plot <- r_plotsomproto("SOMobj_1", type = "poly.dist")
out_plot$Graphical$Plot
The table below lists all the graphics available in each section.
10.5 Superclustering
The clusters found by the SOM are usually too small for a simple clustering purpose. This is why the map units can be grouped into larger classes called superclusters.
10.5.1 Overview
Superclustering is performed using the function r_super_cluster.
out_sc <- r_super_cluster("SOMobj_1")
out_sc_wrapped <- r_wrapp("r_super_cluster", "SOMobj_1")
10.5.2 Superclustering cut
Once k, the number of superclusters, has been chosen, r_cut_supercluster cuts the hierarchical clustering (HC) of the prototypes at the corresponding number of clusters.
cut_supercluster <- r_cut_supercluster("SOMobj_1", 3)
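The principle of cutting a hierarchical clustering of prototypes can be sketched in base R with hclust and cutree. Everything below is illustrative: the prototypes and neuron assignments are made-up values, and the Ward linkage is an assumption rather than a documented setting of r_super_cluster.

```r
set.seed(3)
# Hypothetical prototypes of a 5 x 5 map (25 neurons, 4 variables) and the neuron
# to which each of 100 observations was assigned (made-up values).
protos <- matrix(rnorm(100), nrow = 25)
obs_neuron <- sample(1:25, 100, replace = TRUE)

# Hierarchical clustering of the prototypes (Ward linkage is an assumption here).
hc <- hclust(dist(protos), method = "ward.D")

# Cut the tree into k = 3 superclusters: one label per neuron...
k <- 3
neuron_supercluster <- cutree(hc, k = k)
# ...then propagate the label of each neuron to the observations it contains.
obs_supercluster <- neuron_supercluster[obs_neuron]

table(obs_supercluster)
```

Because the tree is built on the prototypes, each observation inherits the supercluster of the neuron in which it is classified.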
The following plots display various information on the obtained clusters.
10.5.2.4 Cluster summary
jsonview::json_tree_view(cut_supercluster$Table$ClusterSummary, scroll = TRUE)
10.5.2.5 Object name
The object name is:
cut_supercluster$Table$ObjectName
## [1] "SuperClusters_1"
10.5.2.6 Extract superclusters
Again, a new data.frame providing the grouping of all individuals into superclusters can be retrieved.
out_cutsupercluster_wrapped <- r_wrapp("r_cut_supercluster", "SOMobj_1", 3)
out_extract <- r_extract_dataset("SuperClusters_1")
attr(out_extract$Object$Clusters, "type")
## NULL