Chapter 10 SOM: Self-Organizing Map

Two clustering methods (HAC and K-means) have already been implemented in Asterics. In line with this, this document presents a workflow for using Self-Organizing Maps (SOM) as a clustering and visualization method. The aim of a Self-Organizing Map is to reduce the dimension of the input space while preserving not the distances between individuals but rather their neighborhoods. This method is useful when the number of variables is large, because the data are then sparse in the high-dimensional input space.

10.1 Used datasets

For illustration purposes, we will focus on two datasets: mrna and clinical. The dataset that will be approximated by the SOM, mrnas, is extracted from mrna (its first 100 rows). Features in the clinical dataset will be used as supplementary variables.

input <- "../forge/backend/R/data/mrna.csv"
r_import(input, data.name = "mrna", row.names = 1, assign = TRUE)

## NULL

input <- "../forge/backend/R/data/clinical.csv"
r_import(input, data.name = "clinical", row.names = 1, assign = TRUE)

## NULL

mrnas <- mrna[1:100, ]

10.1.1 Required library

The main library used is SOMbrero, which implements the SOM algorithm for numeric and non-numeric data in R. The package description is available here.

10.2 Perform SOM

To begin, the r_som function is called to train the SOM by setting the following arguments:

topo: topology of the map ("square" or "hexagonal"). The default value is "hexagonal".
mapLength: length of the map. Defaults to \(\max(5,\min(10,\lceil\sqrt{n/10}\rceil))\), where n is the number of observations.
mapWidth: width of the map. Defaults to \(\max(5,\min(10,\lceil\sqrt{n/10}\rceil))\), where n is the number of observations.
seed: random seed that makes the results reproducible. If seed is not provided, it is set to a value that depends on the time.
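The default map size given by the formula above can be checked with a few lines of base R. This is only a sketch of the formula as stated in this document: the helper name default_map_side is hypothetical and is not part of Asterics or SOMbrero.

```r
# Hypothetical helper (not an Asterics function): default side length of the
# map, max(5, min(10, ceiling(sqrt(n / 10)))), with n the number of observations.
default_map_side <- function(n) {
  max(5, min(10, ceiling(sqrt(n / 10))))
}

side_100   <- default_map_side(100)    # small dataset: the lower bound applies
side_500   <- default_map_side(500)    # intermediate size
side_20000 <- default_map_side(20000)  # large dataset: the upper bound applies
print(c(side_100, side_500, side_20000))
```

With 100 observations the square root term is below 5, so the lower bound of 5 applies for both mapLength and mapWidth.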

Then, it returns a nested list including:

Table: the quality criteria table, the ratio correlation table, the SOM cluster view, the SOM cluster frequency table and the name of the output object
Graphical: a hitmap to visualize the trained SOM
Object: the SOM objects. When r_som is called through the r_wrapp function, two distinct entries are added to the global environment: SOMobj_id and SOMClusters_id. The first contains all the elements of Table and Graphical together with the SOM object, whilst the second provides only a data frame of the clusters.

As an example, we leave the map dimensions at their default values. Thus, we will train a \(5\times5\) hexagonal SOM on the mrnas dataset.

out_som <- r_som(datasetName = "mrnas", seed = 5)

10.2.1 Hitmap

The first visualization shows the frequency of cluster memberships on the map. It represents a descriptive plot of the number of observations in every neuron of the map. The size of each neuron is proportional to the number of observations classified in it.

out_som$Graphical$Hitmap

10.2.2 Topographic and quantization errors

Two quality measures are reported: the topographic error and the quantization error. Both help to assess the quality of the map. The quantization error assesses the quality of the clustering, while the topographic error measures how well the topology of the input observations is preserved on the map.

jsonview::json_tree_view(out_som$Table$Quality, scroll = TRUE)
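To make the two measures more concrete, here is a minimal base R sketch on toy data. It assumes common textbook definitions (mean squared distance to the best matching prototype, and the share of observations whose two closest prototypes are not adjacent on the map); the toy data, the square 3 x 3 grid and the adjacency rule are assumptions for illustration, and the values returned by r_som may be computed differently.

```r
set.seed(1)
# Toy inputs (hypothetical): 50 observations, 3 variables, a 3 x 3 square map
x <- matrix(rnorm(50 * 3), ncol = 3)
grid <- expand.grid(row = 1:3, col = 1:3)    # neuron positions on the map
protos <- x[sample(nrow(x), nrow(grid)), ]   # stand-in for trained prototypes

# Squared Euclidean distance from every observation to every prototype
sq_dist <- sapply(seq_len(nrow(protos)), function(k) {
  rowSums(sweep(x, 2, protos[k, ])^2)
})

# Best and second-best matching units of each observation
ranked <- t(apply(sq_dist, 1, order))
bmu1 <- ranked[, 1]
bmu2 <- ranked[, 2]

# Quantization error: mean squared distance to the best matching prototype
quantization_error <- mean(sq_dist[cbind(seq_len(nrow(x)), bmu1)])

# Topographic error: share of observations whose two closest prototypes are
# not neighbors on the map (assumption: neighbors are at Chebyshev distance 1)
map_dist <- as.matrix(dist(grid, method = "maximum"))
topographic_error <- mean(map_dist[cbind(bmu1, bmu2)] > 1)

print(c(quantization = quantization_error, topographic = topographic_error))
```

Lower values are better for both measures: a low quantization error means observations are close to their prototype, and a low topographic error means neighboring prototypes in the input space are also neighbors on the map.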

10.2.3 Correlation ratio

In addition, the 20 highest correlation ratios between the input variables and the clusters are shown. A correlation ratio measures the part of the variance of a variable that is explained by the clusters. It lies between 0 and 1, a value of 1 indicating that all the variability of the corresponding variable is explained by the clusters.

jsonview::json_tree_view(out_som$Table$CorRatio, scroll = TRUE)
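As an illustration of what a correlation ratio measures, the sketch below computes the ratio of the between-cluster sum of squares to the total sum of squares for one variable. The function name correlation_ratio and the toy vectors are hypothetical, and this is the usual definition rather than a confirmed description of how r_som computes its table.

```r
# Hypothetical helper: share of the variance of `values` explained by `clusters`
correlation_ratio <- function(values, clusters) {
  clusters <- factor(clusters)                       # drops unused levels
  grand_mean <- mean(values)
  group_means <- tapply(values, clusters, mean)
  group_sizes <- tapply(values, clusters, length)
  between_ss <- sum(group_sizes * (group_means - grand_mean)^2)
  total_ss <- sum((values - grand_mean)^2)
  between_ss / total_ss
}

clusters <- c("a", "a", "b", "b")
# Variable fully explained by the clusters: constant within each cluster
ratio_explained <- correlation_ratio(c(1, 1, 5, 5), clusters)
# Variable unrelated to the clusters: identical cluster means
ratio_unrelated <- correlation_ratio(c(1, 5, 1, 5), clusters)
print(c(ratio_explained, ratio_unrelated))
```

The first toy variable reaches the upper bound because it does not vary inside a cluster, while the second reaches the lower bound because both clusters have the same mean.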

10.2.4 Cluster view

The table ClusterView shows the clusters produced by the SOM. This table can be extracted by the user as a new dataset as explained in the last section.

jsonview::json_tree_view(out_som$Table$ClusterView, scroll = TRUE)

10.2.5 Cluster summary

The table ClusterSummary shows the number of observations within a given cluster.

jsonview::json_tree_view(out_som$Table$ClusterSummary, scroll = TRUE)

10.2.6 Object name

The entry ObjectName gives the names of both returned elements: the SOM object and the clusters.

out_som$Table$ObjectName

## $som
## [1] "SOMobj_1"
## 
## $clusters
## [1] "SOMClusters_1"

10.2.7 Extract a new data frame

A new data frame of the clusters produced by the SOM can be retrieved by the user. The second entry created when r_som is called through r_wrapp, SOMClusters_id (here SOMClusters_1), allows this extraction.

out_som_wrapped <- r_wrapp("r_som", datasetName = "mrnas", seed = 5)

out_extract <- r_extract_dataset("SOMClusters_1")
names(out_extract$Object)

## [1] "dataset"

Once the SOM is trained and its characteristics are computed, several plots help the user interpret the results, starting with the observations.

10.3 Explore observations

SOM is used for data clustering as well as data visualization. This section is dedicated to graphics that characterize the observations.

All plots with a variable argument can use either the dataset that was used to perform the SOM algorithm or another dataset (in all cases, the argument datasetNew is required, for an easier user experience).

10.3.1 Color

The color plot takes a single numeric variable.

It assigns a color to the mean of the chosen variable for the observations classified in each neuron.

Each neuron is drawn as either a rectangle or a hexagon, depending on the selected topology.

out_plot <- r_plotsomind("SOMobj_1", datasetNew = "mrnas", type = "color", variable = "APOD")
out_plot$Graphical$Plot
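The quantity behind the color plot can be sketched in base R: the mean of the variable is computed per neuron and then mapped to a color scale. The toy vectors, the five-color palette and the binning below are assumptions chosen for illustration, not the settings used by r_plotsomind.

```r
# Toy data (hypothetical): values of one variable and the neuron of each observation
expression <- c(2.1, 2.5, 7.0, 6.4, 6.9, 4.2)
neuron <- c(1, 1, 2, 2, 2, 3)

# Mean of the variable among the observations classified in each neuron
neuron_means <- tapply(expression, neuron, mean)

# Map each mean to one of five colors (assumed palette and number of bins)
palette <- heat.colors(5)
neuron_colors <- palette[cut(neuron_means, breaks = 5, labels = FALSE)]

print(neuron_means)
print(neuron_colors)
```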

10.3.2 Boxplot

The boxplot applies to one or several numeric variables.

It displays the distributions of these variables for observations classified in each neuron.

selected_vars <- c("APOD", "NTN4", "FADS2", "C4A", "MGP")

out_plot <- r_plotsomind("SOMobj_1", datasetNew = "mrnas", variable = selected_vars, type = "boxplot")
out_plot$Graphical$Plot

10.3.3 Names

This plot displays the observation names at random positions within their corresponding cluster. It is not available when the number of samples is higher than 200.

out_plot <- r_plotsomind("SOMobj_1", type = "names")

knitr::include_graphics(out_plot$Graphical$Plot$path)

10.3.4 Pie

The pie plot applies to a categorical variable. For example, the distribution of patient gender among the observations classified in each neuron of the map can be represented with a pie chart. This graphic is exported as a static image.

out_plot <- r_plotsomind("SOMobj_1", datasetNew = "clinical", 
                         variable = "patient.gender", type = "pie")
jsonview::json_tree_view(out_plot$Graphical$Plot, scroll = TRUE)

knitr::include_graphics(out_plot$Graphical$Plot$path)
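The numbers behind such pie charts are simply the category shares within each neuron. The base R sketch below shows this on toy data; the vectors gender and neuron are hypothetical and do not come from the clinical dataset.

```r
# Toy data (hypothetical): gender and neuron of eight patients
gender <- c("female", "male", "female", "female", "male", "male", "female", "male")
neuron <- c(1, 1, 1, 2, 2, 2, 2, 3)

# Contingency table: one row per neuron, one column per category
counts <- table(neuron, gender)

# Row-wise proportions: the slices of the pie drawn in each neuron
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
# pie(counts["2", ]) would draw the pie chart for neuron 2
```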

10.4 Explore prototypes

This section describes SOM visualizations that represent the prototypes instead of the average of the observations.

10.4.1 Color

out_plot <- r_plotsomproto("SOMobj_1", variable = "APOD", type = "color") 
out_plot$Graphical$Plot

10.4.2 Multidimensional scaling

Multidimensional scaling (MDS) plots a two-dimensional projection of the prototypes. It allows the user to check the distortion between the map and the organization of the prototypes in the input space.

out_plot <- r_plotsomproto("SOMobj_1", type = "mds")
out_plot$Graphical$Plot
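For readers unfamiliar with MDS, the sketch below applies classical multidimensional scaling with the base R function cmdscale to a toy prototype matrix. The random prototypes are a stand-in for those of a trained map, and the use of classical MDS on Euclidean distances is an assumption; the plot produced by r_plotsomproto may rely on a different variant.

```r
set.seed(1)
# Toy prototypes (hypothetical): 25 neurons described by 10 variables
protos <- matrix(rnorm(25 * 10), nrow = 25)

# Classical MDS: 2D coordinates that approximate the pairwise distances
proto_dist <- dist(protos)
coords <- cmdscale(proto_dist, k = 2)

# Rough check of the projection: correlation between the original distances
# and the distances between the projected points
fit <- cor(as.vector(proto_dist), as.vector(dist(coords)))

print(head(coords))
print(fit)
```

Plotting coords and labeling each point with its neuron number gives a view comparable in spirit to the MDS plot: prototypes that are close in the input space appear close in the projection.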

10.4.3 Polydist

This graphic explores the distances between each prototype and its neighboring prototypes in the input space. Polygons are used to display this information and, in addition, they are colored by the frequency of observations in the corresponding clusters. Distant polygons on the map correspond to distant prototypes for the corresponding clusters in the input space.

out_plot <- r_plotsomproto("SOMobj_1", type = "poly.dist")
out_plot$Graphical$Plot

The table below lists all the graphics available in each section.

10.5 Superclustering

The clusters found in the SOM are usually too small for a simple clustering purpose. This is why the map units can be grouped into larger classes called superclusters.

10.5.1 Overview

Superclustering is performed using the function r_super_cluster.

out_sc <- r_super_cluster("SOMobj_1")

out_sc_wrapped <- r_wrapp("r_super_cluster", "SOMobj_1")

10.5.1.1 Dendrogram

A dendrogram provides a first insight into the organization of the prototypes in hierarchical clusters.

out_sc$Graphical$Dendrogram
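The idea of a hierarchical clustering of prototypes can be sketched with the base R function hclust. The toy prototype matrix and the Ward linkage below are assumptions made for illustration; the distance and linkage used by r_super_cluster are not stated here and may differ.

```r
set.seed(1)
# Toy prototypes (hypothetical): 25 neurons described by 10 variables
protos <- matrix(rnorm(25 * 10), nrow = 25)

# Hierarchical clustering of the prototypes (assumed linkage: Ward)
hc <- hclust(dist(protos), method = "ward.D2")

# Merge heights: large jumps between successive merges suggest where to cut
merge_heights <- hc$height
print(tail(merge_heights, 5))
# plot(hc) would draw the dendrogram of the 25 prototypes
```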

10.5.1.2 Object name

The object name is:

out_sc$Table$ObjectName

## [1] "SOMobj_1"

10.5.2 Superclustering cut

Once k, the number of superclusters, is chosen, r_cut_supercluster cuts the hierarchical clustering (HC) of the prototypes at the corresponding number of clusters.

cut_supercluster <- r_cut_supercluster("SOMobj_1", 3)
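Cutting a hierarchical clustering into k groups and propagating the result to the observations can be sketched with the base R function cutree. The toy prototypes, the linkage and the vector obs_neuron are hypothetical; this only illustrates the principle, not the internals of r_cut_supercluster.

```r
set.seed(1)
# Toy prototypes (hypothetical): 25 neurons described by 10 variables
protos <- matrix(rnorm(25 * 10), nrow = 25)
hc <- hclust(dist(protos), method = "ward.D2")   # assumed linkage

# Cut the tree into k = 3 superclusters: one label per neuron
superclusters <- cutree(hc, k = 3)

# Each observation inherits the supercluster of the neuron it belongs to
obs_neuron <- c(1, 1, 4, 9, 9, 25)               # toy neuron memberships
obs_supercluster <- superclusters[obs_neuron]

print(table(superclusters))
print(obs_supercluster)
```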

The following plots display various information on the obtained clusters.

10.5.2.1 Dendrogram

cut_supercluster$Graphical$Dendrogram

10.5.2.2 Hitmap

plotly::ggplotly(cut_supercluster$Graphical$Hitmap)

10.5.2.3 Cluster view

jsonview::json_tree_view(cut_supercluster$Table$ClusterView, scroll = TRUE)

10.5.2.4 Cluster summary

jsonview::json_tree_view(cut_supercluster$Table$ClusterSummary, scroll = TRUE)

10.5.2.5 Object name