To run the SOM algorithm, you have to load the package called SOMbrero. The
function used to run it is called trainSOM() and is detailed below.
This documentation only considers the case of dissimilarity matrices.
The trainSOM function has several arguments, but only the first one is
required. This argument is x.data, the data set used to train the SOM. In this
documentation, it is passed to the function as a matrix or a data frame. It
must be a dissimilarity matrix, i.e., a symmetric matrix of non-negative
numbers with zero entries on the diagonal.
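The three conditions above can be checked before training. The following is a minimal base R sketch; is_dissimilarity is a hypothetical helper written for this illustration and is not part of SOMbrero.

```r
# Hypothetical helper (not part of SOMbrero): checks that the input is a
# square, symmetric, non-negative matrix with a zero diagonal
is_dissimilarity <- function(d, tol = 1e-8) {
  d <- as.matrix(d)
  is.numeric(d) &&
    nrow(d) == ncol(d) &&
    isSymmetric(unname(d), tol = tol) &&
    all(d >= 0) &&
    all(abs(diag(d)) <= tol)
}

# A valid dissimilarity matrix built from three points on a line
d.ok <- as.matrix(dist(c(1, 4, 9)))
is_dissimilarity(d.ok)

# Breaking the symmetry makes the check fail
d.bad <- d.ok
d.bad[1, 2] <- 7
is_dissimilarity(d.bad)
```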
The other arguments are the same as the arguments passed to the initSOM
function (they are parameters defining the algorithm; see help(initSOM) for
further details).
The trainSOM function returns an object of class somRes (see help(trainSOM)
for further details on this class).
The following table indicates which graphics are available for a relational SOM.
Type | Energy | Obs | Prototypes | Add | Super Cluster |
---|---|---|---|---|---|
no type | x | ||||
hitmap | x | x | |||
color | x | ||||
lines | x | x | x2 | ||
barplot | x | x | x2 | ||
radar | x | x | x2 | ||
pie | x | x2 | |||
boxplot | x | ||||
3d | |||||
poly.dist | x | x | |||
umatrix | x | ||||
smooth.dist | x | ||||
words | x | ||||
names | x | x | |||
graph | x | x | |||
mds | x | x | |||
grid.dist | x | ||||
grid | x | ||||
dendrogram | x | ||||
dendro3d | x |
In the “Super Cluster” column, a plot marked by “x2” means it is available for both data set variables and additional variables.
lesmis data set
The lesmis data set provides the coappearance graph of the characters of the
novel Les Misérables (Victor Hugo). Each vertex stands for a character whose
name is given by the vertex label. An edge means that the corresponding two
characters appear in a common chapter of the book. Each edge also has a value
indicating the number of coappearances. The lesmis data contain two objects:
the first one, lesmis, is an igraph object (see the igraph web page) with 77
nodes and 254 edges. Further information on this data set is provided with
help(lesmis).
data(lesmis)
lesmis
## IGRAPH U--- 77 254 --
## + attr: layout (g/n), id (v/n), label (v/c), value (e/n)
plot(lesmis, vertex.size=0)
The dissim.lesmis object is a matrix with entries equal to the length of the
shortest path between two characters (obtained with the function
shortest.paths of package igraph). Note that its row and column names have been
set to the characters' names to ease the use of the graphical functions of
SOMbrero.
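To illustrate what such a shortest-path matrix looks like, here is a small base R sketch on a hypothetical four-vertex path graph (a - b - c - d). It uses a plain Floyd-Warshall loop purely for illustration; it is not how dissim.lesmis was built, since igraph computes these lengths directly.

```r
# Hypothetical toy graph: the path a - b - c - d (undirected, unweighted)
v <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(v, v))
adj["a", "b"] <- 1
adj["b", "c"] <- 1
adj["c", "d"] <- 1
adj <- adj + t(adj)

# Floyd-Warshall: length of the shortest path between every pair of vertices
d <- ifelse(adj > 0, 1, Inf)
diag(d) <- 0
for (k in seq_len(nrow(d))) {
  d <- pmin(d, outer(d[, k], d[k, ], "+"))
}
d
```

The result is symmetric, has zeros on the diagonal, and carries the vertex names as row and column names, which is the form expected for a relational SOM input.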
set.seed(7383)
mis.som <- trainSOM(x.data=dissim.lesmis, type="relational", nb.save=10)
plot(mis.som, what="energy")
The dissimilarity matrix dissim.lesmis is passed to the trainSOM function as
input. Because intermediate states of the SOM were saved during training
(nb.save=10), the evolution of the energy can be plotted: it stabilizes over
the last 100 iterations.
The clustering component provides the cluster (neuron) assigned to each of the
77 characters. The table function is a simple way to view the distribution of
the observations on the map.
mis.som$clustering
## Myriel Napoleon MlleBaptistine MmeMagloire
## 25 25 20 20
## CountessDeLo Geborand Champtercier Cravatte
## 25 25 24 25
## Count OldMan Labarre Valjean
## 24 25 11 11
## Marguerite MmeDeR Isabeau Gervais
## 16 11 11 11
## Tholomyes Listolier Fameuil Blacheville
## 17 22 22 22
## Favourite Dahlia Zephine Fantine
## 22 22 21 16
## MmeThenardier Thenardier Cosette Javert
## 7 3 12 7
## Fauchelevent Bamatabois Perpetue Simplice
## 11 1 16 16
## Scaufflaire Woman1 Judge Champmathieu
## 11 11 1 1
## Brevet Chenildieu Cochepaille Pontmercy
## 1 1 1 8
## Boulatruelle Eponine Anzelma Woman2
## 3 3 3 12
## MotherInnocent Gribier Jondrette MmeBurgon
## 11 11 5 5
## Gavroche Gillenormand Magnon MlleGillenormand
## 5 14 7 18
## MmePontmercy MlleVaubois LtGillenormand Marius
## 13 18 13 14
## BaronessT Mabeuf Enjolras Combeferre
## 14 9 10 10
## Prouvaire Feuilly Courfeyrac Bahorel
## 10 10 10 10
## Bossuet Joly Grantaire MotherPlutarch
## 15 10 10 9
## Gueulemer Babet Claquesous Montparnasse
## 3 4 4 4
## Toussaint Child1 Child2 Brujon
## 11 5 5 4
## MmeHucheloup
## 5
table(mis.som$clustering)
##
## 1 3 4 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 24 25
## 6 5 4 6 3 1 2 8 11 2 2 3 1 4 1 2 2 1 5 2 6
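Because the clustering is a named vector, the members of each neuron can also be listed by name. The sketch below uses only base R and a handful of assignments copied from the output above, so that it runs without the fitted map; with the real object, mis.som$clustering would replace the hand-made vector.

```r
# A few assignments copied from the mis.som$clustering output above
clustering <- c(Myriel = 25, Napoleon = 25, MlleBaptistine = 20,
                MmeMagloire = 20, Valjean = 11, Fantine = 16,
                Marguerite = 16, Javert = 7, MmeThenardier = 7)

# Number of characters per neuron (only non-empty neurons appear)
table(clustering)

# Characters grouped by neuron; list elements are named by neuron number
members <- split(names(clustering), clustering)
members[["25"]]
```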
plot(mis.som)
The clustering can be displayed using the plot function with type="names":
plot(mis.som, what="obs", type="names")
or by superimposing the original igraph object on the map:
plot(mis.som, what="add", type="graph", var=lesmis)
Cluster profile overviews can be plotted with lines, barplot or radar.
plot(mis.som, what="prototypes", type="lines")
plot(mis.som, what="prototypes", type="barplot")
plot(mis.som, what="prototypes", type="radar")
On these graphics, one variable is represented respectively with a point, a bar or a slice. It is therefore easy to see which variable affects which cluster.
To see how different the clusters are, some graphics show the distances between prototypes. These graphics behave exactly as in the other SOM types.
"poly.dist" represents the distances between neighboring prototypes with polygons plotted for each cell of the grid. The smaller the distance between a polygon's vertex and a cell border, the closer the pair of prototypes. The colors indicate the number of observations in the neuron (white is used for empty neurons);
"umatrix" fills the neurons of the grid using colors that represent the average distance between the current prototype and its neighbors;
"smooth.dist" plots the mean distance between the current prototype and its neighbors with a color gradation;
"mds" plots the number of the neuron on a map according to a Multi Dimensional Scaling (MDS) projection;
"grid.dist" plots a point for each pair of prototypes, with x coordinates representing the distance between the prototypes in the input space, and y coordinates representing the distance between the corresponding neurons on the grid.
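The idea behind "grid.dist" can be reproduced by hand. The following base R sketch assumes a hypothetical 5 x 5 square grid and uses random stand-in prototypes, so the values carry no meaning; it only shows how one point per pair of prototypes is obtained. It is not SOMbrero's internal code.

```r
# Hypothetical 5 x 5 square grid: coordinates of the 25 neurons
grid.coord <- expand.grid(x = 1:5, y = 1:5)
grid.d <- as.matrix(dist(grid.coord))     # distances on the grid

# Stand-in prototypes (random values, for illustration only)
set.seed(1)
proto <- matrix(rnorm(25 * 3), ncol = 3)
proto.d <- as.matrix(dist(proto))         # distances in the input space

# One point per pair of prototypes: input-space distance against grid distance
pair.id <- upper.tri(grid.d)
plot(proto.d[pair.id], grid.d[pair.id],
     xlab = "distance between prototypes",
     ylab = "distance between neurons on the grid")
```

On a well-organized map, pairs that are close on the grid are expected to be close in the input space as well, so the cloud of points should show an increasing trend.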
plot(mis.som, what="prototypes", type="poly.dist", print.title=TRUE)
plot(mis.som, what="prototypes", type="smooth.dist")
plot(mis.som, what="prototypes", type="umatrix", print.title=TRUE)
plot(mis.som, what="prototypes", type="mds")
plot(mis.som, what="prototypes", type="grid.dist")
Here we can see that the prototypes 21 and 22 are far from the others.
Finally, the clustering can be displayed directly on the original graph:
plot(lesmis, vertex.label.color=rainbow(25)[mis.som$clustering], vertex.size=0)
legend(x="left", legend=1:25, col=rainbow(25), pch=19)
We can see that cluster 25 is very relevant to the story: as the characters of
this cluster appear only in the sub-story of Bishop Myriel, he is the only
connection for all the other characters of cluster 25. The same kind of
conclusion holds for cluster 1, among others. Most of the other clusters have a
small number of observations: it thus seems relevant to compute super clusters.
As the number of clusters produced by the SOM algorithm is quite large, it is possible to group them with a hierarchical clustering. First, let us have an overview of the dendrogram:
plot(superClass(mis.som))
## Warning: Impossible to plot the rectangles: no super clusters.
According to the proportion of variance explained by super clusters, 6 groups seem to be a good choice.
sc.mis <- superClass(mis.som, k=6)
summary(sc.mis)
##
## SOM Super Classes
## Initial number of clusters : 25
## Number of super clusters : 6
##
##
## Frequency table
## 1 2 3 4 5 6
## 3 5 4 4 4 5
##
## Clustering
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## 1 2 2 2 3 1 2 2 3 3 1 4 4 4 3 5 5 4 6 6 5 5 6 6 6
##
##
## ANOVA
## F : 9.712
## Degrees of freedom : 5
## p-value : 4.249e-07
## significativity : ***
table(sc.mis$cluster)
##
## 1 2 3 4 5 6
## 3 5 4 4 4 5
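The general idea of cutting a hierarchical clustering into a fixed number of groups can be sketched in base R. The example below assumes 25 random stand-in prototypes and Ward linkage; both are assumptions made for illustration and do not reproduce what superClass does internally.

```r
# Hypothetical prototypes: 25 points in 2 dimensions (stand-in values only)
set.seed(42)
proto <- matrix(rnorm(25 * 2), ncol = 2)

# Hierarchical clustering of the prototypes (Ward linkage is an assumption)
tree <- hclust(dist(proto), method = "ward.D2")

# Cut the dendrogram into 6 groups, mirroring k=6 above
super <- cutree(tree, k = 6)

# Number of prototypes in each group
table(super)
```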
plot(sc.mis)
plot(sc.mis, type="grid", plot.legend=TRUE)
plot(sc.mis, type="lines", print.title=TRUE)
plot(sc.mis, type="mds", plot.legend=TRUE)
plot(sc.mis, type="dendro3d")
library(RColorBrewer)
plot(lesmis, vertex.size=0, vertex.label.color=
brewer.pal(6, "Set2")[sc.mis$cluster[mis.som$clustering]])
legend(x="left", legend=paste("SC",1:6), col=brewer.pal(6, "Set2"), pch=19)
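The indexing sc.mis$cluster[mis.som$clustering] used above chains two lookups: each character is mapped to its neuron, and each neuron to its super cluster. The sketch below replays this with values copied from the outputs shown earlier, using plain vectors in place of the fitted objects.

```r
# Super cluster of each of the 25 neurons, copied from the summary above
sc <- c(1, 2, 2, 2, 3, 1, 2, 2, 3, 3, 1, 4, 4,
        4, 3, 5, 5, 4, 6, 6, 5, 5, 6, 6, 6)

# Neuron of a few characters, copied from the mis.som$clustering output
neuron <- c(Valjean = 11, Thenardier = 3, Javert = 7,
            Gavroche = 5, Marius = 14, Cosette = 12)

# Chained lookup: character -> neuron -> super cluster
super <- sc[neuron]
names(super) <- names(neuron)
super
```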
super cluster 1 contains Valjean, which has a central position in the MDS visualization;
super cluster 2 contains the Thenardier family: Mr. and Mrs. Thenardier, their daughter Eponine and also the characters involved in their story. It also contains Javert, who is seeking to find the main character of the story, Valjean;
super cluster 3 contains Gavroche, the abandoned child of the Thenardiers, and the characters of his sub-story;
super cluster 4 contains Marius and his family: his mother, Mrs. Pontmercy, his cousin, Lieutenant Gillenormand, his grandfather Gillenormand and his aunt Miss Gillenormand; it also contains Cosette, with whom he falls in love;
super cluster 5 contains Fantine and the characters involved in her sub-story;
super cluster 6 contains Myriel and the characters involved in his sub-story.
iris data set
The iris data set has already been used in the user-friendly guide devoted to
numeric data.
To assess the performance of the relational SOM, this section compares the
results obtained with the numeric and the relational SOM. In the latter case,
the matrix of pairwise distances between the observations is used as input
data. Among all possibilities (see help(dist)), we choose here the "minkowski"
distance of order 4 to enlarge large distances and reduce small ones.
# run the numeric SOM
set.seed(4031730)
iris.som <- trainSOM(x.data=iris[,1:4])
# run the relational SOM
iris.dist <- dist(iris[,1:4], method="minkowski", diag=TRUE, upper=TRUE, p=4)
set.seed(7071731)
d.iris.som <- trainSOM(x.data=iris.dist, type="relational")
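For reference, the Minkowski distance of order p between two observations is the p-th root of the sum of the absolute coordinate differences raised to the power p. The base R sketch below checks a hand-written version (minkowski is a helper defined only for this illustration) against dist on two observations of iris.

```r
# Minkowski distance of order p between two numeric vectors
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

x <- as.numeric(iris[1, 1:4])     # first observation (setosa)
y <- as.numeric(iris[51, 1:4])    # 51st observation (versicolor)

# Order 4: same value as the corresponding entry of the dist() result
d4 <- as.matrix(dist(iris[, 1:4], method = "minkowski", p = 4))
all.equal(minkowski(x, y, p = 4), d4[1, 51])

# Order 2: this is the Euclidean distance, the default of dist()
d2 <- as.matrix(dist(iris[, 1:4]))
all.equal(minkowski(x, y, p = 2), d2[1, 51])
```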
The most important thing is to correctly separate the 3 flower species. The next 2 plots show the results with both SOM types.
plot(iris.som, what="add", type="pie", variable=iris$Species,
main="species distribution with 'numeric' SOM")
plot(d.iris.som, what="add", type="pie", variable=iris$Species,
main="species distribution with 'relational' SOM")
As we chose a higher distance order in the relational SOM (argument p=4,
whereas the Euclidean distance corresponds to a Minkowski distance of order 2),
the results from the relational SOM show a better separation of the 'virginica'
and 'versicolor' flowers: with the numeric SOM, these species are mixed in 7
neurons, whereas they are mixed in only 3 neurons with the relational SOM.
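The number of mixed neurons can also be counted rather than read off the pie charts. The sketch below is one possible way to do it in base R; count_mixed is a hypothetical helper, and it counts every neuron containing more than one class, whichever classes are involved.

```r
# Hypothetical helper: number of neurons containing more than one class
count_mixed <- function(clustering, labels) {
  tab <- table(clustering, labels)   # neurons in rows, classes in columns
  sum(rowSums(tab > 0) > 1)          # rows with at least two non-empty classes
}

# Toy example: neuron 1 mixes "a" and "b"; neurons 2 and 3 are pure
count_mixed(c(1, 1, 2, 2, 3), c("a", "b", "a", "a", "b"))
```

With the objects of this section, the corresponding calls would be count_mixed(iris.som$clustering, iris$Species) and count_mixed(d.iris.som$clustering, iris$Species).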