A **dendrogram** is a tree data structure
that allows us to represent the entire clustering hierarchy produced
by either an agglomerative or divisive algorithm. The idea is to put the
samples on the **x** axis and the dissimilarity level on the **y** axis.
Whenever two clusters are merged, the dendrogram shows a connection
corresponding to the dissimilarity level at which it occurred. Hence,
in an agglomerative scenario, a dendrogram always starts with all samples
considered as clusters and moves upward (the direction is purely conventional)
until a single cluster is defined.

For didactic purposes, it’s preferable to show the dendrogram
corresponding to a very small dataset, **X**, but all the concepts can
be applied to any situation. However, with larger datasets, it will often be
necessary to apply some truncations in order to visualize the entire structure
in a more compact form.

## How to do it?

Let’s consider a small dataset, **X**, made up of 12 bidimensional samples generated by 4 Gaussian
distributions with mean vectors in the range (**-1, 1**) × (**-1, 1**):

```python
from sklearn.datasets import make_blobs

nb_samples = 12
nb_centers = 4

X, Y = make_blobs(n_samples=nb_samples, n_features=2,
                  center_box=[-1, 1], centers=nb_centers,
                  random_state=1000)
```

The dataset (with the labels) is shown in the following screenshot:
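A minimal sketch of how such a plot could be generated (the annotation and styling choices here are illustrative and not taken from the original figure):

```python
import matplotlib.pyplot as plt

# Scatter plot of the 12 samples, colored by label and annotated with indices
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(X[:, 0], X[:, 1], c=Y, s=80)

for i, (x0, x1) in enumerate(X):
    ax.annotate(str(i), (x0, x1), xytext=(5, 5), textcoords='offset points')

ax.set_xlabel(r'$x_0$', fontsize=14)
ax.set_ylabel(r'$x_1$', fontsize=14)
plt.show()
```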

In order to generate a dendrogram (using SciPy), we first need to create a linkage matrix. In this case, we have chosen a Euclidean metric with Ward’s linkage (I encourage you to perform the analysis with different configurations):

```python
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

dm = pdist(X, metric='euclidean')
Z = linkage(dm, method='ward')
```

The dm array is a condensed pairwise distance matrix, while Z is the linkage matrix produced by Ward’s method (the linkage() function requires the method parameter, which accepts, among others, the values single, complete, average, and ward). At this point, we can generate and plot the dendrogram (the dendrogram() function can automatically plot the diagram using a default or supplied Matplotlib axis object):

```python
import numpy as np
import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(12, 8))

d = dendrogram(Z, show_leaf_counts=True, leaf_font_size=14, ax=ax)

ax.set_xlabel('Samples', fontsize=14)
ax.set_yticks(np.arange(0, 6, 0.25))

plt.show()
```

The diagram is displayed in the following screenshot:

As shown in the preceding screenshot, the **x** axis represents the samples (ordered so as to minimize the risk of cross-connections), while the **y** axis shows the dissimilarity level. Let’s now analyze the diagram from the bottom. The initial state corresponds to all samples considered as independent clusters (so the dissimilarity is null). Moving upward, we start to observe the first mergers. In particular, when the dissimilarity is about 0.35, samples **1** and **3** are merged.

The second merger happens at a dissimilarity slightly below 0.5, when samples **0** and **9** are joined. The process goes on until a dissimilarity of about 5.25, when a single cluster is created. Let’s now cut the dendrogram horizontally at a dissimilarity level equal to 1.25. Looking at the underlying connections, we discover that the clustering structure is: {**6**}, {**7**, **5**, **8**}, {**0**, **9**, **4**, **10**}, {**11**}, {**2**, **1**, **3**}.
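These levels can also be verified programmatically. A small sketch that inspects the first rows of the linkage matrix and cuts the tree at a dissimilarity of 1.25 with SciPy’s fcluster() (the flat labels are arbitrary integers; only the groupings matter):

```python
from scipy.cluster.hierarchy import fcluster

# Each row of Z is (cluster_i, cluster_j, dissimilarity, size of new cluster),
# so the first rows show the earliest mergers and their levels
print(Z[:3])

# Flat clustering obtained by cutting the dendrogram at dissimilarity 1.25
labels = fcluster(Z, t=1.25, criterion='distance')
print(labels)
```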

## Clustering

Therefore, we have five clusters, two of which are made up of a single sample. It’s not surprising that samples **6** and **11** are the last ones to be merged; in fact, they are much farther away from all the other samples. In the following screenshot, four different levels are shown (only the clusters containing more than one sample are marked with a circle):

As is easy to understand, the agglomeration starts by selecting the most similar clusters/samples and proceeds by adding the **nearest neighbors**, until the root of the tree has been reached. In our case, at a dissimilarity level equal to 2.0, three well-defined clusters have been detected. The left one is also kept in the next cut, while the two right ones (which are clearly closer) are selected for merging in order to yield a single cluster. The process itself is straightforward and doesn’t need a particular explanation; however, there are two important considerations.

The first one is inherent to the dendrogram structure itself. Contrary to other methods, hierarchical clustering allows us to observe the entire clustering tree, and such a feature is extremely helpful when it’s necessary to show how the process evolves as the dissimilarity level increases. For example, a product recommender application might not provide any information about the desired number of clusters representing the users, but executive management could be interested in understanding how the merging process is structured and evolves.

In fact, observing how clusters are merged allows deep insight into the
underlying geometry and it’s also possible to discover which clusters could
potentially be considered as parts of larger ones. In our example, at level
0.5, we have the small cluster {**1**, **3**}. The question of “what samples can be added to this cluster by
increasing the dissimilarity?” can be answered immediately with {**2**}. Of course, in this case, this is a trivial problem that can be solved
by looking at the data plot, but with high-dimensional datasets, it can become
more difficult without the support of a dendrogram.
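To check this on the linkage structure rather than on the plot, one possible sketch is to compare the flat clusterings obtained at the two cut levels discussed above (0.5 and 1.25) and look at the members of the cluster containing sample 1:

```python
from scipy.cluster.hierarchy import fcluster

# Members of the cluster containing sample 1 at two different cut levels
for t in (0.5, 1.25):
    labels = fcluster(Z, t=t, criterion='distance')
    members = [i for i, l in enumerate(labels) if l == labels[1]]
    print('t = {}: cluster of sample 1 -> {}'.format(t, members))
```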

The second advantage of dendrograms is the possibility of comparing the behavior of different linkage methods. Using Ward’s method, the first mergers happen at quite low dissimilarity levels, but there’s a large gap between **five clusters** and **three clusters**. This is a consequence of both the geometry and the merging strategy. What happens, for example, if we employ single linkage (which is intrinsically very different)? The corresponding dendrogram is shown in the following screenshot:
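The single-linkage dendrogram can be generated in the same way, reusing the condensed distance matrix dm computed earlier and changing only the method parameter:

```python
from scipy.cluster.hierarchy import linkage, dendrogram

# Same condensed distance matrix, different merging strategy
Z_single = linkage(dm, method='single')

fig, ax = plt.subplots(figsize=(12, 8))

dendrogram(Z_single, show_leaf_counts=True, leaf_font_size=14, ax=ax)

ax.set_xlabel('Samples', fontsize=14)
plt.show()
```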

## Conclusion

The conclusion is that the dendrogram is asymmetric and the
clusters are generally merged with a single sample or with small agglomerates.
Starting from the right, we can see that samples {**11**} and {**6**} were merged very late. Moreover, sample {**6**} (which could
be an outlier) is merged at the highest dissimilarity, when the final single
cluster must be produced. The process can be better understood with the
following screenshot:

As you can see from the screenshot, while Ward’s method generates two clusters containing all the samples, single linkage aggregates the largest blocks at **Level 1.0** by keeping the potential outliers outside. Therefore, the dendrogram also allows defining aggregation semantics, which can be very helpful in a psychometric and sociological context. While Ward’s linkage proceeds in a way that is very similar to other symmetric algorithms, single linkage proceeds in a step-wise fashion, showing an underlying preference for clusters built incrementally and thereby avoiding large gaps in the dissimilarity.
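One way to make this comparison concrete is to cut both dendrograms at the same level and inspect the resulting partitions (a sketch, assuming the Z and Z_single matrices from the previous snippets):

```python
from scipy.cluster.hierarchy import fcluster

# Flat clusterings at the same dissimilarity level for both linkages
for name, lm in (('ward', Z), ('single', Z_single)):
    labels = fcluster(lm, t=1.0, criterion='distance')
    print('{}: {} clusters -> {}'.format(name, len(set(labels)), list(labels)))
```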

Finally, it’s interesting to note that, while Ward’s linkage yields a potentially optimal number of clusters (three) by cutting the dendrogram at **Level 3.0**, single linkage never reaches such a configuration (because cluster {**6**} is merged only in the final step). This effect is strictly related to the double principle of maximum separation and maximum cohesion. Ward’s linkage tends to find the most cohesive and separated clusters very quickly. It allows cutting the dendrogram when the dissimilarity gap exceeds a predefined threshold (and, of course, when the desired number of clusters has been reached), while other linkages require a different approach and, sometimes, yield undesirable final configurations.
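A simple heuristic for translating the dissimilarity-gap criterion into code (a sketch of the idea, not a general rule) is to look at the differences between consecutive merge distances stored in the linkage matrix and cut just below the largest jump:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

# Merge distances are stored in the third column of the linkage matrix
distances = Z[:, 2]
gaps = np.diff(distances)

# Cut halfway between the two merges that delimit the largest gap
idx = np.argmax(gaps)
threshold = (distances[idx] + distances[idx + 1]) / 2.0

labels = fcluster(Z, t=threshold, criterion='distance')
print('Threshold: {:.2f} -> {} clusters'.format(threshold, len(set(labels))))
```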

Considering the nature of the problem, I encourage you to test the behavior of all the linkage methods and to find out the most appropriate one for some sample scenarios (for example, the segmentation of the population of a country according to education level, occupation, and income). This is the best approach to increase awareness and to improve the ability to provide a semantic interpretation of the processes (which is a fundamental goal of any clustering procedure).

*If you found this article interesting, you can explore **Hands-On Unsupervised Learning with Python** to discover the skill sets required to implement various approaches to Machine Learning with Python. **Hands-On Unsupervised Learning with Python** will help you explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python.*