Tag: data structure

Analyzing a Dendrogram

dendrogram is a tree data structure that allows us to represent the entire clustering hierarchy produced by either an agglomerative or divisive algorithm. The idea is to put the samples on the x axis and the dissimilarity level on the y axis. Whenever two clusters are merged, the dendrogram shows a connection corresponding to the dissimilarity level at which it occurred. Hence, in an agglomerative scenario, a dendrogram always starts with all samples considered as clusters and moves upward (the direction is purely conventional) until a single cluster is defined.

For didactic purposes, it’s preferable to show the dendrogram corresponding to a very small dataset, X, but all the concepts can be applied to any situation. However, with larger datasets, it will often be necessary to apply some truncations in order to visualize the entire structure in a more compact form.

How to do it?

Let’s consider a small dataset, X, made up of 12 bidimensional samples generated by 4 Gaussian distributions with mean vectors in the range (-1, 1) × (-1, 1):

 The dataset (with the labels) is shown in the following screenshot:

Dataset employed for dendrogram analysis

In order to generate a dendrogram (using SciPy), we first need to create a linkage matrix. In this case, we have chosen a Euclidean metric with Ward’s linkage (I encourage you to perform the analysis with different configurations):

The dm array is a condensed pairwise distance matrix, while Z is the linkage matrix produced by Ward’s method (the linkage() function requires the method parameter, which accepts, among others, the values single, complete, average, and ward). At this point, we can generate and plot the dendrogram (the dendrogram () function can automatically plot the diagram using a default or supplied Matplotlib axis object):

The diagram is displayed in the following screenshot:

Dendrogram corresponding to Ward’s linkage applied to the dataset

As explained in the preceding screenshot, the x axis represents the samples intended to minimize the risk of cross-connections, while the y axis shows the dissimilarity level. Let’s now analyze the diagram from the bottom. The initial state corresponds to all samples considered as independent clusters (so the dissimilarity is null). Moving upward, we start to observe the first mergers. In particular, when the dissimilarity is about 0.35, samples 1 and 3 are merged.

The second step happens with a dissimilarity of slightly below 0.5, when the samples 0 and 9 are also merged. The process goes on until a dissimilarity of about 5.25, when a single cluster is created. Let’s now dissect the dendrogram horizontally when the dissimilarity is equal to 1.25. Looking at the underlying connections, we discover that the clustering structure is: {6}, {758}, {09410}, {11}, {213}.

Clustering

Therefore, we have five clusters, with two of them made up of a single sample. It’s not surprising to observe that samples 6 and 11 are the last ones to be merged. In fact, they are much further apart than all the other ones. In the following screenshot, four different levels are shown (only the clusters containing more than one sample are marked with a circle):

Clusters generated by cutting the dendrogram at different levels (Ward’s linkage)

As is easy to understand, the agglomeration starts by selecting the most similar clusters/samples and proceeds by adding the nearest neighbors, until the root of the tree has been reached. In our case, at a dissimilarity level equal to 2.0, three well-defined clusters have been detected. The left one is also kept in the next cut, while the two right ones (which are clearly closer) are selected for merging in order to yield a single cluster. The process itself is straightforward and doesn’t need particular explanations; however, there are two important considerations.

The first one is inherent to the dendrogram structure itself. Contrary to other methods, hierarchical clustering allows observing an entire clustering tree, and such a feature is extremely helpful when it’s necessary to show how the process evolves by increasing the dissimilarity level. For example, a product recommender application could not provide any information about the desired number of clusters representing the users, but executive management might be interested in understanding how the merging process is structured and evolves.

In fact, observing how clusters are merged allows deep insight into the underlying geometry and it’s also possible to discover which clusters could potentially be considered as parts of larger ones. In our example, at level 0.5, we have the small cluster {13}. The question of “what samples can be added to this cluster by increasing the dissimilarity?” can be answered immediately with {2}. Of course, in this case, this is a trivial problem that can be solved by looking at the data plot, but with high-dimensional datasets, it can become more difficult without the support of a dendrogram.

The second advantage of dendrograms is the possibility to compare the behavior of different linkage methods. Using Ward’s method, the first mergers happen at quite low dissimilarity levels, but there’s a large gap between five clusters and three clusters. This is a consequence of both the geometry and the merging strategy. What happens, for example, if we employ a single linkage (which is intrinsically very different)? The corresponding dendrogram is shown in the following screenshot:

Dendrogram corresponding to single linkage applied to the dataset

Conclusion

The conclusion is that the dendrogram is asymmetric and the clusters are generally merged with a single sample or with small agglomerates. Starting from the right, we can see that samples {11} and {6} were merged very late. Moreover, sample {6} (which could be an outlier) is merged at the highest dissimilarity, when the final single cluster must be produced. The process can be better understood with the following screenshot:

Clusters generated by cutting the dendrogram at different levels (single linkage)

As you can see from the screenshot, while Ward’s method generates two clusters containing all samples, a single linkage aggregates the largest blocks at Level 1.0 by keeping the potential outliers outside. Therefore, the dendrogram also allows defining aggregation semantics, which can be very helpful in a psychometric and sociological context. While Ward’s linkage proceeds in a way that is very similar to other symmetric algorithms, single linkage has a step-wise fashion that shows an underlying preference for clusters built incrementally, resulting in the avoidance of large gaps in the dissimilarity.

Finally, it’s interesting to note that, while Ward’s linkage yields a potential optimal number of clusters (three) by cutting the dendrogram at Level 3.0, single linkage never reaches such a configuration (because cluster {6} is merged only in the final step). This effect is strictly related to the double principle of maximum separation and maximum cohesion. Ward’s linkage tends to find the most cohesive and separated clusters very quickly. It allows cutting the dendrogram when the dissimilarity gap overcomes a predefined threshold (and, of course, when the desired number of clusters has been reached), while other linkages require a different approach and, sometimes, yield undesirable final configurations.

Considering the nature of the problem, I encourage you to test the behavior of all linkage methods and to find out the most appropriate method for some sample scenarios (for example, the segmentation of the population of a country according to education level, occupancy, and income). This is the best approach to increase awareness and to improve the ability to provide a semantic interpretation of the processes (which is a fundamental goal of any clustering procedure).

If you found this article interesting, you can explore Hands-On Unsupervised Learning with Python to discover the skill-sets required to implement various approaches to Machine Learning with Python. Hands-On Unsupervised Learning with Python will help you explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python.