Analyzing a Dendrogram
A dendrogram is a tree data structure that allows us to represent the entire clustering hierarchy produced by either an agglomerative or divisive algorithm. The idea is to put the samples on the x axis and the dissimilarity level on the y axis. Whenever two clusters are merged, the dendrogram shows a connection corresponding to the dissimilarity level at which it occurred. Hence, in an agglomerative scenario, a dendrogram always starts with all samples considered as clusters and moves upward (the direction is purely conventional) until a single cluster is defined.
For didactic purposes, it’s preferable to show the dendrogram corresponding to a very small dataset, X, but all the concepts can be applied to any situation. However, with larger datasets, it will often be necessary to apply some truncations in order to visualize the entire structure in a more compact form.
How to do it?
Let’s consider a small dataset, X, made up of 12 bidimensional samples generated by 4 Gaussian distributions with mean vectors in the range (-1, 1) × (-1, 1):
from sklearn.datasets import make_blobs
nb_samples = 12
nb_centers = 4
X, Y = make_blobs(n_samples=nb_samples, n_features=2, center_box=[-1, 1], centers=nb_centers, random_state=1000)
The dataset (with the labels) is shown in the following screenshot:
In order to generate a dendrogram (using SciPy), we first need to create a linkage matrix. In this case, we have chosen a Euclidean metric with Ward’s linkage (I encourage you to perform the analysis with different configurations):
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage
dm = pdist(X, metric='euclidean')
Z = linkage(dm, method='ward')
The dm array is a condensed pairwise distance matrix, while Z is the linkage matrix produced by Ward’s method (the linkage() function requires the method parameter, which accepts, among others, the values single, complete, average, and ward). At this point, we can generate and plot the dendrogram (the dendrogram () function can automatically plot the diagram using a default or supplied Matplotlib axis object):
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
fig, ax = plt.subplots(figsize=(12, 8))
d = dendrogram(Z, show_leaf_counts=True, leaf_font_size=14, ax=ax)
ax.set_xlabel('Samples', fontsize=14)
ax.set_yticks(np.arange(0, 6, 0.25))
plt.show()
The diagram is displayed in the following screenshot:
As explained in the preceding screenshot, the x axis represents the samples intended to minimize the risk of cross-connections, while the y axis shows the dissimilarity level. Let’s now analyze the diagram from the bottom. The initial state corresponds to all samples considered as independent clusters (so the dissimilarity is null). Moving upward, we start to observe the first mergers. In particular, when the dissimilarity is about 0.35, samples 1 and 3 are merged.
The second step happens with a dissimilarity of slightly below 0.5, when the samples 0 and 9 are also merged. The process goes on until a dissimilarity of about 5.25, when a single cluster is created. Let’s now dissect the dendrogram horizontally when the dissimilarity is equal to 1.25. Looking at the underlying connections, we discover that the clustering structure is: {6}, {7, 5, 8}, {0, 9, 4, 10}, {11}, {2, 1, 3}.
Clustering
Therefore, we have five clusters, with two of them made up of a single sample. It’s not surprising to observe that samples 6 and 11 are the last ones to be merged. In fact, they are much further apart than all the other ones. In the following screenshot, four different levels are shown (only the clusters containing more than one sample are marked with a circle):
As is easy to understand, the agglomeration starts by selecting the most similar clusters/samples and proceeds by adding the nearest neighbors, until the root of the tree has been reached. In our case, at a dissimilarity level equal to 2.0, three well-defined clusters have been detected. The left one is also kept in the next cut, while the two right ones (which are clearly closer) are selected for merging in order to yield a single cluster. The process itself is straightforward and doesn’t need particular explanations; however, there are two important considerations.
The first one is inherent to the dendrogram structure itself. Contrary to other methods, hierarchical clustering allows observing an entire clustering tree, and such a feature is extremely helpful when it’s necessary to show how the process evolves by increasing the dissimilarity level. For example, a product recommender application could not provide any information about the desired number of clusters representing the users, but executive management might be interested in understanding how the merging process is structured and evolves.
In fact, observing how clusters are merged allows deep insight into the underlying geometry and it’s also possible to discover which clusters could potentially be considered as parts of larger ones. In our example, at level 0.5, we have the small cluster {1, 3}. The question of “what samples can be added to this cluster by increasing the dissimilarity?” can be answered immediately with {2}. Of course, in this case, this is a trivial problem that can be solved by looking at the data plot, but with high-dimensional datasets, it can become more difficult without the support of a dendrogram.
The second advantage of dendrograms is the possibility to compare the behavior of different linkage methods. Using Ward’s method, the first mergers happen at quite low dissimilarity levels, but there’s a large gap between five clusters and three clusters. This is a consequence of both the geometry and the merging strategy. What happens, for example, if we employ a single linkage (which is intrinsically very different)? The corresponding dendrogram is shown in the following screenshot:
Conclusion
The conclusion is that the dendrogram is asymmetric and the clusters are generally merged with a single sample or with small agglomerates. Starting from the right, we can see that samples {11} and {6} were merged very late. Moreover, sample {6} (which could be an outlier) is merged at the highest dissimilarity, when the final single cluster must be produced. The process can be better understood with the following screenshot:
As you can see from the screenshot, while Ward’s method generates two clusters containing all samples, a single linkage aggregates the largest blocks at Level 1.0 by keeping the potential outliers outside. Therefore, the dendrogram also allows defining aggregation semantics, which can be very helpful in a psychometric and sociological context. While Ward’s linkage proceeds in a way that is very similar to other symmetric algorithms, single linkage has a step-wise fashion that shows an underlying preference for clusters built incrementally, resulting in the avoidance of large gaps in the dissimilarity.
Finally, it’s interesting to note that, while Ward’s linkage yields a potential optimal number of clusters (three) by cutting the dendrogram at Level 3.0, single linkage never reaches such a configuration (because cluster {6} is merged only in the final step). This effect is strictly related to the double principle of maximum separation and maximum cohesion. Ward’s linkage tends to find the most cohesive and separated clusters very quickly. It allows cutting the dendrogram when the dissimilarity gap overcomes a predefined threshold (and, of course, when the desired number of clusters has been reached), while other linkages require a different approach and, sometimes, yield undesirable final configurations.
Considering the nature of the problem, I encourage you to test the behavior of all linkage methods and to find out the most appropriate method for some sample scenarios (for example, the segmentation of the population of a country according to education level, occupancy, and income). This is the best approach to increase awareness and to improve the ability to provide a semantic interpretation of the processes (which is a fundamental goal of any clustering procedure).
If you found this article interesting, you can explore Hands-On Unsupervised Learning with Python to discover the skill-sets required to implement various approaches to Machine Learning with Python. Hands-On Unsupervised Learning with Python will help you explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python.
Programming Bitcoin with Python
Learn how to generate private and public keys, and how to create a multi-signature bitcoin address in this tutorial with python.
In order to get started with bitcoin using Python, you must install Python 3.x and the bitcoin Python library called Pi Bitcoin tools in the system.
The Pi Bitcoin tools library
To install the Pi Bitcoin tools library, open the command-line program and execute the following command:
pip install bitcoin
The best thing about this library is that it does not need to have a bitcoin node on your computer in order for you to start using it. It connects to the bitcoin network and pulls data from places such as Blockchain.info.
For a start, write the equivalent of a Hello World program for bitcoin in Python. In the hello_bitcoin.py script, the demonstration of a new bitcoin address is created using Python. Go through the following steps to run the program:
1. Import the bitcoin library:
#!/usr/bin/env python ''' Title - Hello Bitcoin This program demonstrates the creation of - private key, - public key - and a bitcoin address. ''' # import bitcoin from bitcoin import *
2. Generate a private key using the random key function:
my_private_key = random_key()
3. Display the private key on the screen:
print("Private Key: %s\n" % my_private_key)
How to generate private keys and public keys
With the private key, a public key is generated. Perform this step by passing the private key that was generated to the privtopub function, as shown here:
# Generate Public Key my_public_key = privtopub(my_private_key) print("Public Key: %s\n" % my_public_key)
Now, with the public key, generate a bitcoin address. Do this by passing the public key that is generated to the pubtoaddr function:
# Create a bitcoin address my_bitcoin_address = pubtoaddr(my_public_key) print("Bitcoin Address: %s\n" % my_bitcoin_address)
The following screenshot shows the private key, public key, and the bitcoin address that is generated:
Note that a bitcoin address is a single-use token. Just as people use email addresses to send and receive emails, you can use this bitcoin address to send and receive bitcoins. Unlike email addresses, however, people have several bitcoin addresses, and it is a must to use a unique address for every transaction.
Creating a multisignature bitcoin address
A multisignature address is an address that is associated with more than one private key. In this section, you’ll create three private keys. Multisignature addresses are useful in organizations where no single individual is trusted with authorising the spending of bitcoins.
Go through the following steps to create a multisignature bitcoin address:
- Create three private keys:
#!/usr/bin/env python ''' Title - Create multi-signature address This program demonstrates the creation of Multi-signature bitcoin address. ''' # import bitcoin from bitcoin import * # Create Private Keys my_private_key1 = random_key() my_private_key2 = random_key() my_private_key3 = random_key() print("Private Key1: %s" % my_private_key1) print("Private Key2: %s" % my_private_key2) print("Private Key3: %s" % my_private_key3) print('\n')
2. Create three public keys from those private keys using the privtopub function:
# Create Public keys my_public_key1 = privtopub(my_private_key1) my_public_key2 = privtopub(my_private_key2) my_public_key3 = privtopub(my_private_key3) print("Public Key1: %s" % my_public_key1) print("Public Key2: %s" % my_public_key2) print("Public Key3: %s" % my_public_key3) print('\n')
3. After generating the public keys, create the multisig by passing the three public keys to the mk_ multi-sig_script function. The resulting multisig is passed to the addr script function to create the multisignature bitcoin address.
# Create Multi-signature address my_multi_sig = mk_multisig_script(my_private_key1, my_private_key2, my_private_key3, 2,3) my_multi_address = scriptaddr(my_multi_sig) print("Multi signature address: %s" % my_multi_address)
4. Print the multisignature address and execute the script. The following screenshot shows the output for the multisig bitcoin address:
You can also look at the preexisting bitcoin addresses’ transactional history. You’ll need to first get a valid address from Blockchain.info.
The following screenshot shows the copied address of a bitcoin block:
Pass the copied address to the history function, as shown in the following code, along with the output to get the history of the bitcoin address, including the transactional information:
!/usr/bin/env python ''' Title - Bitcoin Transaction History This program demonstrates listing history of a bitcoin address. ''' # import bitcoin from bitcoin import * #View address transaction history a_valid_bitcoin_address = '329e5RtfraHHNPKGDMXNxtuS4QjZTXqBDg' print(history(a_valid_bitcoin_address))
Hope you found this article interesting. To learn more interesting stuff about bitcoins and Python, you can explore Hands-On Bitcoin Programming with Python. Written with an easy-to-understand approach in mind, Hands-On Bitcoin Programming with Python takes you through numerous practical examples to teach you to build software for mining and create bitcoin using Python.