Category: Uncategorized

Analyzing a Dendrogram

dendrogram is a tree data structure that allows us to represent the entire clustering hierarchy produced by either an agglomerative or divisive algorithm. The idea is to put the samples on the x axis and the dissimilarity level on the y axis. Whenever two clusters are merged, the dendrogram shows a connection corresponding to the dissimilarity level at which it occurred. Hence, in an agglomerative scenario, a dendrogram always starts with all samples considered as clusters and moves upward (the direction is purely conventional) until a single cluster is defined.

For didactic purposes, it’s preferable to show the dendrogram corresponding to a very small dataset, X, but all the concepts can be applied to any situation. However, with larger datasets, it will often be necessary to apply some truncations in order to visualize the entire structure in a more compact form.

How to do it?

Let’s consider a small dataset, X, made up of 12 bidimensional samples generated by 4 Gaussian distributions with mean vectors in the range (-1, 1) × (-1, 1):

from sklearn.datasets import make_blobs

nb_samples = 12
nb_centers = 4

X, Y = make_blobs(n_samples=nb_samples, n_features=2, center_box=[-1, 1], centers=nb_centers, random_state=1000)

 The dataset (with the labels) is shown in the following screenshot:

Dataset employed for dendrogram analysis

In order to generate a dendrogram (using SciPy), we first need to create a linkage matrix. In this case, we have chosen a Euclidean metric with Ward’s linkage (I encourage you to perform the analysis with different configurations):

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

dm = pdist(X, metric='euclidean')
Z = linkage(dm, method='ward')

The dm array is a condensed pairwise distance matrix, while Z is the linkage matrix produced by Ward’s method (the linkage() function requires the method parameter, which accepts, among others, the values single, complete, average, and ward). At this point, we can generate and plot the dendrogram (the dendrogram () function can automatically plot the diagram using a default or supplied Matplotlib axis object):

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(12, 8))

d = dendrogram(Z, show_leaf_counts=True, leaf_font_size=14, ax=ax)

ax.set_xlabel('Samples', fontsize=14)
ax.set_yticks(np.arange(0, 6, 0.25))

plt.show()

The diagram is displayed in the following screenshot:

Dendrogram corresponding to Ward’s linkage applied to the dataset

As explained in the preceding screenshot, the x axis represents the samples intended to minimize the risk of cross-connections, while the y axis shows the dissimilarity level. Let’s now analyze the diagram from the bottom. The initial state corresponds to all samples considered as independent clusters (so the dissimilarity is null). Moving upward, we start to observe the first mergers. In particular, when the dissimilarity is about 0.35, samples 1 and 3 are merged.

The second step happens with a dissimilarity of slightly below 0.5, when the samples 0 and 9 are also merged. The process goes on until a dissimilarity of about 5.25, when a single cluster is created. Let’s now dissect the dendrogram horizontally when the dissimilarity is equal to 1.25. Looking at the underlying connections, we discover that the clustering structure is: {6}, {758}, {09410}, {11}, {213}.

Clustering

Therefore, we have five clusters, with two of them made up of a single sample. It’s not surprising to observe that samples 6 and 11 are the last ones to be merged. In fact, they are much further apart than all the other ones. In the following screenshot, four different levels are shown (only the clusters containing more than one sample are marked with a circle):

Clusters generated by cutting the dendrogram at different levels (Ward’s linkage)

As is easy to understand, the agglomeration starts by selecting the most similar clusters/samples and proceeds by adding the nearest neighbors, until the root of the tree has been reached. In our case, at a dissimilarity level equal to 2.0, three well-defined clusters have been detected. The left one is also kept in the next cut, while the two right ones (which are clearly closer) are selected for merging in order to yield a single cluster. The process itself is straightforward and doesn’t need particular explanations; however, there are two important considerations.

The first one is inherent to the dendrogram structure itself. Contrary to other methods, hierarchical clustering allows observing an entire clustering tree, and such a feature is extremely helpful when it’s necessary to show how the process evolves by increasing the dissimilarity level. For example, a product recommender application could not provide any information about the desired number of clusters representing the users, but executive management might be interested in understanding how the merging process is structured and evolves.

In fact, observing how clusters are merged allows deep insight into the underlying geometry and it’s also possible to discover which clusters could potentially be considered as parts of larger ones. In our example, at level 0.5, we have the small cluster {13}. The question of “what samples can be added to this cluster by increasing the dissimilarity?” can be answered immediately with {2}. Of course, in this case, this is a trivial problem that can be solved by looking at the data plot, but with high-dimensional datasets, it can become more difficult without the support of a dendrogram.

The second advantage of dendrograms is the possibility to compare the behavior of different linkage methods. Using Ward’s method, the first mergers happen at quite low dissimilarity levels, but there’s a large gap between five clusters and three clusters. This is a consequence of both the geometry and the merging strategy. What happens, for example, if we employ a single linkage (which is intrinsically very different)? The corresponding dendrogram is shown in the following screenshot:

Dendrogram corresponding to single linkage applied to the dataset

Conclusion

The conclusion is that the dendrogram is asymmetric and the clusters are generally merged with a single sample or with small agglomerates. Starting from the right, we can see that samples {11} and {6} were merged very late. Moreover, sample {6} (which could be an outlier) is merged at the highest dissimilarity, when the final single cluster must be produced. The process can be better understood with the following screenshot:

Clusters generated by cutting the dendrogram at different levels (single linkage)

As you can see from the screenshot, while Ward’s method generates two clusters containing all samples, a single linkage aggregates the largest blocks at Level 1.0 by keeping the potential outliers outside. Therefore, the dendrogram also allows defining aggregation semantics, which can be very helpful in a psychometric and sociological context. While Ward’s linkage proceeds in a way that is very similar to other symmetric algorithms, single linkage has a step-wise fashion that shows an underlying preference for clusters built incrementally, resulting in the avoidance of large gaps in the dissimilarity.

Finally, it’s interesting to note that, while Ward’s linkage yields a potential optimal number of clusters (three) by cutting the dendrogram at Level 3.0, single linkage never reaches such a configuration (because cluster {6} is merged only in the final step). This effect is strictly related to the double principle of maximum separation and maximum cohesion. Ward’s linkage tends to find the most cohesive and separated clusters very quickly. It allows cutting the dendrogram when the dissimilarity gap overcomes a predefined threshold (and, of course, when the desired number of clusters has been reached), while other linkages require a different approach and, sometimes, yield undesirable final configurations.

Considering the nature of the problem, I encourage you to test the behavior of all linkage methods and to find out the most appropriate method for some sample scenarios (for example, the segmentation of the population of a country according to education level, occupancy, and income). This is the best approach to increase awareness and to improve the ability to provide a semantic interpretation of the processes (which is a fundamental goal of any clustering procedure).

If you found this article interesting, you can explore Hands-On Unsupervised Learning with Python to discover the skill-sets required to implement various approaches to Machine Learning with Python. Hands-On Unsupervised Learning with Python will help you explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python.

Managing Windows Networking with cmdlets

At the heart of every organization is the network—the infrastructure that enables client and server systems to interoperate. Windows has included networking features since the early days of Windows for Workgroups 3.1 (and earlier with Microsoft LAN Manager).

Many of the tools that IT pros use today have been around for a long time, but have more recently been replaced by PowerShell cmdlets. In this article, we look at some of the old commands and their replacement cmdlets.

New ways to do old things

Networking IT pros in the Windows Server space have been using a number of console applications to perform basic diagnostics for decades. Tools such as Ipconfig, Tracert, and NSlookup are used by IT pros all over the world. The network shell (netsh) is another veritable Swiss Army Knife set of tools to configure and manage Windows networking components.

PowerShell implements a number of cmdlets that do some of the tasks that older Win32 console applications provided. Cmdlets, such as Get-NetIPConfiguration and Resolve-DnsName, are newer alternatives to ipconfig.exe and nslookup.exe.

These cmdlets also add useful functionality. For example, using Test-NetConnection enables you to check whether a host that might block ICMP is supporting inbound traffic on a particular port. ping.exe only uses ICMP, which may be blocked somewhere in the path to the server.

One administrative benefit of using cmdlets rather than older console applications relates to remoting security. With JEA, you can constrain a user to only be able to use certain cmdlets and parameter values. In general, cmdlets make it easier for you to secure servers that are open for remoting.

This article shows you some of the new cmdlets that are available with PowerShell and Windows Server 2019.

Getting ready

This recipe uses two servers: DC1.Reskit.Org and SRV1.Reskit.Org. DC1 is a domain controller in the Reskit.Org domain and SRV1 is a member server. This article assumes that you have already set up DC1 as a domain controller. Run this exercise on SRV1.

How to do it…

  • Examine two ways to retrieve the IP address configuration (ipconfig versus a new cmdlet):
# Two variations on the old way
ipconfig.exe
ipconfig.exe /all

# The new Way
Get-NetIPConfiguration
  • Ping a computer:
# The old way
Ping DC1.Reskit.Org -4

# The New way
Test-NetConnection -ComputerName DC1.Reskit.Org

# And some new things Ping does not do!
Test-NetConnection -ComputerName DC1.Reskit.Org -CommonTCPPort SMB
$ILHT = @{InformationLevel = 'Detailed'}
Test-NetConnection -ComputerName DC1.Reskit.Org -port 389 @ILHT
  • Use the sharing folder from DC1:
# The old way to use a shared folder
net use X:  \\DC1.Reskit.Org\C$

# The new way using an SMB cmdlet
New-SMBMapping -LocalPath 'Y:' -RemotePath '\\DC1.Reskit.Org\C$'

# See what is shared the old way:
net use

# And the new way
Get-SMBMapping
  • Share a folder from SRV1:
# Now share the old way
net share Windows=C:\windows
# and the new way
New-SmbShare -Path C:\Windows -Name Windows2
# And see what has been shared the old way
net share
# and the new way
Get-SmbShare
  • Display the contents of the DNS client cache:
# The old way to see the DNS Client Cache
ipconfig /displaydns
# Vs
Get-DnsClientCache
  • Clear the DNS client cache using old and new methods:
ipconfig /flushdns
# Vs the new way
Clear-DnsClientCache
  • Perform DNS lookups:
nslookup DC1.Reskit.Org
Resolve-DnsName -Name DC1.Reskit.Org  -Type ALL

How it works…

In step 1, you examined the old/new way to view the IP configuration of a Windows host using ipconfig.exe and the Get-NetIPConfiguration cmdlet. First, you looked at two variations of using ipconfig.exe, which looks like this:

The Get-NetIPConfiguration cmdlet returns similar information, as follows:

In step 2, you examined the ping.exe command and the newer Test-NetConnection cmdlet. Using these two commands to ping DC1 (from SRV1) looks like this:

The Test-NetConnection cmdlet is also able to do some things that ping.exe cannot do, including testing access to a specific port (as opposed to just using ICMP) on the target host and providing more detailed information about connecting to that remote port, as you can see here:

In step 3, you examined new and old ways to create a drive mapping on the local host (that points to a remotely shared folder). The net.exe command, which has been around since the days of Microsoft LAN Manager, enables you to create and view drive mappings. The SMB cmdlets perform similar functions, as you can see here:

In step 4, you created and viewed an SMB share on SRV1, using both net.exe and the SMB cmdlets. This step looks like this:

DNS is all too often the focus of network troubleshooting activity. The Windows DNS client holds a cache of previously resolved network names and their IP addresses. This avoids Windows systems from having to perform DNS lookups every time a network host name is used. In step 5, you looked at the old and new ways to view the local DNS cache, which looks like this:

One often-used network troubleshooting technique involves clearing the DNS client cache. You can use ipconfig.exe or the Clear-DNSClientCache cmdlet, as shown in step 6. Neither the ipconfig.execommand or the Clear-DNSClientCache cmdlet produce any output.

Another troubleshooting technique involves asking the DNS server to resolve a DNS name. Traditionally, you would have used nslookup.exe. This is replaced with the Resolve-DNSName cmdlet. The two methods that you used in step 7 look like this:

There’s more…

In step 1, you looked at two ways of discovering a host’s IP configuration. Get-NetIPconfiguration, by default, returns the host’s DNS server IP address, whereas ipconfig.exe doesn’t. On the other hand, ipconfig.exe is considerably quicker.

Ping is meant to stand for Packet InterNetwork Groper and has been an important tool to determine whether a remote system is online. ping.exe uses ICMP echo request/reply, but many firewalls block ICMP (it has been an attack vector in the past). The Test-NetConnection cmdlet has the significant benefit that it can test whether the remote host has a particular port open. On the other hand, the host might block ICMP, if the host is to provide a service, for example, SMB shares, then the relevant port has to be open. Thus, Test-NetConnection is a lot more useful for network troubleshooting.

In step 2, you pinged a server. In addition to ping.exe, there are numerous third-party tools that can help you determine whether a server is online. The TCPing application, for example, pings a server on a specific port using TCP/IP by opening and closing a connection on the specified port. You can download this free utility from https://www.elifulkerson.com/projects/tcping.php.

If you found this article interesting, you can explore Windows Server 2019 Automation with PowerShell Cookbook to automate Windows server tasks with the powerful features of the PowerShell Language. Windows Server 2019 Automation with PowerShell Cookbook helps the reader learn how to use PowerShell and manage core roles, features, and services of Windows Server 2019.

Building the KNN algorithm With JavaScript

k-Nearest Neighbor (KNN)

The KNN is a simple, fast, and straightforward classification algorithm. It is very useful for categorized numerical datasets, where the data is naturally clustered. It will feel similar in some ways to the k-means clustering algorithm; with the major distinction being that k-means is an unsupervised algorithm while KNN is a supervised learning algorithm.

If you wish to perform a KNN analysis manually, here’s how it should go: first, plot all your training data on a graph and label each point with its category or label. When you wish to classify a new, unknown point, put it on the graph and find the k closest points to it (the nearest neighbors).

The number k should be an odd number in order to avoid ties; three is a good starting point, but some applications will need more and some can get away with one. Report whatever the majority of the k nearest neighbors is classified, as that will be the result of the algorithm.

Finding the k nearest neighbors to a test point is straightforward, but you can use some optimizations if your training data is very large. Typically, when evaluating a new point, you would calculate the Euclidean distance (the typical, high school geometry distance measure) between your test point and every other training point, and sort them by distance. This algorithm is quite fast because the training data is generally not more than 10,000 points or so.

If you have many training examples (in the order of millions) or you really need the algorithm to be lightning-fast, there are two optimizations you can make. The first is to skip the square root operation in the distance measure and use the squared distance instead. While modern CPUs are very fast, the square root operation is still much slower than multiplication and addition, so you can save a few milliseconds by avoiding the square root.

The second optimization is to only consider points within some bounding rectangle of distance to your test point; for instance, only consider points within +/- 5 units in each dimension from the test point’s location. If your training data is dense, this optimization will not affect the results but will speed up the algorithm because it will avoid calculating distances for many points.

The following is the KNN algorithm as a high-level description:

  • Record all training data and their labels
  • Given a new point to evaluate, generate a list of its distances to all training points
  • Sort the list of distances in the order of closest to farthest
  • Throw out all but the knearest distances
  • Determine which label represents the majority of your knearest neighbors; this is the result of the algorithm

A more efficient version avoids maintaining a large list of distances that need to be sorted by limiting the list of distances to k items. Now get started with your implementation of the KNN algorithm.

Building the KNN algorithm

Since the KNN algorithm is quite simple, you can build your own implementation:

  1. Create a new folder and name it Ch5-knn.
  2. Add the following jsonfile to the folder. Here, you have added a dependency for the jimp library, which is an image processing library:
{

"name": "Ch5-knn",

"version": "1.0.0",

"description": "ML in JS Example for Chapter 5 - k-nearest-neighbor",

"main": "src/index.js",

"author": "Burak Kanber",

"license": "MIT",

"scripts": {

"build-web": "browserify src/index.js -o dist/index.js -t [ babelify --presets [ env ] ]",

"build-cli": "browserify src/index.js --node -o dist/index.js -t [ babelify --presets [ env ] ]",

"start": "yarn build-cli && node dist/index.js"

},

"dependencies": {

"babel-core": "^6.26.0",

"babel-plugin-transform-object-rest-spread": "^6.26.0",

"babel-preset-env": "^1.6.1",

"babelify": "^8.0.0",

"browserify": "^15.1.0",

"jimp": "^0.2.28"

}

}
  1. Run the yarn installcommand to download and install all the dependencies and then create subfolders called src, dist, and files.
  2. Inside the srcfolder, create an js file and a knn.js file.

You will also need a data.js file. For these examples, a larger dataset has been used which is difficult to be printed here, so you should take a minute to download the Ch5-knn/src/data.js file from GitHub. You can also find the complete code for this article at https://github.com/PacktPublishing/Hands-On-Machine-Learning-with-JavaScript/tree/master/Chapter05/Ch5-knn.

  1. Start with the jsfile. You’ll need a distance-measuring function. Add the following to the beginning of knn.js:
/**

 * Calculate the distance between two points.

 * Points must be given as arrays or objects with equivalent keys.

 * @param {Array.<number>} a

 * @param {Array.<number>} b

 * @return {number}

 */

const distance = (a, b) => Math.sqrt(

    a.map((aPoint, i) => b[i] - aPoint)

        .reduce((sumOfSquares, diff) => sumOfSquares + (diff*diff), 0)

);

If you really need a performance optimization for your KNN implementation, this is where you might omit the Math.sqrt operation and return just the squared distance. However, since this is such a fast algorithm by nature, you need to do this only if you’re working on an extreme problem with a lot of data or with very strict speed requirements.

  1. Next, add the stub of your KNN class. Add the following to js, beneath the distance function:
class KNN {
constructor(k = 1, data, labels) {
this.k = k;
this.data = data;
this.labels = labels;

    }
}
export default KNN;

The constructor accepts three arguments: the k or the number of neighbors to consider when classifying your new point, the training data split up into the data points alone, and a corresponding array of their labels.

  1. Next, you need to add an internal method that considers a test point and calculates a sorted list of distances from the test point to the training points. You can call this a distance map. Add the following to the body of the KNN class:
generateDistanceMap(point) {



const map = [];

let maxDistanceInMap;



for (let index = 0, len = this.data.length; index < len; index++) {



const otherPoint = this.data[index];

const otherPointLabel = this.labels[index];

const thisDistance = distance(point, otherPoint);



/**

         * Keep at most k items in the map.

         * Much more efficient for large sets, because this

         * avoids storing and then sorting a million-item map.

         * This adds many more sort operations, but hopefully k is small.

         */

if (!maxDistanceInMap || thisDistance < maxDistanceInMap) {



            // Only add an item if it's closer than the farthest of the candidates

            map.push({

                index,

                distance: thisDistance,

                label: otherPointLabel

            });



            // Sort the map so the closest is first

            map.sort((a, b) => a.distance < b.distance ? -1 : 1);



            // If the map became too long, drop the farthest item

if (map.length > this.k) {

                map.pop();

            }



            // Update this value for the next comparison

            maxDistanceInMap = map[map.length - 1].distance;



        }

    }





return map;

}

This method could be easier to read, but the simpler version is not efficient for very large training sets. What you’re doing here is maintaining a list of points that might be the KNNs and storing them in map.

By maintaining a variable called maxDistanceInMap, you can loop over every training point and make a simple comparison to see whether the point should be added to your candidates’ list. If the point you’re iterating over is closer than the farthest of your candidates, you can add the point to the list, re-sort the list, remove the farthest point to keep the list small, and then update mapDistanceInMap.

If that sounds like a lot of work, a simpler version might loop overall points, add each one with its distance measurement to the map, sort the map, and then return the first k items. The downside of this implementation is that for a dataset of a million points, you’d need to build a distance map of a million points and then sort that giant list in memory.

In your version, you only ever hold k items as candidates, so you never need to store a separate million-point map. Your version does require a call to Array.sort whenever an item is added to the map. This is inefficient in its own way, as the sort function is called for each addition to the map. Fortunately, the sort operation is only for k items, where k might be something like 3 or 5.

The computational complexity of the sorting algorithm is most likely O(n log n) (for a quicksort or mergesort implementation), so it only takes about 30 data points for the sophisticated version to be more efficient than the simple version when k = 3, and for k = 5, this happens at around 3,000 data points. However, both versions are so fast that for a dataset smaller than 3,000 points, you won’t notice the difference.

  1. Finally, tie the algorithm together with the predict The predictmethod must accept a test point, and at the very least, return the determined label for the point. You can also add some additional output to the method and report the labels of the k nearest neighbors as well as the number of votes each label contributed. Add the following to the body of the KNN class:
predict(point) {



const map = this.generateDistanceMap(point);

const votes = map.slice(0, this.k);

const voteCounts = votes

// Reduces into an object like {label: voteCount}

.reduce((obj, vote) => Object.assign({}, obj, {[vote.label]: (obj[vote.label] || 0) + 1}), {})

    ;

const sortedVotes = Object.keys(voteCounts)

        .map(label => ({label, count: voteCounts[label]}))

        .sort((a, b) => a.count > b.count ? -1 : 1)

    ;



return {

label: sortedVotes[0].label,

voteCounts,

votes

};



}

This method requires a little bit of datatype juggling in JavaScript but is simple in concept. First, generate your distance map using the method you just implemented. Then, remove all data except for the k nearest points and store that in a votes variable. If you’re using 3 as k, then votes will be an array of length three.

Now that you have your k nearest neighbors, you need to figure out which label represents the majority of the neighbors. You can do this by reducing your votes array into an object called voteCounts. To get a picture of what you want voteCounts to look like, imagine that you’re looking for the three nearest neighbors and the possible categories are Male or Female. The voteCounts variable might look like this: {“Female”: 2, “Male”: 1}.

The job is still not done, however—after reducing your votes into a vote-count object, you still need to sort that and determine the majority label. You can do this by mapping the vote counts object back into an array and then sorting the array based on vote counts.

There are other ways to approach this problem of tallying votes; any method you can think of will work, as long as you can return the majority vote at the end of the day. That’s all you need to do in the knn.js file. The algorithm is complete, requiring fewer than 70 lines of code.

Now set up your index.js file and get ready to run some examples. Remember that you need to download the data.js file first. You can do this by downloading the file from https://github.com/bkanber/MLinJSBook. Now add the following to the top of index.js:

import KNN from'./knn.js';

import {weight_height} from'./data.js';

You can now try out the algorithm using a simple example.

Example – Height, weight, and gender

KNN, like k-means, can work on high-dimensional data—but, like k-means, you can only graph example data in a two-dimensional plane, so keep your example simple. The first question you’ll tackle is: can you predict a person’s biological sex given only their height and weight?

The data for this example has been downloaded from a national longitudinal survey on people’s perception of their weight. Included in the data are the respondents’ height, weight, and gender. This is what the data looks like when graphed:

Just by looking at the preceding charted data, you can get a sense as to why KNN is so effective at evaluating clustered data. It’s true that there’s no neat boundary between male and female, but if you were to evaluate a new data point of a 200 pound, 72 inches-tall person, it’s clear that all the training data around that point is male and it’s likely your new point is male, too.

Conversely, a new respondent at 125 pounds and a height of 62 inches is well into the female area of the graph, though there are a couple of males with those characteristics as well. The middle of the graph, around 145 pounds and 65 inches tall, is the most ambiguous, with an even split of male and female training points. Expect the algorithm to be uncertain about the new points in that area. As there is no clear dividing line in this dataset, you would need more features or more dimensions to get a better resolution of the boundaries.

In any case, try out a few examples. Pick five points that you may expect to be definitely male, definitely female, probably male, probably female, and indeterminable. Add the following code to index.js, beneath the two import lines:

console.log("Testing height and weight with k=5");

console.log("==========================");



constsolver1 = new KNN(5, weight_height.data, weight_height.labels);



console.log("Testing a 'definitely male' point:");

console.log(solver1.predict([200, 75]));

console.log("\nTesting a 'probably male' point:");

console.log(solver1.predict([170, 70]));

console.log("\nTesting a 'totally uncertain' point:");

console.log(solver1.predict([140, 64]));

console.log("\nTesting a 'probably female' point:");

console.log(solver1.predict([130, 63]));

console.log("\nTesting a 'definitely female' point:");

console.log(solver1.predict([120, 60]));

Run yarn start from the command line and you should see the following output. Since the KNN is not stochastic that is it does not use any random conditions in its evaluation, you should see exactly the same output with the possible exception of the ordering of votes and their indexes, if two votes have the same distance.

If you get an error when you run yarn start, make sure that your data.js file has been correctly downloaded and installed.

Here’s the output from the preceding code:

Testing heightand weight withk=5

======================================================================



 Testing a'definitely male' point:

 { label: 'Male',

 voteCounts: { Male: 5 },

votes:

 [ { index: 372, distance: 0, label: 'Male' },

 { index: 256, distance: 1, label: 'Male' },

 { index: 291, distance: 1, label: 'Male' },

 { index: 236, distance: 2.8284271247461903, label: 'Male' },

 { index: 310, distance: 3, label: 'Male' } ] }



 Testing a'probably male' point:

 { label: 'Male',

 voteCounts: { Male: 5 },

votes:

 [ { index: 463, distance: 0, label: 'Male' },

 { index: 311, distance: 0, label: 'Male' },

 { index: 247, distance: 1, label: 'Male' },

 { index: 437, distance: 1, label: 'Male' },

 { index: 435, distance: 1, label: 'Male' } ] }



 Testing a'totally uncertain' point:

 { label: 'Male',

 voteCounts: { Male: 3, Female: 2 },

votes:

 [ { index: 329, distance: 0, label: 'Male' },

 { index: 465, distance: 0, label: 'Male' },

 { index: 386, distance: 0, label: 'Male' },

 { index: 126, distance: 0, label: 'Female' },

 { index: 174, distance: 1, label: 'Female' } ] }



 Testing a'probably female' point:

 { label: 'Female',

 voteCounts: { Female: 4, Male: 1 },

votes:

 [ { index: 186, distance: 0, label: 'Female' },

 { index: 90, distance: 0, label: 'Female' },

 { index: 330, distance: 0, label: 'Male' },

 { index: 51, distance: 1, label: 'Female' },

 { index: 96, distance: 1, label: 'Female' } ] }



 Testing a'definitely female' point:

 { label: 'Female',

 voteCounts: { Female: 5 },

votes:

 [ { index: 200, distance: 0, label: 'Female' },

 { index: 150, distance: 0, label: 'Female' },

 { index: 198, distance: 1, label: 'Female' },

 { index: 147, distance: 1, label: 'Female' },

 { index: 157, distance: 1, label: 'Female' } ] }

The algorithm has determined genders just as you would have done, visually, by looking at the chart. Feel free to play with this example more and experiment with different values of k to see how results might differ for any given test point.

If you found this article interesting, you can explore Burak Kanber’s Hands-on Machine Learning with JavaScript to gain hands-on knowledge on evaluating and implementing the right model, along with choosing from different JS libraries, such as NaturalNode, brain, harthur, and classifier to design smarter applications. This book is a definitive guide to creating an intelligent web application with the best of machine learning and JavaScript.

Customizing a Simple Social Media Application Using MERN

Learn how to use the MERN stack to build simple social media application features in this tutorial by Shama Hoque.

MERN Social

MERN Social is a sample social media application, used for the purpose of this article, with rudimentary features inspired by existing social media platforms such as Facebook and Twitter. In this article, you’ll learn how to update a user profile in MERN Social that can display a user uploaded profile photo and an about description.

The main purpose of this application is to demonstrate how to use the MERN stack technologies to implement features that allow users to connect and interact over content. You can extend these implementations further, as desired, for more complex features.

The code for the complete MERN Social application is available on GitHub in the repository at github.com/shamahoque/mern-social. You can clone this code and run the application as you go through the code explanations. Note that you will need to create a MERN skeleton application by adding a working React frontend, including frontend routing and basic server-side rendering of the React views for following these steps.

The views needed for the MERN Social application can be developed by extending and modifying the existing React components in the MERN skeleton application. The following component tree shows all the custom React components that make up the MERN Social frontend and also exposes the composition structure that you can use to build out extended views:

Updating the user profile

The skeleton application only has support for a user’s name, email, and password. However, in MERN Social, you can allow users to add their description and also upload a profile photo while editing the profile after signing up:

Adding an about description

In order to store the description entered in the about field by a user, you need to add an about field to the user model in server/models/user.model.js:

about: {   

type: String,   

trim: true 

}

Then, to get the description as input from the user, you can add a multiline TextField to the EditProfile form in mern-social/client/user/EditProfile.js:

<TextField     

        id="multiline-flexible"     

        label="About"     

        multiline      rows="2"     

        value={this.state.about}     

        onChange={this.handleChange('about')}  

/>

Finally, to show the description text added to the about field on the user profile page, you can add it to the existing profile view in mern-social/client/user/Profile.js:

<ListItem> <ListItemText primary={this.state.user.about}/> </ListItem>

With this modification to the user feature in the MERN skeleton code, users can now add and update a description about them to be displayed on their profiles.

Uploading a profile photo

Allowing a user to upload a profile photo will require that you store the uploaded image file and retrieve it on request to load in the view. For MERN Social, you need to assume that the photo files uploaded by the user will be of small sizes and demonstrate how to store these files in MongoDB for the profile photo upload feature.

There are multiple ways of implementing this upload feature considering the different file storage options:

  • Server filesystem: Upload and save files to a server filesystem and store the URL to MongoDB
  • External file storage: Save files to an external storage such as Amazon S3 and store the URL in MongoDB
  • Store as data in MongoDB: Save files of a small size (less than 16 MB) to MongoDB as data of type Buffer

Updating the user model to store a photo in MongoDB

In order to store the uploaded profile photo directly in the database, you can update the user model to add a photo field that stores the file as data of type Buffer, along with its contentType in mern-social/server/models/user.model.js:

photo: {   

       data: Buffer,   

       contentType: String

}

Uploading a photo from the edit form

Users should be able to upload an image file from their local files when editing the profile. You can update the EditProfile component in client/user/EditProfile.js with an upload photo option and then attach the user selected file in the form data submitted to the server.

File input with Material-UI

You can use the HTML5 file input type to let the user select an image from their local files. The file input will return the filename in the change event when the user selects a file in mern-social/client/user/EditProfile.js:

<input accept="image/*" type="file"      

            onChange={this.handleChange('photo')}        

            style={{display:'none'}}        

            id="icon-button-file" />

To integrate this file input with Material-UI components, you can apply display:none to hide the input element from view, then add a Material-UI button inside the label for this file input. This way, the view displays the Material-UI button instead of the HTML5 file input element in mern-social/client/user/EditProfile.js:

<label htmlFor="icon-button-file">  

     <Button variant="raised" color="default" component="span">     

           Upload <FileUpload/>  

     </Button>

</label>

With the Button‘s component prop set to span, the Button component renders as a span element inside the label element. A click on the Upload span or label is registered by the file input with the same ID as the label, and as a result, the file select dialog is opened. Once the user selects a file, you can set it to state in the call to handleChange(…) and display the name in the view in mern-social/client/user/EditProfile.js:

<span className={classes.filename}>   

        {this.state.photo ? this.state.photo.name : ''}

</span>

Form submission with the file attached

Uploading files to the server with a form requires a multipart form submission. You can modify the EditProfile component to use the FormData API to store the form data in the format needed for encoding type multipart/form-data.

First, you need to initialize FormData in componentDidMount() in mern-social/client/user/EditProfile.js:

this.userData = new FormData()

Next, you need to update the input handleChange function to store input values for both the text fields and the file input in FormData in mern-social/client/user/EditProfile.js:

handleChange = name => event => { 

   const value = name === 'photo'   

      ? event.target.files[0]   

      : event.target.value 

this.userData.set(name, value) 

this.setState({ [name]: value })

}

Then, on clicking submit, this.userData is sent with the fetch API call to update the user. As the content type of the data sent to the server is no longer ‘application/json’, you’ll also need to modify the update fetch method in api-user.js to remove Content-Type from the headers in the fetch call in mern-social/client/user/api-user.js:

const update = (params, credentials, user) => { 

    return fetch('/api/users/' + params.userId, {   

      method: 'PUT',   

      headers: {     

          'Accept': 'application/json',     

          'Authorization': 'Bearer ' + credentials.t   

      },   

      body: user

   }).then((response) => {   

     return response.json() 

 }).catch((e) => {   

   console.log(e) 

 })

}

Now if the user chooses to upload a profile photo when editing profile, the server will receive a request with the file attached along with the other field values.

Processing a request containing a file upload

On the server, to process the request to the update API that may now contain a file, you need to use the formidable npm module:

npm install --save formidable

Formidable will allow you to read the multipart form data, giving access to the fields and the file, if any. If there is a file, formidable will store it temporarily in the filesystem. You need to read it from the filesystem, using the fs module to retrieve the file type and data and store it in the photo field in the user model. The formidable code will go in the update controller in mern-social/server/controllers/user.controller.js as follows:

import formidable from 'formidable'

import fs from 'fs'

const update = (req, res, next) => {

  let form = new formidable.IncomingForm()

  form.keepExtensions = true

form.parse(req, (err, fields, files) => {

    if (err) {

      return res.status(400).json({

        error: "Photo could not be uploaded"

      })

    }

    let user = req.profile

    user = _.extend(user, fields)

    user.updated = Date.now()

    if(files.photo){

      user.photo.data = fs.readFileSync(files.photo.path)

      user.photo.contentType = files.photo.type

    }

    user.save((err, result) => {

      if (err) {

        return res.status(400).json({

          error: errorHandler.getErrorMessage(err)

        })

      }

      user.hashed_password = undefined

      user.salt = undefined

      res.json(user)

    })

  })

}

This will store the uploaded file as data in the database. Next, you’ll need to set up file retrieval to be able to access and display the photo uploaded by the user in the frontend views.

Retrieving a profile photo

The simplest option to retrieve the file stored in the database and show it in a view is to set up a route that will fetch the data and return it as an image file to the requesting client.

Profile photo URL

You need to set up a route to the photo stored in the database for each user and also add another route that will fetch a default photo if the given user has not uploaded a profile photo in mern-social/server/routes/user.routes.js:

router.route('/api/users/photo/:userId')

  .get(userCtrl.photo, userCtrl.defaultPhoto)

router.route('/api/users/defaultphoto')

  .get(userCtrl.defaultPhoto)

You need to look for the photo in the photo controller method and if found, send it in the response to the request at the photo route; otherwise, you’ll need to call next() to return the default photo in mern-social/server/controllers/user.controller.js:

const photo = (req, res, next) => {

  if(req.profile.photo.data){

    res.set("Content-Type", req.profile.photo.contentType)

    return res.send(req.profile.photo.data)

  }

  next()

}

The default photo is retrieved and sent from the server’s file system in mern-social/server/controllers/user.controller.js:

import profileImage from './../../client/assets/images/profile-pic.png'

const defaultPhoto = (req, res) => {

  return res.sendFile(process.cwd()+profileImage)

}

Showing a photo in a view

With the photo URL routes set up to retrieve the photo, you can simply use these in the img element’s src attribute to load the photo in the view in mern-social/client/user/Profile.js. For example, in the Profile component, you’ll get the user ID from state and use it to construct the photo URL.

const photoUrl = this.state.user._id

          ? `/api/users/photo/${this.state.user._id}?${new Date().getTime()}`

          : '/api/users/defaultphoto'

To ensure that the img element reloads in the Profile view after the photo is updated in the edit, you need to also add a time value to the photo URL to bypass the browser’s default image caching behavior.

Then, you can set the photoUrl to the Material-UI Avatar component, which renders the linked image in the view:

<Avatar src={photoUrl}/>

The updated user profile in MERN Social can now display a user uploaded profile photo and an about description:

If you found this article helpful, you can explore Shama Hoque’s Full-Stack React Projects to unleash the power of MERN stack. This book guides you through MERN stack-based web development, for creating a basic skeleton application and extending it to build four different web applications.