Author: viral

Analyzing a Dendrogram

dendrogram is a tree data structure that allows us to represent the entire clustering hierarchy produced by either an agglomerative or divisive algorithm. The idea is to put the samples on the x axis and the dissimilarity level on the y axis. Whenever two clusters are merged, the dendrogram shows a connection corresponding to the dissimilarity level at which it occurred. Hence, in an agglomerative scenario, a dendrogram always starts with all samples considered as clusters and moves upward (the direction is purely conventional) until a single cluster is defined.

For didactic purposes, it’s preferable to show the dendrogram corresponding to a very small dataset, X, but all the concepts can be applied to any situation. However, with larger datasets, it will often be necessary to apply some truncations in order to visualize the entire structure in a more compact form.

How to do it?

Let’s consider a small dataset, X, made up of 12 bidimensional samples generated by 4 Gaussian distributions with mean vectors in the range (-1, 1) Ă— (-1, 1):

 The dataset (with the labels) is shown in the following screenshot:

Dataset employed for dendrogram analysis

In order to generate a dendrogram (using SciPy), we first need to create a linkage matrix. In this case, we have chosen a Euclidean metric with Ward’s linkage (I encourage you to perform the analysis with different configurations):

The dm array is a condensed pairwise distance matrix, while Z is the linkage matrix produced by Ward’s method (the linkage() function requires the method parameter, which accepts, among others, the values single, complete, average, and ward). At this point, we can generate and plot the dendrogram (the dendrogram () function can automatically plot the diagram using a default or supplied Matplotlib axis object):

The diagram is displayed in the following screenshot:

Dendrogram corresponding to Ward’s linkage applied to the dataset

As explained in the preceding screenshot, the x axis represents the samples intended to minimize the risk of cross-connections, while the y axis shows the dissimilarity level. Let’s now analyze the diagram from the bottom. The initial state corresponds to all samples considered as independent clusters (so the dissimilarity is null). Moving upward, we start to observe the first mergers. In particular, when the dissimilarity is about 0.35, samples 1 and 3 are merged.

The second step happens with a dissimilarity of slightly below 0.5, when the samples 0 and 9 are also merged. The process goes on until a dissimilarity of about 5.25, when a single cluster is created. Let’s now dissect the dendrogram horizontally when the dissimilarity is equal to 1.25. Looking at the underlying connections, we discover that the clustering structure is: {6}, {758}, {09410}, {11}, {213}.

Clustering

Therefore, we have five clusters, with two of them made up of a single sample. It’s not surprising to observe that samples 6 and 11 are the last ones to be merged. In fact, they are much further apart than all the other ones. In the following screenshot, four different levels are shown (only the clusters containing more than one sample are marked with a circle):

Clusters generated by cutting the dendrogram at different levels (Ward’s linkage)

As is easy to understand, the agglomeration starts by selecting the most similar clusters/samples and proceeds by adding the nearest neighbors, until the root of the tree has been reached. In our case, at a dissimilarity level equal to 2.0, three well-defined clusters have been detected. The left one is also kept in the next cut, while the two right ones (which are clearly closer) are selected for merging in order to yield a single cluster. The process itself is straightforward and doesn’t need particular explanations; however, there are two important considerations.

The first one is inherent to the dendrogram structure itself. Contrary to other methods, hierarchical clustering allows observing an entire clustering tree, and such a feature is extremely helpful when it’s necessary to show how the process evolves by increasing the dissimilarity level. For example, a product recommender application could not provide any information about the desired number of clusters representing the users, but executive management might be interested in understanding how the merging process is structured and evolves.

In fact, observing how clusters are merged allows deep insight into the underlying geometry and it’s also possible to discover which clusters could potentially be considered as parts of larger ones. In our example, at level 0.5, we have the small cluster {13}. The question of “what samples can be added to this cluster by increasing the dissimilarity?” can be answered immediately with {2}. Of course, in this case, this is a trivial problem that can be solved by looking at the data plot, but with high-dimensional datasets, it can become more difficult without the support of a dendrogram.

The second advantage of dendrograms is the possibility to compare the behavior of different linkage methods. Using Ward’s method, the first mergers happen at quite low dissimilarity levels, but there’s a large gap between five clusters and three clusters. This is a consequence of both the geometry and the merging strategy. What happens, for example, if we employ a single linkage (which is intrinsically very different)? The corresponding dendrogram is shown in the following screenshot:

Dendrogram corresponding to single linkage applied to the dataset

Conclusion

The conclusion is that the dendrogram is asymmetric and the clusters are generally merged with a single sample or with small agglomerates. Starting from the right, we can see that samples {11} and {6} were merged very late. Moreover, sample {6} (which could be an outlier) is merged at the highest dissimilarity, when the final single cluster must be produced. The process can be better understood with the following screenshot:

Clusters generated by cutting the dendrogram at different levels (single linkage)

As you can see from the screenshot, while Ward’s method generates two clusters containing all samples, a single linkage aggregates the largest blocks at Level 1.0 by keeping the potential outliers outside. Therefore, the dendrogram also allows defining aggregation semantics, which can be very helpful in a psychometric and sociological context. While Ward’s linkage proceeds in a way that is very similar to other symmetric algorithms, single linkage has a step-wise fashion that shows an underlying preference for clusters built incrementally, resulting in the avoidance of large gaps in the dissimilarity.

Finally, it’s interesting to note that, while Ward’s linkage yields a potential optimal number of clusters (three) by cutting the dendrogram at Level 3.0, single linkage never reaches such a configuration (because cluster {6} is merged only in the final step). This effect is strictly related to the double principle of maximum separation and maximum cohesion. Ward’s linkage tends to find the most cohesive and separated clusters very quickly. It allows cutting the dendrogram when the dissimilarity gap overcomes a predefined threshold (and, of course, when the desired number of clusters has been reached), while other linkages require a different approach and, sometimes, yield undesirable final configurations.

Considering the nature of the problem, I encourage you to test the behavior of all linkage methods and to find out the most appropriate method for some sample scenarios (for example, the segmentation of the population of a country according to education level, occupancy, and income). This is the best approach to increase awareness and to improve the ability to provide a semantic interpretation of the processes (which is a fundamental goal of any clustering procedure).

If you found this article interesting, you can explore Hands-On Unsupervised Learning with Python to discover the skill-sets required to implement various approaches to Machine Learning with Python. Hands-On Unsupervised Learning with Python will help you explore the concept of unsupervised learning to cluster large sets of data and analyze them repeatedly until the desired outcome is found using Python.

Managing Windows Networking with cmdlets

At the heart of every organization is the network—the infrastructure that enables client and server systems to interoperate. Windows has included networking features since the early days of Windows for Workgroups 3.1 (and earlier with Microsoft LAN Manager).

Many of the tools that IT pros use today have been around for a long time, but have more recently been replaced by PowerShell cmdlets. In this article, we look at some of the old commands and their replacement cmdlets.

New ways to do old things

Networking IT pros in the Windows Server space have been using a number of console applications to perform basic diagnostics for decades. Tools such as Ipconfig, Tracert, and NSlookup are used by IT pros all over the world. The network shell (netsh) is another veritable Swiss Army Knife set of tools to configure and manage Windows networking components.

PowerShell implements a number of cmdlets that do some of the tasks that older Win32 console applications provided. Cmdlets, such as Get-NetIPConfiguration and Resolve-DnsName, are newer alternatives to ipconfig.exe and nslookup.exe.

These cmdlets also add useful functionality. For example, using Test-NetConnection enables you to check whether a host that might block ICMP is supporting inbound traffic on a particular port. ping.exe only uses ICMP, which may be blocked somewhere in the path to the server.

One administrative benefit of using cmdlets rather than older console applications relates to remoting security. With JEA, you can constrain a user to only be able to use certain cmdlets and parameter values. In general, cmdlets make it easier for you to secure servers that are open for remoting.

This article shows you some of the new cmdlets that are available with PowerShell and Windows Server 2019.

Getting ready

This recipe uses two servers: DC1.Reskit.Org and SRV1.Reskit.Org. DC1 is a domain controller in the Reskit.Org domain and SRV1 is a member server. This article assumes that you have already set up DC1 as a domain controller. Run this exercise on SRV1.

How to do it…

  • Examine two ways to retrieve the IP address configuration (ipconfig versus a new cmdlet):
  • Ping a computer:
  • Use the sharing folder from DC1:
  • Share a folder from SRV1:
  • Display the contents of the DNS client cache:
  • Clear the DNS client cache using old and new methods:
  • Perform DNS lookups:

How it works…

In step 1, you examined the old/new way to view the IP configuration of a Windows host using ipconfig.exe and the Get-NetIPConfiguration cmdlet. First, you looked at two variations of using ipconfig.exe, which looks like this:

The Get-NetIPConfiguration cmdlet returns similar information, as follows:

In step 2, you examined the ping.exe command and the newer Test-NetConnection cmdlet. Using these two commands to ping DC1 (from SRV1) looks like this:

The Test-NetConnection cmdlet is also able to do some things that ping.exe cannot do, including testing access to a specific port (as opposed to just using ICMP) on the target host and providing more detailed information about connecting to that remote port, as you can see here:

In step 3, you examined new and old ways to create a drive mapping on the local host (that points to a remotely shared folder). The net.exe command, which has been around since the days of Microsoft LAN Manager, enables you to create and view drive mappings. The SMB cmdlets perform similar functions, as you can see here:

In step 4, you created and viewed an SMB share on SRV1, using both net.exe and the SMB cmdlets. This step looks like this:

DNS is all too often the focus of network troubleshooting activity. The Windows DNS client holds a cache of previously resolved network names and their IP addresses. This avoids Windows systems from having to perform DNS lookups every time a network host name is used. In step 5, you looked at the old and new ways to view the local DNS cache, which looks like this:

One often-used network troubleshooting technique involves clearing the DNS client cache. You can use ipconfig.exe or the Clear-DNSClientCache cmdlet, as shown in step 6. Neither the ipconfig.execommand or the Clear-DNSClientCache cmdlet produce any output.

Another troubleshooting technique involves asking the DNS server to resolve a DNS name. Traditionally, you would have used nslookup.exe. This is replaced with the Resolve-DNSName cmdlet. The two methods that you used in step 7 look like this:

There’s more…

In step 1, you looked at two ways of discovering a host’s IP configuration. Get-NetIPconfiguration, by default, returns the host’s DNS server IP address, whereas ipconfig.exe doesn’t. On the other hand, ipconfig.exe is considerably quicker.

Ping is meant to stand for Packet InterNetwork Groper and has been an important tool to determine whether a remote system is online. ping.exe uses ICMP echo request/reply, but many firewalls block ICMP (it has been an attack vector in the past). The Test-NetConnection cmdlet has the significant benefit that it can test whether the remote host has a particular port open. On the other hand, the host might block ICMP, if the host is to provide a service, for example, SMB shares, then the relevant port has to be open. Thus, Test-NetConnection is a lot more useful for network troubleshooting.

In step 2, you pinged a server. In addition to ping.exe, there are numerous third-party tools that can help you determine whether a server is online. The TCPing application, for example, pings a server on a specific port using TCP/IP by opening and closing a connection on the specified port. You can download this free utility from https://www.elifulkerson.com/projects/tcping.php.

If you found this article interesting, you can explore Windows Server 2019 Automation with PowerShell Cookbook to automate Windows server tasks with the powerful features of the PowerShell Language. Windows Server 2019 Automation with PowerShell Cookbook helps the reader learn how to use PowerShell and manage core roles, features, and services of Windows Server 2019.

Bug Bounty Hunting Basics

Bug bounty hunting is a method for finding flaws and vulnerabilities in web applications; application vendors reward bounties, and so the bug bounty hunter can earn money in the process of doing so. Application vendors pay hackers to detect and identify vulnerabilities in their software, web applications, and mobile applications. Whether it’s a small or a large organization, internal security teams require an external audit from other real-world hackers to test their applications for them. That is the reason they approach vulnerability coordination platforms to provide them with private contractors, also known as bug bounty hunters, to assist them in this regard.

Bug bounty hunters possess a wide range of skills that they use to test applications of different vendors and expose security loopholes in them. Then they produce vulnerability reports and send them to the company that owns the program to fix those flaws quickly. If the report is accepted by the company, the reporter gets paid. There are a few hackers who earn thousands of dollars in a single year by just hunting for vulnerabilities in programs.

The bug bounty program, also known as the vulnerability rewards program (VRP), is a crowd-sourced mechanism that allows companies to pay hackers individually for their work in identifying vulnerabilities in their software. The bug bounty program can be incorporated into an organization’s procedures to facilitate its security audits and vulnerability assessments so that it complements the overall information security strategy. Nowadays, there are a number of software and application vendors that have formed their own bug bounty programs, and they reward hackers who find vulnerabilities in their programs.

The bug bounty reports sent to the teams must have substantial information with proof of concept regarding the vulnerability so that the program owners can replicate the vulnerability as per how the researcher found it. Usually the rewards are subject to the size of the organization, the level of effort put in to identify the vulnerability, the severity of the vulnerability, and the effects on the users.

Statistics state that companies pay more for bugs with high severity than with normal ones. Facebook has paid up to 20,000 USD for a single bug report. Google has a collective record of paying 700,000 USD to researchers who reported vulnerabilities to them. Similarly, Mozilla pays up to 3,000 USD for vulnerabilities. A researcher from the UK called James Forshaw was rewarded 100,000 USD for identifying a vulnerability in Windows 8.1. In 2016, Apple also announced rewards up to 200,000 USD to find flaws in iOS components, such as remote execution with kernel privileges or unauthorized iCloud access.

In this article, we will cover the following topics:

  • Bug bounty hunting platforms
  • Types of bug bounty programs
  • Bug bounty hunter statistics
  • Bug bounty hunting methodology

Bug bounty hunting platforms

A few years ago, if someone found vulnerability in a website, it was not easy to find the right method to contact the web application owners and then too after contacting them it was not guaranteed that they would respond in time or even at all. Then there was also the factor of the web application owners threatening to sue the reporter. All of these problems were solved by vulnerability co-ordination platforms or bug bounty platforms. A bug bounty platform is a platform that manages programs for different companies. The management includes:

  • Reports
  • Communication
  • Reward payments

There are a number of different bug bounty platforms being used by companies nowadays. The top six platforms are explained in the following sections.

HackerOne

HackerOne is a vulnerability collaboration and bug bounty hunting platform that connects companies with hackers. It was one of the first start-ups to commercialize and utilize crowd-sourced security and hackers as a part of its business model, and is the biggest cybersecurity firm of its kind.

Bugcrowd

Bugcrowd Inc. is a company that develops a coordination platform that connects businesses with researchers so as to test their applications. It offers testing solutions for web, mobile, source code, and client-side applications.

Cobalt

Cobalt’s Penetration Testing as a Service (PTaaS) platform converts broken pentest models into a data-driven vulnerability co-ordination engine. Cobalt’s crowdsourced SaaS platform delivers results that help agile teams to pinpoint, track, and remediate vulnerabilities.

Synack

Synack is an American technology company based in Redwood City, California. Synack’s business includes a vulnerability intelligence platform that automates the discovery of exploitable vulnerabilities for reconnaissance and turns them over to the company’s freelance hackers to create vulnerability reports for clients.

Types of bug bounty program

Bug bounty programs come in two different types based on their participation perspectives. This division is based on the bug bounty hunter’s statistics and their level of indulgence overall on a platform. There are two kinds of bug bounty program: public programs and private programs.

Public programs

A public bug bounty program is one that is open to anyone who wants to participate. This program may prohibit some researchers from participating based on the researcher’s level and track record, but in general, anyone can participate in a public bounty program and this includes the scope, the rules of engagement, as well as the bounty guidelines. A public program is accessible by all researchers on the platform, and all bug bounty programs outside of the platforms are also considered bug bounty programs.

Private programs

A private bug bounty program is one that is an invite-only program for selected researchers. This is a program that allows only a few researchers to participate and the researchers are invited based on their skill level and statistics. Private programs only select those researchers who are skilled in testing the kinds of applications that they have. The programs tend to go public after a certain amount of time but some of them may never go public at all. These programs provide access only to those researchers that have a strong track record of reporting good vulnerabilities, so to be invited to good programs, it is required to have a strong and positive record.

Differences

There are a few differences between a public and private program. Conventionally, programs tend to start as private and over time evolve into the public. This is not always true but, mostly, businesses start a private bug bounty program and invite a group of researchers that test their apps before the program goes public to the community. Companies usually consider a few factors before they start a public program. There has to be a defined testing timeline and it is advised that companies initially work with researchers who specialize in that particular area to identify the flaws and vulnerabilities.

Most of the time, the companies do not open their programs to the public and limit the scope of testing as well so as to allow researchers to test these applications specifically in the sections that are critical. This reduces the number of low-severity vulnerabilities in out-of-scope applications. Many organizations use this technique to verify their security posture. Many researchers hunt for bugs in applications mainly for financial gain, so it is crucial that the organization outlines their payout structure within the program’s scope. There are a few questions before anyone would want to start to participate in a bug bounty program; the most important one is What is the end goal of the program going public versus keeping it private?

Bug bounty hunter statistics

A bug bounty hunter’s profile contains substantial information about the track record that helps organizations identify the skill level and skill set of the user. The bug bounty hunter stats include a number of pointers in the profile that indicate the level of the researcher. Different pointers indicate different levels on different platforms. But generally you will see the following pointers and indicators based on which you can judge a researcher’s potential.

Number of vulnerabilities

The first thing you can observe in a researcher’s profile is how many vulnerabilities the researcher has reported in his bug bounty hunting career. This indicated how much the researcher is active on the platform and how many vulnerabilities he has reported to date. A high number of reported vulnerabilities does not usually mean that the researcher has a positive track record and is relative to different factors. That is, if the researcher has 1,000 vulnerabilities submitted over a period of 1 year, the researcher is quite active.

Number of halls of fame

This is the number of programs to which the researcher has reported positive vulnerabilities to. The number of halls of fame is the number of programs the researcher participated in and had valid reports in those programs. A high number of programs means the level of participation of the researcher is active. That is, if the researcher has 150 halls of fame out of a total of 170 programs, the researcher is successful.

Reputation points

This is a relatively new indicator and it differs from platform to platform. Reputation points are points awarded for valid reports. It is a combination of the severity of the report, the bounty awarded to the report, and the bonus bounty of the report. That is, if the researcher has 8,000 reputation points over time, then he is above average.

Signal

A signal is an aggregate representation of report validity. It is basically a point-based system that represents how many invalid reports the researcher has submitted. A signal is calculated out of 10.

Impact

Impact is a representation of the average bounty awarded per report. It is an aggregate of the total bounty that was awarded for every report that was filed.

Accuracy

This is a percent-based system that indicates the number of accepted reports divided by the number of total reports. This tells the program owners how much success rate the researcher has in reporting vulnerabilities. If researcher A has 91% accuracy rate, he submits reports that are mostly valid.

Bug bounty hunting methodology

Every bug bounty hunter has a different methodology for hunting vulnerabilities and it normally varies from person to person. It takes a while for a researcher to develop their own methodology and lots of experimentation as well. However, once you get the hang of it, it is a self-driven process. The methodology of bug bounty hunting that I usually follow looks something like this:

  1. Analyzing the scope of the program: The scope guidelines have been clearly discussed in the previous chapters. This is the basic task that has to be done. The scope is the most important aspect of a bug bounty program because it tells you which assets to test and you don’t want to spend time testing out-of-scope domains. The scope also tells you which are the most recent targets and which are the ones that can be tested to speed up your bounty process.
  2. Looking for valid targets: Sometimes the program does not necessarily have the entire infrastructure in its scope and there are just a number of apps or domains that are in the scope of the program. Valid targets are targets that help you quickly test for vulnerabilities in the scope and reduce time wasting.
  3. High-level testing of discovered targets: The next thing to do is a quick overview of targets. This is usually done via automated scanning. This basically tells the researchers whether the targets have been tested before or have they been tested a long time ago. If automated scanning does not reveal vulnerabilities or flaws within a web application or a mobile application, it is likely that the application has been tested by researchers before. Nonetheless, it is still advised to test that application one way or another, as this reveals the application’s flaws in detail.
  4. Reviewing all applications: This is a stage where you review all the applications and select the ones based on your skill set. For instance, Google has a number of applications; some of them are coded in Ruby on Rails, some of them are coded in Python. Doing a brief recon on each application of Google will reveal which application is worth testing based on your skill set and level of experience. The method of reviewing all the applications is mostly information gathering and reconnaissance.
  5. Fuzzing for errors to expose flaws: Fuzzing is termed as iteration; the fastest way to hack an application is to test all of its input parameters. Fuzzing takes place at the input parameters and is a method of iterating different payloads at different parameters to observe responses. When testing for SQL injection vulnerabilities and cross-site scripting vulnerabilities, fuzzing is the most powerful method to learn about errors and exposure of flaws. It is also used to map an application’s backend structure.
  6. Exploiting vulnerabilities to generate POCs: By fuzzing, we identify the vulnerabilities. In other scenarios, vulnerability identification is just one aspect of it. In bug bounty hunting, vulnerabilities have to be exploited constructively to generate strong proof of concepts so that the report is considered in high regard. A well explained the proof of concepts will accelerate the review process. In conventional penetration tests, vulnerability exploitation is not that important, but in bug bounty hunting, the stronger the proof of concept, the better the reward.

If you found this article interesting and insightful, and would like to learn more, you must explore Bug Bounty Hunting Essentials. A simple and end-to-end guide, Bug Bounty Hunting Essentials is for white-hat hackers or anyone who wants to understand the concept behind bug bounty hunting and understand this brilliant way of penetration testing.

How to Do Math in Bash

BASH, “In the beginning… was the command line” Years ago, we didn’t have fancy frameworks that handled our distributed computing for us, or applications that could read files intelligently and give us accurate results. If we did, it was very expensive or only worked for a small problem set, very few people had access to this technology, and it was mostly proprietary.

For newcomers to the world of data science, you might have used the command line for a small number of things. Maybe you moved a file from one place to another using mv or read a file using cat. Or you might have never used the command line at all or at least not for data science. In this article, we hope to show some tools and ways you can perform some Math in bash.

Why bash?

We’re focusing on Bourne-again shell (bash) for multiple reasons. First, it’s the most popular shell and you can find it everywhere. In fact, for the majority of Linux distributions, bash is the default shell. It’s a great first shell to learn and very easy to work with. There are a number of examples and resources available to help you with bash if you ever get stuck. From a bare-metal installation in a data center to an instance running in the cloud, bash is there, installed, and waiting for input.

There are a number of other shells you can choose from, such as the Z shell (zsh). The Z shell is fairly new (and by new I mean released in 1990, which is new in shell land) and provides a number of powerful features. Other notable shells are tcshksh, and fish. The C Shell (tcsh), the Korn Shell (ksh), and the Friendly Interactive Shell (fish) are still widely used today. FreeBSD has made tcsh its default shell for the root user and ksh is still used for a lot of Solaris operating systems. Fish is also a great starter shell with a lot of features to help the user navigate the shell without feeling lost.

While these shells are still very powerful and stable, we will be focusing on using bash, as we want to focus on consistency across multiple platforms and help you learn a very active and popular shell that’s been around for 30 years.

Math in bash itself

Bash itself is able to do simple integer arithmetic. When a little more capability is required, two command-line tools, bc and awk, are capable of doing many types of calculations. There are at least three different ways to accomplish this in bash.

Using let

You can use the let command to do simple bash arithmetic:

Basic arithmetic

You can do addition, subtraction, multiplication (be sure to escape the * operator with \*) and integer division:

The numbers must be separated by spaces.

Double-parentheses

Similar to let, you can do simple integer arithmetic in bash using doubled parentheses:

To see the full range of operations available in the shell, check out the GNU reference page at https://www.gnu.org/software/bash/manual/html_node/Shell-Arithmetic.html.

bc, the unix basic calculator

bc is a calculator scripting language. Scripts in bc can be executed with the bc command. Imagine a test.bc file contains the following code:

That means you can run bc like this:

bc can do far more than just divide two numbers. It’s a fully-fledged scripting language on its own and you can do arbitrarily complex things with a bc script. A bc script might be the ending point of a pipeline of data, where, initially, the data files are massaged into a stream of data rows, and then a bc script is used to compute the values we’re looking for. Let’s illustrate this with a simple example.

In this example, we need to take a CSV data file and compute the average of the second number in each row and also compute the sum of the fourth number in each row, say we have a bc function to compute something interesting on these two numbers such as a harmonic mean. We can use awk to output the numbers into a bc script and then feed the result into bc using a pipe.

So, say our bc function to compute the harmonic mean of two numbers looks like this:

We can use awk to find the two numbers and construct the bc script and then pipe it to bc to execute:

See the bc documentation at https://www.gnu.org/software/bc/manual/html_mono/bc.html for more things you could do with bc.

Math in (g)awk

awk (including the gnu implementation, gawk) is designed to stream text processing, data extraction, and reporting. A large percentage of practical statistics is made up of counting things in specific ways, and this is one of the things awk excels at. Tallying totals, histograms, and grouped counts are all very easy in awk.

An awk program is structured as a set of patterns that are matched and actions to take when those patterns are matched:

For each record (usually each line of text passed to awk), each pattern is tested to see whether the record matches, and if so, the action is taken. Additionally, each record is automatically split into a list of fields by a delimiter. The default action, if none is given, is to print the record. The default pattern is to match everything. There are two special patterns, BEGIN and END, which are matched only before any records are processed, or after, respectively.

The power of awk lies in its variables: variables can be used without a declaration. There are some special variables already available to you that are useful for math:

Additionally, you can assign values to your own variables. awk natively supplies variables that can hold strings, integers, floating point numbers, and regular expressions and associative arrays.

As an example, say we want to count the word frequency in the reviews of our test data. Run this code:

It will produce these results:


Counting the word frequency in the reviews of our test data

Say we’d like to compute a histogram of the star values of the reviews. This is also very easy with awk:

The preceding code produces this:


Computing a histogram of the star values of the reviews

We can see that four- and five-star reviews dominate this dataset. Besides counting, awk is also great for manipulating the format of strings.

This article showed how the command line offers several options for doing arithmetic and other mathematical operations. Simple arithmetic and grouped tallies can be performed using bash itself or awk. If you found it interesting, you can explore Hands-On Data Science with the Command Line for big data processing and analytics at speed and scale using command line tools. Hands-On Data Science with the Command Line will help you learn how to speed up the process and perform automated tasks using command-line tools.

RESTful APIs and Testing

A software application product has various software layers such, as the user interface (UI), the business logic layer, middleware, and a database. Testing and certification primarily focuses on data integration tests on the Business layer. API testing is software testing that involves direct API testing, unlike other generic tests, which primarily involve the UI:

The preceding diagram depicts the typical layers of software, with API testing on the Business layer and the functional or UI testing on the Presentation layer.

Understanding API testing approaches

Agreeing on an approach for API testing when beginning API development is an essential API strategy. Let’s look at a few principles of API testing:

  • Clear definition of the scope and a good understanding of the functionality of the API
  • Common testing methodologies such as boundary analysis and equivalence classes are part of API test cases
  • Plan, define, and be ready with input parameters, zero, and sample data for the API
  • Determine and compare expected and actual results, and ensure that there are no differences

API testing types

In this section, we will review the various categories of API testing.

Unit tests

Tests that involve the validation of individual operations are unit tests. The following is one of the sample code snippets of a specific unit test case that validates getting all the investors from the API:

API validation tests

All software needs quick evaluation and to assert its purpose of creation. The validation tests need to be run for every function that is developed, at the end of the development process. Unlike unit tests, which focus on particular pieces or functions of the API, validation tests are a higher-level consideration, answering a set of questions so that the development can move on to the next phase.

A set of questions for validation tests could be the following:

  • A product-specific question, such as, is it the necessary function that is asked for?
  • A behavioral question, such as, is the developed function doing what is intended?
  • An efficiency-related question, such as, is the intended function using the necessary code, in an independent and optimized manner?

All of these questions, in essence, serve to validate the API in line with the agreed acceptance criteria and also to ensure its adherence to standards regarding the delivery of expected end goals and meeting user needs and requirements flawlessly.

Functional tests

Tests that involve specific functions of the APIs and their code base are functional tests. Validating the count of active users through the API, regression tests, and test case execution come under functional tests. The following screenshot demonstrates one such functional testing example of investor service validation for user authentication:

UI or end-to-end tests

Tests that involve and assert end-to-end scenarios, including GUI functions and API functions, which in most of the cases, validate every transaction of an application, are grouped under end-to-end tests.

Load testing

As we know, an increase in the number of end users should not affect the performance of the functions of an application. Load testing will uncover such issues and also validate the performance of an API in normal conditions too.

Runtime error detection tests

Tests that help monitor the application and detect problems such as race conditions, exceptions, and resource leaks belong in the runtime error tests category. The following points capture a brief about those factors.

Monitoring APIs

Tests for various implementation errors, handler failures, and other inherent concerns inside the API code base and ensures it does not have any holes that would lead to application insecurity.

Execution errors

Valid requests to the API return responses and asserting them for expected valid responses is common; however, asserting invalid requests for expected failures is also essential as part of an API testing strategy and those tests come under execution errors:

The preceding screenshot depicts an example of expecting an error when the user gives an ID that is not present on the system.

Resource leaks

Negative tests to validate the underlying API resource malfunctions by submitting invalid requests to the API. The resources, in this case, are memory, data, insecurities, timeout operations, and so on.

Error detection

Detect network communication failures. Authentication failures from giving the wrong credentials is an example error detection scenario. These are tests ensure the errors are captured and then resolved as well:

Here’s an authentication error, and the previous screenshot depicts this, as the code returns 401 (as it should); this is an example of an error detection test.

If you found this article interesting, you can explore Hands-On RESTful API Design Patterns and Best Practices to build effective RESTful APIs for enterprise with design patterns and REST framework’s out-of-the-box capabilities. Hands-On RESTful API Design Patterns and Best Practices helps you explore the concepts of service-oriented architecture (SOA), event-driven architecture (EDA), and resource-oriented architecture (ROA).