Transforming Data in Different Ways in Pentaho

The set of operations covered in this tutorial is not a full list of the available options, but includes the most common ones, and will inspire you when you come to implement others.

Note that the files used in this tutorial were built with data downloaded from www.numbeo.com, a site containing information about living conditions in cities and countries worldwide. Before continuing, make sure you download the set of data from https://github.com/PacktPublishing/Pentaho-Data-Integration-Quick-Start-Guide.

Extracting data from existing fields

First, you’ll learn how to extract data from fields that exist in your dataset in order to generate new fields. For the first exercise, you’ll read a file containing data about the cost of living in Europe. The content of the file looks like this:

Rank City Cost of Living Index Rent Index Cost of Living Plus Rent Index Groceries Index Restaurant Price Index Local Purchasing Power Index

1 Zurich, Switzerland 141.25 66.14 105.03 149.86 135.76 142.70

2 Geneva, Switzerland 134.83 71.70 104.38 138.98 129.74 130.96

3 Basel, Switzerland 130.68 49.68 91.61 127.54 127.22 139.01

4 Bern, Switzerland 128.03 43.57 87.30 132.70 119.48 112.71

5 Lausanne, Switzerland 127.50 52.32 91.24 126.59 132.12 127.95

6 Reykjavik, Iceland 123.78 57.25 91.70 118.15 133.19 88.95

...

As you can see, the city field also contains the country name. The purpose of this exercise is to extract the country name from this field. In order to do this, go through the following steps:

  1. Create a new transformation and use a Text file input step to read the cost_of_living_europe.txt file.
  2. Drag a Split Fields step from the Transform category and create a hop from the Text file input towards the Split Fields step.
  3. Double-click the step and configure it, as shown in the following screenshot:
  4. Close the window and run a preview. You’ll see the following:

As you can see, the Split Fields step can be used to split the value of a field into two or more new fields. This step is perfect for the purpose of obtaining the country name because the values were easy to parse. You had a value, then a comma, then another value. This is not always the case, but PDI has other steps for doing similar tasks. Take a look at another method for extracting pieces from a field.

This time, you’ll read a file containing common daily food items and their prices. The file has two fields—food and price—and looks as follows:

Food Price

Milk (regular), (0.25 liter) 0.19 €

Loaf of Fresh White Bread (125.00 g) 0.24 €

Rice (white), (0.10 kg) 0.09 €

Eggs (regular) (2.40) 0.33 €

Local Cheese (0.10 kg) 0.89 €

Chicken Breasts (Boneless, Skinless), (0.15 kg) 0.86 €

...

Suppose that you want to split the Food field into three fields for the name, quantity, and number of units respectively. Taking the value in the first row, Milk (regular), (0.25 liter), as an example, the name would be Milk (regular), the quantity would be0.25, and the unit would be liter. You cannot solve this as you did before, but you can use regular expressions instead. In this case, the expression to use will be (.+)\(([0-9.]+)( liter| g| kg| head|)\).*.

Try it using the following steps:

  1. Create a new transformation and use a Text file input step to read the recommended_food.txt file. In order to define the Price as a number, use the format #.00 €.
  2. Drag a Regex Evaluation step from the Scripting category and create a hop from the Text file input toward this new step.
  3. Double-click the step and configure it as shown in the following screenshot. Don’t forget to check the Create fields for capture groups option:
  4. Close the window and run a preview. You will see the following:

The RegEx Evaluation step can be used just to evaluate whether a field matches a regular expression, or to generate new fields, as in this case. By capturing groups, you could create a new field for each group captured from the original field. You will also notice a field named result, which, in this example, has Y as its value. This Y means that the original field matched the given expression.

Note that while the Split Fields step removes the original field from the dataset, the RegEx Evaluation step does not.

These are not the only steps that will allow this kind of operation.

More ways to create new fields

Besides just extracting data from the incoming fields, you can also combine the fields by performing arithmetic operations between them, concatenating String fields, and using other methods. Just as in the previous section, this section will expand on a simple example that will serve you as a model for creating your own process.

For this tutorial, you’ll continue using the file containing data about the cost of living. This time, you’ll generate a new field that creates a new index out of the average of the restaurant price index and the groceries index. To do this, go through the following steps:

  1. Create a new transformation and use a Text file input step to read the cost_of_living_europe.txt file.
  2. Drag a Calculator step from the Transform category and create a hop from the Text file input toward the calculator.
  3. Double-click the step and configure it as shown in the following screenshot:
  4. Close the window and run a preview. You will see the following:

As you can deduce from the configuration window, with the Calculator step, you can create new fields by using temporary fields in the way. In the final dataset, you can see each temporary field—two and temp, in the example—as a new column.

The Calculator step is a handy step that can be used for performing many common types of operations, such as arithmetic, string, and date operations, among others.

Of course, there is a simpler way for doing the calculation in the last transformation:

  1. Save the previous transformation under a different name.
  2. Remove the Calculator step. You can do this just by selecting it and pressing Delete.
  3. Drag and drop a User Defined Java Expression step from the Scripting folder. Create a hop from the Text file input step toward this new step.
  4. Double-click the step and configure it as shown in the following screenshot:
  5. Close the window and run a preview. You should see exactly the same result as before.

The Java Expression step is a powerful step that allows you to create fields of any type—not just numbers—by using a wide variety of expressions, as long as they can be expressed in a single line using Java syntax.

In the last example, using the Java Expression step was simpler than doing the same with a Calculator step. Depending on the case, it can be more convenient to use one or the other.

This was just an example that showed you how to add new fields based on the fields in your dataset. There are many steps available, and with different purposes. You will find them mainly in the Transform folder, but there are some others in different folders in the Design tab. No matter which step you pick, the way you use it is always the same. You add the step at the end of the stream and then configure it properly according to your needs.

If you found this article interesting and helpful, you can check out Pentaho Data Integration Quick Start Guide. Featuring simplified and easy-to-follow examples, Pentaho Data Integration Quick Start Guide takes you through the underlying concepts in a lucid manner, so you can create efficient ETL processes using Pentaho.

Leave a Reply