How to Map and Create New Columns

In Data Pipelines it is possible to map (update) existing columns and to add new columns to a dataset. The new values can be either literals or values derived from other columns. This is done using the 'Add / Map column' widget in the pipeline builder (Figure 1).

Figure 1. The Add / Map column widget.

We will be using the dataset in Figure 2 for this demo:

Figure 2. Example dataset containing text and integer columns.
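
For reference, the columns used in this demo are the following (types inferred from the article; the dataset in the figure may contain additional columns):

    set_num  (text)
    name     (text)
    year     (integer)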

Multiple columns can be mapped and added simultaneously using the widget. In Figure 3 the widget is configured to update the year column by multiplying its values by two and to add a new column, my_new_column, containing the literal 'DP Demo'.

Figure 3. The Add / Map column widget configured to update a column and add a new one.
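
In expression form, the two mappings in Figure 3 correspond to something like the following (a sketch, assuming the widget's expression field accepts standard Spark SQL; field labels may differ between versions):

    year * 2     -- maps (updates) the existing year column
    'DP Demo'    -- literal value for the new my_new_column column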

Notice that the year column already exists in the dataset whereas my_new_column does not. After updating the pipeline preview by clicking the Preview button, the result will look like Figure 4.

Figure 4. The resulting dataset after the Add / Map column operation.

Notice the following:

  • the values in the year column have been multiplied by two
  • a new column named my_new_column has been added containing the literal value 'DP Demo'

When mapping a column using an expression, any Spark SQL function can be used. For example, let's use the concat() function to append the set_num column to the name column with a space in between. The operation widget will look like Figure 5.

Figure 5. Using the concat() Spark SQL function to map the name column.
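
The mapping expression behind Figure 5 would be along these lines (a sketch; the exact expression in the figure may differ, but standard Spark SQL concat() accepts any number of string arguments):

    concat(name, ' ', set_num)   -- name, then a space, then set_num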

The result will look like Figure 6.

Figure 6. The name column with set_num concatenated to it.

Note how each value in the name column now has the corresponding set_num value appended to it, separated by a space.

Mapping columns this way is a powerful feature in Data Pipelines. All of Spark's built-in functions are available when using expressions.
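
To give a flavour of what else is possible, here are a few more standard Spark SQL expressions that could be used the same way (illustrative examples only, not part of the demo above):

    upper(name)               -- upper-case the name values
    year + 1                  -- arithmetic on the integer year column
    substring(set_num, 1, 3)  -- first three characters of set_num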