How to Map and Create New Columns
In Data Pipelines it is possible to map (update) and add new columns to a dataset containing either a literal value or a value derived from other columns. This is done by using the 'Add / Map column' widget in the pipeline builder (Figure 1.).
We will be using the dataset in Figure 2. for this demo:
Multiple columns can simultaneously be mapped and added using the widget. In Figure 3. the widget is configured to update the year
column by multiplying its values by 2 and add a new column containing the literal 'DP Demo'.
Notice that the year
column already exists in the dataset whereas the my_new_column
does not. After updating the pipeline preview by clicking the Preview button the result will look like Figure 4.
Notice the following:
- the values in the
year
column have been multiplied by two - a new column named
my_new_column
has been added containing the literal value 'DP Demo'
When mapping a column using an expression, any Spark SQL function can be used. For example, let's use the concat()
function to append the the set_num
column to the the name
column with a space in between. The operation widget will look like Figure 5.
The result will look like Figure 6.
Note how the values in the name
column had the values from set_num
appended to them with a space in between.
Mapping columns this way is a powerful feature in Data Pipelines. All of Spark's built-in functions are available when using expressions.