How to Map and Create New Columns
In Data Pipelines it is possible to map (update) and add new columns to a dataset containing either a literal value or a value derived from other columns. This is done by using the 'Add / Map column' widget in the pipeline builder (Figure 1.).
We will be using the dataset in Figure 2. for this demo:
Multiple columns can simultaneously be mapped and added using the widget. In Figure 3. the widget is configured to update the year
column by multiplying its values by 2 and add a new column containing the literal 'DP Demo'.
Notice that the year
column already exists in the dataset whereas the my_new_column
does not. After updating the pipeline preview by clicking the Preview button the result will look like Figure 4.
Notice the following:
- the values in the
year
column have been multiplied by two - a new column named
my_new_column
has been added containing the literal value 'DP Demo'
When mapping a column using an expression, any Spark SQL function can be used. For example, let's use the concat()
function to append the the set_num
column to the the name
column with a space in between. The operation widget will look like Figure 5.
concat()
Spark SQL function to map the name
columnThe result will look like Figure 6.
name
column with set_num
concatenated to itNote how the values in the name
column had the values from set_num
appended to them with a space in between.
Mapping columns this way is a powerful feature in Data Pipelines. All of Spark's built-in functions are available when using expressions.