How to Map and Create New Columns
Mapping and adding columns using our built-in widget is a powerful feature you can use to build your data pipelines.
In Data Pipelines you can map (update) existing columns and add new columns to a dataset. Each column can contain either a literal value or a value derived from other columns. This is done by using the 'Add / Map column' widget in the pipeline builder (Figure 1).
data:image/s3,"s3://crabby-images/2dcdc/2dcdc11e0faf6ab65ba8992678edbd543124e9f4" alt=""
We will use the dataset in Figure 2 for this demo:
data:image/s3,"s3://crabby-images/1f355/1f355f10913211eced991244600ab0fac286086f" alt=""
Multiple columns can be mapped and added simultaneously using the widget. In Figure 3 the widget is configured to update the year column by multiplying its values by 2 and to add a new column containing the literal 'DP Demo'.
data:image/s3,"s3://crabby-images/a61dd/a61dd9e9ec005dbf728641a7c9a936e7a385bd1d" alt=""
Notice that the year column already exists in the dataset, whereas my_new_column does not. After updating the pipeline preview by clicking the Preview button, the result will look like Figure 4.
data:image/s3,"s3://crabby-images/cf9f8/cf9f8cbff3b00cca7f58614761273ad88b85a0e0" alt=""
Notice the following:
- the values in the year column have been multiplied by two
- a new column named my_new_column has been added containing the literal value 'DP Demo'
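The update-or-add behaviour described above can be sketched in plain Python. This is only a model of the semantics, not the Data Pipelines or Spark API: the helper add_or_map_columns, the sample rows, and the choice to evaluate every expression against the input row are all assumptions made for illustration. Only the column names and the two operations come from the demo.

```python
# Plain-Python model of an 'Add / Map column' step (hypothetical helper,
# not the Data Pipelines or Spark API).
def add_or_map_columns(rows, mappings):
    """Return new rows in which each mapping overwrites the column if the
    name already exists, or appends it as a new column if it does not."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, expression in mappings.items():
            # Assumption: each expression reads the original input row.
            new_row[column] = expression(row)
        result.append(new_row)
    return result


# Made-up sample rows; only the column names come from the article.
rows = [
    {"set_num": "001-1", "name": "Gears", "year": 1965},
    {"set_num": "002-1", "name": "Town", "year": 1978},
]

mappings = {
    "year": lambda row: row["year"] * 2,     # existing column: values updated
    "my_new_column": lambda row: "DP Demo",  # new column: literal value
}

preview = add_or_map_columns(rows, mappings)
for row in preview:
    print(row)
```

Because year already exists in the input, its values are replaced in place, while my_new_column is appended after the existing columns.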
When mapping a column using an expression, any Spark SQL function can be used. For example, let's use the concat() function to append the set_num column to the name column with a space in between. The operation widget will look like Figure 5.
data:image/s3,"s3://crabby-images/56f61/56f6106236320176c91436d1ea1a632597f7882e" alt=""
Figure 5. The concat() Spark SQL function used to map the name column
The result will look like Figure 6.
data:image/s3,"s3://crabby-images/399d7/399d7df131ed6670687a42d7b8720b319c2f40df" alt=""
Figure 6. The name column with set_num concatenated to it
Note how the values in the name column have had the values from set_num appended to them, with a space in between.
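As a rough illustration of what this mapping does to each row, here is a plain-Python sketch. It assumes the widget expression is something like concat(name, ' ', set_num); the exact text is in the figure and is not reproduced here. The concat helper below is a stand-in written for this example, not Spark itself, and the sample rows are made up. It also mirrors one behaviour of Spark SQL's concat() worth keeping in mind: if any argument is NULL, the result is NULL.

```python
def concat(*values):
    """Stand-in for Spark SQL concat(): join the arguments as strings,
    returning None (NULL) if any argument is None."""
    if any(value is None for value in values):
        return None
    return "".join(str(value) for value in values)


# Made-up sample rows; only the column names come from the article.
rows = [
    {"set_num": "001-1", "name": "Gears"},
    {"set_num": "002-1", "name": "Town"},
    {"set_num": None, "name": "Castle"},  # missing set_num
]

# Map the existing name column; set_num itself is left unchanged.
mapped = [
    {**row, "name": concat(row["name"], " ", row["set_num"])}
    for row in rows
]

for row in mapped:
    print(row)
```

In this sketch the third row ends up with an empty name because its set_num is missing, so it may be worth checking for missing values in the source columns before mapping.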
Mapping columns this way is a powerful feature in Data Pipelines. All of Spark's built-in functions are available when using expressions.