Preprocessing.Rmd
The preprocess function transforms a transactional table into a customer-aggregated table, with configurable aggregation methods for numeric and categorical columns.
Currently, the input must be a transaction table containing four required columns: transactionid, id, orderdate, and transactionvalue. The function raises an error if any of these column names is missing.
formatted <- preprocess(citrus::transactional_data)
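To make the column requirement concrete, here is a minimal sketch of the kind of check described above. The helper check_required_columns and the toy table are hypothetical and written for illustration only; they are not part of citrus, and the package's actual error message may differ.

```r
# Hypothetical helper (not part of citrus): stops if a required column is absent.
check_required_columns <- function(df,
                                   required = c("transactionid", "id",
                                                "orderdate", "transactionvalue")) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy transaction table with made-up values
toy <- data.frame(
  transactionid = 1:3,
  id = c("a", "a", "b"),
  orderdate = as.Date(c("2021-01-01", "2021-02-01", "2021-01-15")),
  transactionvalue = c(10, 20, 5)
)

# Passes silently when all four columns are present
check_required_columns(toy)

# Dropping the fourth column (transactionvalue) triggers the error;
# tryCatch captures its message instead of halting the script
result <- tryCatch(
  check_required_columns(toy[, -4]),
  error = function(e) conditionMessage(e)
)
```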
If numeric_operation_list = NA is the only option set and the data set contains no categorical columns, the function defaults to RFM (recency, frequency, monetary value) preprocessing. The output can be used, for example, in unsupervised learning.
formatted <- preprocess(citrus::transactional_data, numeric_operation_list = NA)
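For readers unfamiliar with RFM, the following base R sketch shows one common way to compute the three features per customer. It is an illustration under stated assumptions, not the package's implementation: here recency is the number of days between a customer's last order and the latest order in the data, frequency is the number of transactions, and monetary is the summed transaction value. The definitions used by preprocess may differ, and the toy data is made up.

```r
# Toy transactions (made-up values)
tx <- data.frame(
  transactionid = 1:4,
  id = c("a", "a", "b", "b"),
  orderdate = as.Date(c("2021-01-01", "2021-03-01", "2021-02-01", "2021-02-10")),
  transactionvalue = c(10, 20, 5, 15)
)

# Assumption: recency is measured against the most recent order in the data
reference_date <- max(tx$orderdate)

# Build one row of RFM features per customer
rfm <- do.call(rbind, lapply(split(tx, tx$id), function(d) {
  data.frame(
    id = d$id[1],
    recency = as.numeric(difftime(reference_date, max(d$orderdate), units = "days")),
    frequency = nrow(d),
    monetary = sum(d$transactionvalue)
  )
}))
rownames(rfm) <- NULL
```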
Numeric columns can be aggregated with various methods, such as the minimum (‘min’) and the standard deviation (‘sd’).
formatted <- preprocess(citrus::transactional_data,
                        numeric_operation_list = c('min', 'sd')
)
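As a rough picture of what aggregating with ‘min’ and ‘sd’ means, the sketch below applies each named function to the transaction values of each customer in base R. The output column names (transactionvalue_min, transactionvalue_sd) and the toy data are assumptions made for this example; the columns produced by preprocess may be named differently.

```r
# Toy transactions (made-up values)
tx <- data.frame(
  id = c("a", "a", "b", "b"),
  transactionvalue = c(10, 20, 5, 15)
)

# Group the numeric column by customer
values_by_id <- split(tx$transactionvalue, tx$id)

# Apply each aggregation function, looked up by name, to every customer
ops <- c("min", "sd")
agg <- data.frame(id = names(values_by_id))
for (op in ops) {
  agg[[paste0("transactionvalue_", op)]] <-
    vapply(values_by_id, match.fun(op), numeric(1), USE.NAMES = FALSE)
}
```

Note that sd returns NA for a customer with a single transaction, so such rows may need separate handling before modelling.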
The preprocess function checks whether the table contains any categorical columns and, if so, assigns each user the most common category in each of those columns.
formatted <- preprocess(citrus::transactional_data,
                        target = "transactionvalue"
)
#> Calculating target values
It is possible to pass a list of categorical columns to include in the final table. In this example the function uses the most common category for each user in the column ‘country’.
formatted <- preprocess(citrus::transactional_data,
                        categories = c("country"),
                        target = "transactionvalue"
)
#> Calculating target values
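Picking the most common category per user amounts to taking the mode of each categorical column within each id. The base R sketch below illustrates the idea. The helper most_common and the toy data are hypothetical, and the tie-breaking rule shown (first category in sorted order) is an assumption that may not match what preprocess does.

```r
# Hypothetical helper: returns the most frequent value of x.
# On ties, the first category in sorted order wins.
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy data (made-up values)
tx <- data.frame(
  id = c("a", "a", "a", "b", "b"),
  country = c("UK", "UK", "FR", "DE", "DE")
)

# One modal country per customer
modal <- tapply(tx$country, tx$id, most_common)
result <- data.frame(id = names(modal), country = as.vector(modal))
```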
To run a supervised model, specify the target variable and the aggregation function to apply to it. In this example, the target is the mean of the transactionvalue column.
formatted <- preprocess(citrus::transactional_data,
                        categories = c('country'),
                        numeric_operation_list = c('min', 'sd'),
                        target = 'transactionvalue',
                        target_agg = 'mean')
#> Calculating target values
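Conceptually, a mean target aggregation reduces the transactionvalue column to one number per customer. The sketch below shows this with base R's aggregate. The output column name target and the toy data are assumptions for illustration, not the names preprocess necessarily uses.

```r
# Toy transactions (made-up values)
tx <- data.frame(
  id = c("a", "a", "b", "b"),
  transactionvalue = c(10, 20, 5, 15)
)

# Mean transaction value per customer, used here as the supervised target
target <- aggregate(transactionvalue ~ id, data = tx, FUN = mean)
names(target)[2] <- "target"
```

Swapping FUN for another summary function, such as sum or max, gives the corresponding alternative target under the same assumptions.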