Data Validation • citrus

The validate function checks whether the input data adheres to the format expected for modelling. Both unsupervised and supervised methods require a customer-level data frame with one row per customer.
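As a minimal sketch of the expected shape, the table below has one row per customer, an id column, a response column and two feature columns. The feature names recency and spend are made up for illustration and are not part of citrus.

```r
# Hypothetical customer-level table: one row per customer.
# Column names other than id and response are illustrative assumptions.
customers <- data.frame(
  id       = c("c001", "c002", "c003"),
  response = c(0, 1, 0),
  recency  = c(12, 3, 40),
  spend    = c(150.5, 890.0, 42.25),
  stringsAsFactors = FALSE
)

# One row per customer: the row count equals the number of distinct ids.
one_row_per_customer <- nrow(customers) == length(unique(customers$id))
```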

This verification step is particularly important if the user skips the preprocessing step and passes their own preprocessed table.

output <- segment(citrus::preprocessed_data, 
                  modeltype = 'tree',
                  steps = c('model'),
                  prettify = TRUE,
                  print_plot = TRUE)

Data Validation Example 1

The data frame must have a column called id.

invalid_df <- formatted %>% 
  select(-id)

validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters): 
#> 
#> Columns missing: id

Data Validation Example 2

The data frame must have exactly one observation (row) per customer.

invalid_df <- formatted %>% 
  rbind(formatted[rep(1, 5), ])

validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters): 
#> 
#> ID observations are not unique. nrow(df) > n_distinct(df$id).
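Duplicate ids can be looked for before calling validate. The sketch below uses only base R on a small made-up table; it illustrates the idea and does not reproduce how validate is implemented.

```r
# Made-up table in which customer "a" appears twice.
df <- data.frame(
  id    = c("a", "b", "a", "c"),
  spend = c(10, 20, 10, 30),
  stringsAsFactors = FALSE
)

# Ids that occur more than once.
dup_ids <- unique(df$id[duplicated(df$id)])

# Keep the first row seen for each id so that each customer appears once.
deduped <- df[!duplicated(df$id), ]
```

Whether dropping the repeated rows is appropriate depends on the data; repeated ids often mean the table still needs to be aggregated to customer level.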

Data Validation Example 3

If the supervised model is selected, a response column is required. The response column can have any name, but that name must be specified in the dependent_variable hyperparameter. If it is not specified, the validate function looks for a column called response.

invalid_df <- formatted %>% 
  select(-response)

validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters): 
#> 
#> Columns missing: response
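The fallback to a column called response can be pictured as follows. This is an illustrative base R sketch built around a hypothetical helper, resolve_response_column; it assumes hyperparameters is a named list and is not the citrus implementation.

```r
# Hypothetical helper: pick the response column name, defaulting to "response".
resolve_response_column <- function(hyperparameters = list()) {
  name <- hyperparameters$dependent_variable
  if (is.null(name)) "response" else name
}

# Made-up table whose response column is called "churned".
df <- data.frame(id = c("a", "b"), churned = c(1, 0), spend = c(10, 20))

# With the hyperparameter set, the named column is found in the table.
target <- resolve_response_column(list(dependent_variable = "churned"))
target_found <- target %in% names(df)

# Without it, the default name "response" is used, which this table lacks.
default_found <- resolve_response_column() %in% names(df)
```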

Data Validation Example 4

If the id and response columns exist but there are no feature columns to predict from, an error is raised.

invalid_df <- formatted %>% 
  select(id, response)

validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters): 
#> 
#> The dataframe does not contain any feature columns.
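One way to think about this check is that the feature columns are whatever remains once id and the response column are set aside. The base R sketch below uses a made-up table and assumes the response column has the default name response; it is an illustration, not the code used by validate.

```r
# Made-up table containing only the id and response columns.
df <- data.frame(id = c("a", "b"), response = c(1, 0))

# Columns left over once id and response are set aside.
feature_cols <- setdiff(names(df), c("id", "response"))

# No columns remain, so there is nothing to predict from.
no_features <- length(feature_cols) == 0
```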