The validate function checks whether the input data adheres to the format expected for modelling. Both the unsupervised and the supervised methods require a customer-level data frame with one row per customer.
This verification step is particularly important if the user skips the preprocessing step and passes their own preprocessed table.
Data Validation Example 1
The data frame must have a column called id.
invalid_df <- formatted %>%
  select(-id)
validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters):
#>
#> Columns missing: id
Data Validation Example 2
The data frame must have one observation per customer. Here the first row is appended five more times, so its id is no longer unique.
invalid_df <- formatted %>%
  rbind(formatted[rep(1, 5), ])
validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters):
#>
#> ID observations are not unique. nrow(df) > n_distinct(df$id).
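When this error appears, it can help to see which ids are repeated before deciding how to collapse them. The following is a minimal sketch using dplyr, not part of the package itself; toy_df is a made-up stand-in for a table such as formatted, and keeping the first row per id is only one possible way to deduplicate.

```r
library(dplyr)

# Toy data frame standing in for a customer-level table; the values are made up.
toy_df <- data.frame(
  id      = c(1, 2, 2, 3),
  feature = c(0.5, 1.2, 1.2, 3.4)
)

# List the ids that occur more than once, with their row counts.
duplicated_ids <- toy_df %>%
  count(id) %>%
  filter(n > 1)

# One possible fix (an assumption, not the package's behaviour):
# keep only the first row found for each id.
deduplicated <- toy_df %>%
  distinct(id, .keep_all = TRUE)
```

Whether dropping rows is appropriate depends on the data; aggregating the duplicated rows into one may be the better choice when they carry different information.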
Data Validation Example 3
If the supervised model is selected, a response column is required. The name of the response column can be arbitrary but must be specified in the dependent_variable hyperparameter. If it is not specified, the validate function searches for a column called response.
invalid_df <- formatted %>%
  select(-response)
validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters):
#>
#> Columns missing: response
Data Validation Example 4
If the id and response columns exist but there are no feature columns to use as predictors, an error is raised.
invalid_df <- formatted %>%
  select(id, response)
validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters)
#> Error in validate(invalid_df, supervised = TRUE, hyperparameters = hyperparameters):
#>
#> The dataframe does not contain any feature columns.
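The four checks illustrated above can be summarised in a short base R sketch. This is a hypothetical re-implementation for illustration only: validate_sketch, its dependent_variable argument (the real function takes a hyperparameters object instead), and the order of the checks are all assumptions, and the package's actual validate function may work differently.

```r
# Hypothetical sketch of the checks described above; not the package's code.
validate_sketch <- function(df, supervised = TRUE, dependent_variable = "response") {
  # Required columns: id always, plus the response column when supervised.
  required <- if (supervised) c("id", dependent_variable) else "id"
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns missing: ", paste(missing_cols, collapse = ", "))
  }

  # One row per customer: the number of rows must equal the number of distinct ids.
  if (nrow(df) > length(unique(df$id))) {
    stop("ID observations are not unique.")
  }

  # At least one column must remain besides the required ones.
  if (length(setdiff(names(df), required)) == 0) {
    stop("The dataframe does not contain any feature columns.")
  }

  invisible(df)
}

# Made-up example data.
toy_df <- data.frame(id = 1:3, response = c(0, 1, 0), feature = c(2.5, 1.0, 4.2))

validate_sketch(toy_df)                               # passes silently
try(validate_sketch(toy_df[, c("id", "response")]))   # fails: no feature columns
```

Returning the data frame invisibly lets such a check sit in the middle of a pipeline without printing anything when the input is valid.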