Using Custom Cross-Validation Plans with vtreat

March 9, 2020

Nina Zumel, John Mount March 2020

These are notes on controlling the cross-validation plan in the Python version of vtreat. For notes on the R version of vtreat, please see here.

By default, Python vtreat uses a y-stratified randomized k-way cross-validation when creating and evaluating complex synthetic variables. This will work well for the majority of applications. However, there may be times when you need a more specialized cross-validation scheme for your modeling projects. In this document, we'll show how to replace the cross-validation scheme in vtreat.

import pandas
import numpy
import numpy.random

import vtreat
import vtreat.cross_plan

Example: Highly Unbalanced Class Outcomes

As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:

n_row = 1000

numpy.random.seed(2020)

d = pandas.DataFrame({
    'x': numpy.random.normal(size=n_row),
    'y': numpy.random.binomial(size=n_row, p=0.05, n=1)
})

d.describe()
                 x            y
count  1000.000000  1000.000000
mean     -0.033441     0.054000
std       0.974859     0.226131
min      -2.870341     0.000000
25%      -0.693371     0.000000
50%      -0.033076     0.000000
75%       0.593758     0.000000
max       3.099762     1.000000

First, try preparing this data using vtreat.

Here we start with the default k-way y-stratified cross-validation plan.

#
# create the treatment plan
#

k = 5 # number of cross-val folds (actually, the default)
treatment_stratified = vtreat.BinomialOutcomeTreatment(
    var_list=['x'],
    outcome_name='y',
    outcome_target=1,
    params=vtreat.vtreat_parameters({
        'cross_validation_k': k,
        'retain_cross_plan': True,
    })
)

# prepare the training data
prepared_stratified = treatment_stratified.fit_transform(d, d['y'])

Let's look at the distribution of the target outcome in each of the cross-validation groups:

# convenience function to mark the cross-validation group of each row
def label_rows(df, cross_plan, *, label_column = 'group'):
    df[label_column] = 0
    for i in range(len(cross_plan)):
        app = cross_plan[i]['app']
        df.loc[app, label_column] = i
            
# label the rows            
label_rows(prepared_stratified, treatment_stratified.cross_plan_)
# print(prepared_stratified.head())

# get some summary statistics on the data
stratified_summary = prepared_stratified.groupby(['group']).agg({'y': ['sum', 'mean', 'count']})
stratified_summary.columns = stratified_summary.columns.get_level_values(1)
stratified_summary
group  sum   mean  count
0       10  0.050    200
1       11  0.055    200
2       12  0.060    200
3        8  0.040    200
4       13  0.065    200
# standard deviation of target prevalence per cross-val fold
std_stratified = numpy.std(stratified_summary['mean'])
std_stratified 
0.008602325267042627
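
To make the idea of y-stratification concrete, here is a minimal sketch of one way to build such a plan. It is an illustration only, not vtreat's actual implementation, so its fold counts will not match the table above. The function name stratified_k_way_plan is hypothetical, and the sketch uses only the Python standard library. It sorts the rows by outcome (breaking ties randomly) and then deals them out to the folds in turn, so each fold receives a nearly equal share of each outcome value.

```python
import random


def stratified_k_way_plan(y, k_folds, *, seed=None):
    """Hypothetical helper: k-way plan with near-equal outcome counts per fold."""
    rng = random.Random(seed)
    n_rows = len(y)
    # sort rows by outcome, breaking ties randomly, so equal outcomes are adjacent
    tie_break = [rng.random() for _ in range(n_rows)]
    order = sorted(range(n_rows), key=lambda i: (y[i], tie_break[i]))
    # deal the sorted rows out round-robin across the folds
    grp = [0] * n_rows
    for pos, i in enumerate(order):
        grp[i] = pos % k_folds
    return [
        {
            "train": [i for i in range(n_rows) if grp[i] != j],
            "app": [i for i in range(n_rows) if grp[i] == j],
        }
        for j in range(k_folds)
    ]


# 1000 rows with 54 positives, the same counts as the example data
y = [1] * 54 + [0] * 946
plan = stratified_k_way_plan(y, 5, seed=2020)
positives_per_fold = [sum(y[i] for i in split["app"]) for split in plan]
fold_sizes = [len(split["app"]) for split in plan]
```

With 54 positives and 5 folds, this sketch places 10 or 11 positives in every fold, which is as even as the counts allow.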

Explicitly Controlling the Sampler

A user-chosen cross-validation plan generator can be passed in as follows. To retain the plan for later inspection, also set the 'retain_cross_plan' parameter to True. The passed-in object should be an instance of a class derived from vtreat.cross_plan.CrossValidationPlan.

class KWayCrossPlan(vtreat.cross_plan.CrossValidationPlan):
    """K-way cross validation plan"""

    def __init__(self):
        vtreat.cross_plan.CrossValidationPlan.__init__(self)

    # create a custom cross-plan generator
    # noinspection PyMethodMayBeStatic
    def _k_way_cross_plan(self, n_rows, k_folds):
        """randomly split range(n_rows) into k_folds disjoint groups"""
        # first assign groups modulo k (ensuring at least one in each group)
        grp = [i % k_folds for i in range(n_rows)]
        # now shuffle
        numpy.random.shuffle(grp)
        plan = [
            {
                "train": [i for i in range(n_rows) if grp[i] != j],
                "app": [i for i in range(n_rows) if grp[i] == j],
            }
            for j in range(k_folds)
        ]
        return plan
    
    def split_plan(self, *, n_rows=None, k_folds=None, data=None, y=None):
        if n_rows is None:
            raise ValueError("n_rows must not be None")
        if k_folds is None:
            raise ValueError("k_folds must not be None")
        return self._k_way_cross_plan(n_rows=n_rows, k_folds=k_folds)


# create the treatment plan
treatment_unstratified = vtreat.BinomialOutcomeTreatment(
    var_list=['x'],
    outcome_name='y',
    outcome_target=1,
    params=vtreat.vtreat_parameters({
        'cross_validation_plan': KWayCrossPlan(),
        'cross_validation_k': k,
        'retain_cross_plan': True,
    })
)

# prepare the training data
prepared_unstratified = treatment_unstratified.fit_transform(d, d['y'])
# get some summary statistics on the data
label_rows(prepared_unstratified, treatment_unstratified.cross_plan_)
unstratified_summary = prepared_unstratified.groupby(['group']).agg({'y': ['sum', 'mean', 'count']})
unstratified_summary.columns = unstratified_summary.columns.get_level_values(1)
unstratified_summary
group  sum   mean  count
0        8  0.040    200
1        9  0.045    200
2       15  0.075    200
3        8  0.040    200
4       14  0.070    200
# standard deviation of target prevalence per cross-val fold
std_unstratified = numpy.std(unstratified_summary['mean'])
std_unstratified 
0.015297058540778355

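Notice that the between-group standard deviation of the target prevalence is about 78% larger in the unstratified sampling plan than in the stratified sampling plan (equivalently, the between-group variance is about 3.2 times as large).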

std_unstratified/std_stratified
1.7782469350914576
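
The ratio can be checked by hand from the per-fold prevalences printed in the two summary tables. The sketch below copies those values and uses the standard library's statistics.pstdev, which computes the population standard deviation, the same quantity numpy.std returns by default (ddof=0). The variable names are ours.

```python
import statistics

# per-fold target prevalences copied from the two summary tables above
stratified_means = [0.050, 0.055, 0.060, 0.040, 0.065]
unstratified_means = [0.040, 0.045, 0.075, 0.040, 0.070]

# population standard deviation, matching numpy.std's default (ddof=0)
sd_stratified = statistics.pstdev(stratified_means)
sd_unstratified = statistics.pstdev(unstratified_means)

ratio_sd = sd_unstratified / sd_stratified
ratio_var = ratio_sd ** 2

print(round(sd_stratified, 4), round(sd_unstratified, 4))  # 0.0086 0.0153
print(round(ratio_sd, 2), round(ratio_var, 2))             # 1.78 3.16
```

Note that the ratio of about 1.78 is a ratio of standard deviations; the corresponding ratio of variances is its square, about 3.16.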

Other cross-validation schemes

If you want to cross-validate under another scheme (for example, stratifying on the prevalence of an input class), you can write your own custom cross-validation scheme and pass it into vtreat in the same fashion as above. Your cross-validation scheme must extend vtreat's CrossValidationPlan class.
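
As a sketch of what such a scheme could look like, the class below stratifies the folds on the levels of one input column, following the pattern of KWayCrossPlan above. The class name InputStratifiedCrossPlan and its column and seed arguments are hypothetical. We also assume that vtreat passes the input frame as the data argument of split_plan, as the signature used above suggests. The import fallback is only there so the sketch can be run without vtreat installed.

```python
import random
from collections import defaultdict

try:
    # the base class named in the text above
    from vtreat.cross_plan import CrossValidationPlan
except ImportError:
    # fallback so this sketch runs without vtreat installed
    CrossValidationPlan = object


class InputStratifiedCrossPlan(CrossValidationPlan):
    """Hypothetical plan: stratify folds on the levels of one input column."""

    def __init__(self, column, *, seed=None):
        CrossValidationPlan.__init__(self)
        self.column = column
        self.seed = seed

    def split_plan(self, *, n_rows=None, k_folds=None, data=None, y=None):
        if k_folds is None:
            raise ValueError("k_folds must not be None")
        if data is None:
            raise ValueError("data must not be None")
        # assumption: data[column] yields one label per row (dict or DataFrame)
        labels = list(data[self.column])
        rng = random.Random(self.seed)
        rows_by_level = defaultdict(list)
        for i, level in enumerate(labels):
            rows_by_level[level].append(i)
        grp = [0] * len(labels)
        offset = 0
        for level in sorted(rows_by_level):
            rows = rows_by_level[level]
            rng.shuffle(rows)
            # deal this level's rows out round-robin, continuing where
            # the previous level stopped so fold sizes stay balanced
            for pos, i in enumerate(rows):
                grp[i] = (offset + pos) % k_folds
            offset += len(rows)
        return [
            {
                "train": [i for i in range(len(labels)) if grp[i] != j],
                "app": [i for i in range(len(labels)) if grp[i] == j],
            }
            for j in range(k_folds)
        ]


# small stand-alone check with a plain dict in place of a data frame
data = {"region": ["a"] * 60 + ["b"] * 30 + ["c"] * 10}
plan = InputStratifiedCrossPlan("region", seed=1).split_plan(
    n_rows=100, k_folds=5, data=data
)
counts = [
    {lvl: sum(data["region"][i] == lvl for i in s["app"]) for lvl in "abc"}
    for s in plan
]
```

An instance of such a class would be passed through the 'cross_validation_plan' parameter, just as KWayCrossPlan() was above.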

Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
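
As a rough illustration of reusing a plan downstream, the sketch below takes a plan in the same list-of-dicts form used above (each fold has a "train" and an "app" list of row indices, as in the retained cross_plan_) and produces out-of-fold predictions. The helper names check_plan and out_of_fold_predictions are hypothetical, and the "model" is just the training-set mean of y, standing in for whatever model you would actually fit.

```python
def check_plan(plan, n_rows):
    """Raise if folds overlap or the app sets do not cover each row exactly once."""
    seen = []
    for split in plan:
        if set(split["train"]) & set(split["app"]):
            raise ValueError("train and app rows overlap within a fold")
        seen.extend(split["app"])
    if sorted(seen) != list(range(n_rows)):
        raise ValueError("app sets must cover each row exactly once")


def out_of_fold_predictions(plan, n_rows, fit, predict):
    """Fit on each fold's train rows and predict only that fold's app rows."""
    preds = [None] * n_rows
    for split in plan:
        model = fit(split["train"])
        for i, p in zip(split["app"], predict(model, split["app"])):
            preds[i] = p
    return preds


y = [0, 0, 1, 0, 1, 0]
plan = [
    {"train": [2, 3, 4, 5], "app": [0, 1]},
    {"train": [0, 1, 4, 5], "app": [2, 3]},
    {"train": [0, 1, 2, 3], "app": [4, 5]},
]
check_plan(plan, len(y))

# stand-in model: predict the mean of y over the training rows
preds = out_of_fold_predictions(
    plan,
    len(y),
    fit=lambda rows: sum(y[i] for i in rows) / len(rows),
    predict=lambda model, rows: [model] * len(rows),
)
print(preds)  # [0.5, 0.5, 0.25, 0.25, 0.25, 0.25]
```

Because every prediction for a row comes from a model that never saw that row, using the same plan for variable design and for modeling keeps the two steps' hold-out sets aligned.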

Other predefined cross-validation schemes

In addition to the y-stratified cross-validation, vtreat also defines a time-oriented cross-validation scheme (OrderedCrossPlan). The ordered cross plan treats time as the grouping variable. For each fold, all the rows in the application set (the rows that the model will be applied to) come from the same time period. All the rows in the training set come from one side of the application set; that is, all the training data will be either earlier or later than the data in the application set. Ordered cross plans are useful when modeling time-oriented data.
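
The following sketch shows the general shape of such a plan. It is not vtreat's OrderedCrossPlan implementation; the function name ordered_plan_sketch is hypothetical, and the rule for choosing the training side (use whichever side has more rows) is an assumption made for this illustration. It also assumes the time values are distinct, so that no time period straddles a fold boundary.

```python
def ordered_plan_sketch(times, k_folds):
    """Hypothetical helper: contiguous-in-time app sets, one-sided training sets."""
    n_rows = len(times)
    # row indices in time order
    order = sorted(range(n_rows), key=lambda i: times[i])
    # cut the time-ordered rows into k contiguous blocks of near-equal size
    bounds = [(j * n_rows) // k_folds for j in range(k_folds + 1)]
    plan = []
    for j in range(k_folds):
        start, stop = bounds[j], bounds[j + 1]
        earlier, later = order[:start], order[stop:]
        # assumption of this sketch: train on whichever side has more rows
        train = earlier if len(earlier) >= len(later) else later
        plan.append({"train": sorted(train), "app": sorted(order[start:stop])})
    return plan


times = [5, 1, 4, 2, 3, 0]
plan = ordered_plan_sketch(times, 3)
for split in plan:
    print(split)
# {'train': [0, 2, 3, 4], 'app': [1, 5]}
# {'train': [1, 5], 'app': [3, 4]}
# {'train': [1, 3, 4, 5], 'app': [0, 2]}
```

In each fold, every training row is either strictly earlier or strictly later in time than all of the application rows.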

Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as those seen in vtreat). We have some notes on this here.