Validator Setup

Working with machine learning can be broadly divided into three phases: training, testing, and application. On this page we begin to discuss testing with pybdt.

Testing and application are related in that the performance of a classifier is first tested by applying it to event ensembles of known classes. Scores are found not only for the training samples but also for separate testing samples, which were set aside prior to training. We can confirm that overtraining has been sufficiently suppressed by comparing the performance on the training and testing samples. The overall quality of the classifier can be quantified in terms of the signal-to-noise ratio after applying a cut on the classifier output. If the training background sample was actually a background-dominated experimental dataset, we can also evaluate the data/MC agreement in the testing samples.

pybdt provides a class, pybdt.validate.Validator, for performing some of the most common classifier tests. The recommended usage is to write two scripts, or two subcommands (if it is your style to write scripts with subcommands): one for setting up the validator and another for using it to generate plots. This division of tasks is useful because per-event scores are calculated and stored during the setup step. Once this (relatively) expensive step is complete, the plotting step can be tuned to your preferred style.
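
A minimal sketch of this division might look like the following, assuming a classifier saved as bdt.bdt (a hypothetical filename) and that pybdt.util provides a load counterpart to the save function used at the end of this page; the setup and plotting calls themselves are covered below:

# setup_validator.py -- the (relatively) expensive step, run once
from pybdt.validate import Validator
from pybdt.util import save

v = Validator ('bdt.bdt')
# ... add DataSets and weightings as described below ...
save (v, 'sample.validator')

# plot_validator.py -- rerun freely while tuning plot styles
from pybdt.util import load

v = load ('sample.validator')
# ... generate plots from the stored per-event scores ...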

Validator Construction

Most of the possibilities for pybdt.validate.Validator setup are demonstrated in the example script pybdt/resources/examples/setup_sample_validator.py. Validator setup consists broadly of three steps: 1) construct a Validator for some classifier; 2) add DataSets to the Validator; 3) specify weighting schemes and their plotting styles.

In order to reduce data copying, the Validator includes a mostly transparent abstraction layer that allows the classifier and DataSets to be referenced by filename rather than stored directly as Validator member data. This approach is usually the most convenient, with the caveat that no straightforward interface is provided for updating the filename references if the classifier or DataSets move; if you must reorganize your file tree, it will probably be simplest to recreate the Validator from scratch.

If some bdt is already loaded in memory, then a Validator can be constructed as:

from pybdt.validate import Validator
v = Validator (bdt)

However, since the classifier is typically already stored on disk (say, with filename bdt_filename), the Validator can simply refer to this file:

v = Validator (bdt_filename)

Next we tell the Validator about the DataSets we are interested in, e.g.:

v.add_data ('bg', 'datasets/bg.ds', 'Background sim')
v.add_data ('train_sig', 'datasets/train_sig.ds', 'Training signal sim')
v.add_data ('train_data', 'datasets/train_bg.ds', 'Training data')
v.add_data ('test_sig', 'datasets/test_sig.ds', 'Testing signal sim')
v.add_data ('test_data', 'datasets/test_bg.ds', 'Testing data')

and so on. In each case, we provide a DataSet identifier, the file path, and a default label for plotting (more on this “default” later). Note that in a real analysis, it’s best to specify an absolute rather than a relative file path.
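
For example, the path can be made absolute at setup time using the standard library (this is plain Python, not a pybdt feature):

import os

v.add_data ('bg', os.path.abspath ('datasets/bg.ds'), 'Background sim')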

Each pybdt.validate.Validator.add_data() call causes the Validator to calculate and store per-event scores for later use.

DataSet weighting

Once the DataSets are added to the Validator, the allowed weightings can be specified. For example, in the ABC example, the signal training sample is weighted like so:

v.add_weighting ('weight', 'train_sig', color='cyan')

The first argument is the column of train_sig that contains the weights. The desired plotting style can be specified in keyword arguments (see pybdt.histlite.Style).

For an experimental DataSet, the weight is simply 1/livetime; this case can be handled as follows:

v.add_weighting ('livetime', 'train_data',
    line=False, markers=True, marker='.', color='.5', errorbars=True)

Note that the 'livetime' weighting uses the pybdt.ml.DataSet.livetime property to achieve the desired behavior, rather than reading a column of per-event weights.
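
In effect, every event in an experimental sample receives the same weight of 1/livetime. The following sketch illustrates the idea; loading the DataSet file with pybdt.util.load is an assumption here, while the attribute itself is the pybdt.ml.DataSet.livetime property:

from pybdt.util import load

ds = load ('datasets/train_bg.ds')
# each event in this experimental sample carries the same weight,
# converting event counts into rates:
weight_per_event = 1.0 / ds.livetime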

We often want to designate an experimental testing DataSet as “data” and one or more MC DataSets as “MC” so that data/MC ratio plots can later be generated automatically. The following weighting specifications from the ABC example make use of this functionality:

v.add_weighting ('livetime', 'test_data',
    line=False, markers=True, marker='.', color='black', errorbars=True,
    use_as_data=True)
v.add_weighting ('weight', 'test_sig', color='blue', add_to_mc=True)
v.add_weighting ('livetime', 'bg', color='purple', add_to_mc=True)
v.setup_total_mc (color='green')

In this case, test_data will be treated as the “data” sample, and the sum of test_sig and bg will be treated as the total MC. Finally, the total MC will be plotted as a green line.

In the ABC example, the classifier is trained to find a signal sample that is also present as a small fraction of the “background data”. This is analogous to training a classifier to identify atmospheric muon neutrinos in an IceCube dataset. However, it is possible to specify that the signal sample has some other weighting, e.g. an \(E^{-2}\) spectrum. Such a weighting can be added to the Validator as follows:

v.add_weighting ('weight_E2', 'test_sig', 'E2',
    color='red', linewidth=2)

Here, the weights are drawn from the test_sig column weight_E2. Note the third positional argument, 'E2': this is the identifier for this spectral weighting of this sample. When the third argument is left out, the identifier is automatically set to 'default'.
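
So the earlier test_sig weighting could equivalently have been registered with its identifier spelled out explicitly; the following call behaves the same as the two-positional-argument form used above:

v.add_weighting ('weight', 'test_sig', 'default',
    color='blue', add_to_mc=True)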

Once the Validator is configured, it can be saved for later usage and reusage, e.g.:

from pybdt.util import save
save (v, 'sample.validator')
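
The saved Validator can later be restored in the plotting script; assuming pybdt.util provides the matching load helper:

from pybdt.util import load

v = load ('sample.validator')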

For a real-world example, see the ml_score() and ml_validator() functions from the IC79 northern \(\nu_\mu\) analysis.