Validator Usage

Once a pybdt.validate.Validator is configured, it is easy to generate most of the classification-related plots needed for an analysis. On this page, we go over these plotting capabilities.

Summary Plots

pybdt.validate.Validator includes a powerful plotting method, pybdt.validate.Validator.create_plot(), which can be used for generating a summary of the overall classifier performance. Typically the first plot of interest is the score distribution for the testing and training samples. In the ABC example, this is done like so (slightly edited for brevity):

lines = ['test_data', 'test_sig', 'bg', 'total_mc']
objs = v.create_plot ('scores', 'dist',
    lines,
    bins=100, range=(-1, .5),
    dual=True,
    data_mc=True,
    xlabel='BDT score',
    left_ylabel='Hz per bin',
    title='BDT score distributions',
    linear_kwargs=dict (
        legend=dict (loc='best')
    ),
)
objs['second_main_ax'].set_ylim (ymin=1e-6)
objs['fig'].savefig ('output/dist_vs_bdt.png')

Let’s unpack this example. First, we specify the datasets we want to plot as lines. Then we create the plot. The first argument is an expression that can be evaluated by pybdt.validate.Validator.eval() – in this case, the per-event classifier scores. The second argument is either 'dist' (a simple histogram), 'rate' (a histogram summed cumulatively-to-the-left), or 'eff' (like 'rate', but divided by the leftmost point to give an efficiency curve). The bins and range arguments specify the histogram properties. dual=True means that a two-panel plot will be produced with a linear vertical scale on the left and a log vertical scale on the right (by default, a single linear-vertical panel is produced; if log=True is set, a single log-vertical panel is produced instead). data_mc=True means that data/MC ratio plots will be included under the main panels. xlabel specifies the horizontal axis label for both panels. left_ylabel specifies the left-vertical axis label for both main panels (we’ll revisit the right-vertical axis soon). title gives an overall figure title. linear_kwargs is used to pass extra arguments specifically to the left (linear-vertical) panel – in this case, a legend location.
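For instance, a single-panel log-scale version of the same plot might look like the following sketch (reusing the Validator v and the lines list from above; the title is illustrative):

objs_log = v.create_plot ('scores', 'dist',
    lines,
    bins=100, range=(-1, .5),
    log=True,                # single panel with a log vertical scale
    xlabel='BDT score',
    left_ylabel='Hz per bin',
    title='BDT score distributions',
)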

The returned value, objs, is a dict with the following keys:

['fig', 'first_main_ax', 'second_main_ax', 'first_dm_ax', 'second_dm_ax']

The first is the matplotlib figure itself. The next two are the main matplotlib axes. The last two are the data/MC ratio plot matplotlib axes. In the example, we use these to tweak the vertical axis range and then to save the figure, but any matplotlib customizations are available here.
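For instance, further customization might look like the following sketch (the cut value marked here is purely hypothetical):

for ax in (objs['first_main_ax'], objs['second_main_ax']):
    ax.axvline (0.2, color='k', linestyle='--')  # mark a candidate cut value
objs['first_dm_ax'].set_ylim (0.5, 1.5)          # zoom in on the data/MC ratio
objs['fig'].savefig ('output/dist_vs_bdt_annotated.png')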

In the example script pybdt/resources/examples/validate_sample_bdt.py, the BDT score distribution is plotted this way; the overall event rate and the cut efficiency as a function of BDT score cut are also plotted in a similar way, changing little more than the second argument to create_plot().
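A sketch of those companion plots (the labels and titles here are illustrative, not taken verbatim from the example script):

objs = v.create_plot ('scores', 'rate',
    lines,
    bins=100, range=(-1, .5),
    dual=True,
    xlabel='BDT score cut',
    left_ylabel='Hz',
    title='rate vs BDT score cut',
)
objs['fig'].savefig ('output/rate_vs_bdt.png')

objs = v.create_plot ('scores', 'eff',
    lines,
    bins=100, range=(-1, .5),
    dual=True,
    xlabel='BDT score cut',
    left_ylabel='efficiency',
    title='efficiency vs BDT score cut',
)
objs['fig'].savefig ('output/eff_vs_bdt.png')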

Overtraining Check

The simplest way to check for overtraining is to compare the classifier performance for the samples used for training against independent testing samples. In the ABC example, this plot is made like so:

objs = v.create_overtrain_check_plot (
        'train_sig', 'test_sig',
        'train_data', 'test_data',
        left_ylabel='relative abundance (background)',
        right_ylabel='relative abundance (signal)',
        legend_side='left',
        legend=dict (loc='upper left'),
        )

Here, the first two arguments are the training and testing signal sample specifications. The next two are the training and testing background sample specifications. Then the left and right vertical axis labels are given. Finally, the last two keyword arguments place the legend on the left (linear) panel, in its upper left corner.
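Assuming create_overtrain_check_plot() returns a dict of figure objects with the same 'fig' key convention as create_plot() (an assumption here; check the example script), the figure can be saved the same way:

objs['fig'].savefig ('output/overtrain_check.png')  # assumes the same 'fig' key as create_plot()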

The resulting plot shows the training and testing, signal and background distributions (four distributions total) on a linear vertical scale (left panel) and a log vertical scale (right panel). It also shows the testing / training ratio below these main panels. Finally, the legend shows the Kolmogorov-Smirnov p-value obtained when the testing and training distributions are compared. A small p-value suggests that the distributions differ significantly.
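For reference, the p-value in the legend corresponds to a standard two-sample Kolmogorov-Smirnov test. Here is a standalone sketch using scipy and synthetic stand-in scores (pybdt computes this internally; scipy appears here only for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng (0)
train_scores = rng.normal (0.0, 0.3, size=10000)  # stand-ins for per-event BDT scores
test_scores = rng.normal (0.0, 0.3, size=10000)
ks_stat, p_value = stats.ks_2samp (train_scores, test_scores)
print ('KS p-value: {0:.3f}'.format (p_value))  # a small p-value suggests overtraining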

A more rigorous overtraining test would repeat this process for multiple (possibly overlapping) testing/training dataset splits. However, for IceCube, we are typically satisfied if performance is consistent for a single testing/training split.

Other Plotting Features

In Dataset weighting, we discussed the possibility of configuring multiple weightings for a single ensemble of events (as is commonly used for, e.g., IceCube neutrino simulation). To facilitate the use of alternative weightings, there are two ways to specify dataset+weighting combinations for plotting. In the ABC example, only the simplest is needed: give the dataset identifier, and the 'default' weighting will be used. If a non-default weighting is desired, it can be specified as a tuple: (dataset, weighting). Here is an example from a real analysis:

x = v.create_plot ('scores', 'dist',
    ['nugen', 'corsika', 'total_mc', 'test_exp', ],
    [('test_nugen_wr', 'E2')],
    bins=bins,
    range=(-1,1),
    dual=True,
    xlabel='BDT score',
    left_ylabel='Hz per bin',
    right_ylabel='relative abundance (signal)',
    data_mc=True,
    log_kwargs=dict (legend=dict (loc='lower left')),
    title='BDT Score Distribution',
    )

The salient feature here is the fourth positional argument to create_plot(); it gives one or more dataset+weighting specifications, or set_specs – in this case, a testing sample of well-reconstructed neutrino-generator events weighted to an \(E^{-2}\) spectrum. Additional dataset specifications in this argument are plotted against the right-vertical axis. In this analysis, the overall normalization of the training and testing signal samples was arbitrary, so we use the right_ylabel keyword argument to give an appropriate right-vertical axis label.

pybdt.validate.Validator.create_plot() can also be used to plot distributions of other variables. For example, in the ABC example, the distribution of the variable a after a BDT score cut of score > cut_level can be obtained like so:

name = 'a'       # variable to plot; doubles as the axis label below
cut_level = 0.2  # hypothetical value; set to your chosen BDT score cut
objs = v.create_plot (name, 'dist',
    lines,
    bins=50,
    dual=True,
    data_mc=True,
    left_ylabel='Hz per bin',
    xlabel=name,
    log_kwargs=dict (
        legend=dict (loc='best')
        ),
    title='{0} | bdt score > {1:.3f}'.format (name, cut_level),
    cut='scores > {0}'.format (cut_level)
    )

Here, the cut argument is used to specify a cut that should be applied to every dataset prior to generating the plot. This mechanism allows the creation of similar plots for several or all parameters at multiple cut levels with limited code repetition.
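A sketch of that pattern, reusing v and lines from above (the variable names come from the ABC example; the cut levels and output filenames are hypothetical):

for name in ('a', 'b', 'c'):            # input variables in the ABC example
    for cut_level in (0.0, 0.1, 0.2):   # hypothetical BDT score cut levels
        objs = v.create_plot (name, 'dist',
            lines,
            bins=50, dual=True, data_mc=True,
            xlabel=name,
            left_ylabel='Hz per bin',
            title='{0} | bdt score > {1:.3f}'.format (name, cut_level),
            cut='scores > {0}'.format (cut_level),
        )
        objs['fig'].savefig (
            'output/dist_{0}_cut_{1:.3f}.png'.format (name, cut_level))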

Finally, the Validator can produce variable correlation matrix plots. For example, a color-coded correlation matrix plot for the training signal sample can be obtained simply with:

fig = v.create_correlation_matrix_plot ('train_sig')
fig.savefig ('output/correlation_matrix-signal.png')
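An analogous call should work for the background sample (assuming 'train_data' is the configured background identifier, as in the overtraining example above):

fig = v.create_correlation_matrix_plot ('train_data')
fig.savefig ('output/correlation_matrix-background.png')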

See pybdt/resources/examples/validate_sample_bdt.py for more example code; see pybdt.validate.Validator for other Validator capabilities. For a long, but possibly instructive, real-world example, see the ml_plot() function from the IC79 northern \(\nu_\mu\) analysis.