Classifier Application

Once a classifier has been trained and validated to the your liking, you are ready to use it to classify events of otherwise unknown class. This can be done for individual samples or for whole ensembles at a time. Here we go over methods of classifier application provided by pybdt.

Scoring with IceTray

For IceCube data, it may be convenient to store a classifier score as an I3Double in each Physics frame. This can be done relatively simply with icecube.pybdtmodule.PyBDTModule. For the ABC example, you can calculate and store the scores like so:

from icecube.icetray import I3Tray
from icecube import dataio
from icecube.pybdtmodule import PyBDTModule

tray = I3Tray()
[...]

def varsfunc (frame):
    a = frame['A'].value
    b = frame['B'].value
    c = frame['C'].value
    out = dict (a=a, b=b, c=c)
    return out

tray.AddModule (PyBDTModule, 'bdt',
    BDTFilename='path/to/sample.bdt',
    varsfunc=varsfunc,
    OutputName='Score'
)
[...]

It is up to the user to provide varsfunc(), which extracts the relevant features from the frame for use by icecube.pybdtmodule.PyBDTModule. An alternative implemmentation would inspect each frame directly for I3Double’s corresponding to each required feature, but that approach would require unnecessary pollution of the frames when features are nested deep in frame objects. Consider possible features zenith = SplineMPE.dir.zenith or plogl = SPEFitFitParams.logl / (SPEFitFitParams.ndof + 1.5). The use of varsfunc() simplifies scoring when training features are derived from, but not exactly corresponding to, variables produced by the rest of the processing chain.

Scoring without IceTray

In other contexts, it is useful to calculate scores outside of the IceTray framework. This can be done for individual events or ensembles at a time. An individual event can be scored like so:

score = bdt.score_event ({'a': 2.7, 'b': -0.31, 'c': 116})

The dict argument can of course be calculated by any method so long as the keys match the features used by the classifier and the values are real numbers.

Suppose one has an ensemble of events with numpy arrays a, b and c holding one value per event. Then the per-event scores can be calculated like so:

scores = bdt.score_dict (dict (a=a, b=b, c=c))

If a pybdt.ml.DataSet object data is already available, the per-event scores can be calculated like so:

scores = bdt.score_DataSet (data)

In these last two exaples, the resulting scores are a numpy array with dtype=float.

A convienience method pybdt.ml.BDTModel.score() is provided which automatically calls the correct one of the above methods, given the input. Thus the following calls all work as expected:

score = bdt.score ({'a': 2.7, 'b': -0.31, 'c': 116})
scores = bdt.score (dict (a=a, b=b, c=c))
scores = bdt.score (data)

Finally, note that the leaf purity \(p\) can be used to scale the contribution of each tree in the model (see Introduction to decision tree classifiers). This is especially common for forests in which the trees are differentiated only by randomization but not by boosting. To enable this type of scoring, change the above calls to:

score = bdt.score ({'a': 2.7, 'b': -0.31, 'c': 116}, use_purity=True)
scores = bdt.score (dict (a=a, b=b, c=c), use_purity=True)
scores = bdt.score (data, use_purity=True)