Introduction to decision tree randomization

Decision tree classifiers are powerful because they find the best cut for each region in a many-dimensional parameter space. Boosted decision tree (BDT) classifiers improve on single decision trees because they can provide good classification for events in the tails of the variable distributions without becoming overtrained on fluctuations in those distributions.

A more recent innovation is the so-called random forest. Like a BDT, a random forest uses a forest of decision trees to provide a score for events; however, rather than using boosting to differentiate the individual trees, it introduces an element of randomness.

Typically, one uses either boosting with no randomization, or randomization with no boosting (a boost strength of 0). However, there is no technical reason why the two techniques cannot be combined, so the implementation in pybdt lets the user apply both if desired (a sketch of such a hybrid appears at the end of this section).

pybdt supports the following two types of randomization.

cut variable randomization

The user provides an integer num_random_variables, which must be less than the total number of variables in use. During training, at each node, a random subset of num_random_variables variables is drawn, and only these variables are considered when choosing the cut.
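
The per-node selection amounts to drawing a random subset of variable indices before the cut search. The following is a minimal NumPy sketch of the idea, not pybdt's internal code; the function name choose_cut_variables is invented for illustration:

    import numpy as np

    # Illustrative sketch only; choose_cut_variables is not a pybdt function.
    def choose_cut_variables(n_variables, num_random_variables, rng):
        """Pick the variables a single node is allowed to cut on."""
        assert num_random_variables < n_variables
        return rng.choice(n_variables, size=num_random_variables, replace=False)

    rng = np.random.default_rng(42)
    for node in range(3):
        # A fresh subset is drawn at every node of every tree.
        print(choose_cut_variables(10, 3, rng))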

training event randomization

The user provides a fraction frac_random_events between 0.0 and 1.0. During training, each tree is trained on a different randomly chosen subset containing that fraction of the full training sample.
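
Per-tree event selection can be sketched in the same way; again this is an illustration rather than pybdt's code, and sample_training_events is an invented name:

    import numpy as np

    # Illustrative sketch only; sample_training_events is not a pybdt function.
    def sample_training_events(n_events, frac_random_events, rng):
        """Pick the subset of events used to train one tree."""
        n_keep = int(round(frac_random_events * n_events))
        return rng.choice(n_events, size=n_keep, replace=False)

    rng = np.random.default_rng(7)
    idx = sample_training_events(1000, 0.5, rng)  # 500 of 1000 events, without replacement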

In the ABC example, training event randomization is used to reduce overtraining. Because each tree is trained on a different subset of events, the forest as a whole avoids tuning to fluctuations in the training sample.
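
To make the combination of boosting and randomization concrete, here is a rough sketch written with scikit-learn rather than pybdt. The function name, the AdaBoost-style weight update, and the boost_strength scaling are assumptions made for illustration and are not pybdt's actual internals; scikit-learn's max_features option happens to implement per-node cut variable randomization directly:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_randomized_boosted_forest(X, y, n_trees=100, boost_strength=1.0,
                                        num_random_variables=3,
                                        frac_random_events=0.5, seed=0):
        """AdaBoost-style loop combining both kinds of randomization (sketch)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        w = np.full(n, 1.0 / n)                      # per-event boosting weights
        trees, alphas = [], []
        for _ in range(n_trees):
            # Training event randomization: each tree sees its own subset.
            idx = rng.choice(n, size=int(frac_random_events * n), replace=False)
            # max_features gives per-node cut variable randomization.
            tree = DecisionTreeClassifier(max_depth=3,
                                          max_features=num_random_variables,
                                          random_state=int(rng.integers(2**31)))
            tree.fit(X[idx], y[idx], sample_weight=w[idx])
            miss = tree.predict(X) != y              # evaluate on the full sample
            err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1.0 - 1e-10)
            alpha = boost_strength * np.log((1.0 - err) / err)
            w = w * np.exp(alpha * miss)             # boost misclassified events
            w /= w.sum()
            trees.append(tree)
            alphas.append(alpha)
        return trees, alphas

Setting boost_strength to 0 leaves the event weights untouched, recovering the pure-randomization (random forest) limit; setting frac_random_events to 1.0 and num_random_variables to the full variable count recovers an ordinary BDT.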