.. _man_random_intro:

Introduction to decision tree randomization
===========================================

Decision tree classifiers are powerful because they find the best cut
for each region of a many-dimensional parameter space.  BDT
classifiers are an improvement over single decision trees because
they can provide good classification for events in the tails of the
variable distributions without becoming overtrained on fluctuations
in those distributions.

A more recent innovation is to generate so-called *Random Forests*.
Like BDT classifiers, these classifiers use a forest of decision
trees to provide a score for events.  However, rather than using
boosting to differentiate the individual trees, an element of
randomness is introduced.  Typically, one uses either boosting with
no randomization, or randomization with no boosting (a boost strength
of 0).  There is no technical reason, however, why these techniques
cannot be combined, so the implementation in pybdt allows the user to
use both if desired.

pybdt provides the following two types of randomization.

cut variable randomization
    The user provides an integer ``num_random_variables``, which must
    be less than the total number of variables being used.  During
    training, *at each node*, only ``num_random_variables`` randomly
    selected variables are considered for choosing a cut.

training event randomization
    The user provides a fraction ``frac_random_events`` between 0.0
    and 1.0.  During training, *for each tree*, only a
    ``frac_random_events`` fraction of the full training sample is
    used.

In the ABC example, training event randomization is used to reduce
overtraining on the training sample.  By using different events to
train each tree, we avoid tuning to fluctuations in the training
sample.
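
Both schemes are straightforward to express in code.  The following
is a minimal, self-contained sketch in plain numpy, not pybdt's
actual implementation: for brevity each "tree" is a single-cut stump,
and the function names (``train_forest``, ``train_stump``,
``best_cut``) and the toy dataset are illustrative.  Only the
parameters ``num_random_variables`` and ``frac_random_events`` come
from the description above::

    import numpy as np

    rng = np.random.default_rng(seed=1)


    def best_cut(X, y, candidate_vars):
        """Find the (variable, threshold) among candidate_vars that best
        separates the two classes, by misclassification count."""
        best = None
        for j in candidate_vars:
            for t in np.unique(X[:, j]):
                pred = X[:, j] > t
                # take the better of the two cut orientations
                err = min(np.mean(pred != y), np.mean(pred == y))
                if best is None or err < best[0]:
                    best = (err, j, t)
        return best  # (error, variable index, threshold)


    def train_stump(X, y, num_random_variables):
        # cut variable randomization: at each node (here, the single
        # node of a stump) only a random subset of the variables is
        # eligible for the cut
        n_vars = X.shape[1]
        candidates = rng.choice(n_vars, size=num_random_variables,
                                replace=False)
        return best_cut(X, y, candidates)


    def train_forest(X, y, n_trees, frac_random_events,
                     num_random_variables):
        n_events = X.shape[0]
        n_used = int(round(frac_random_events * n_events))
        forest = []
        for _ in range(n_trees):
            # training event randomization: each tree trains on its
            # own random fraction of the full sample
            idx = rng.choice(n_events, size=n_used, replace=False)
            forest.append(train_stump(X[idx], y[idx],
                                      num_random_variables))
        return forest


    # toy data: 200 events, 5 variables; the label depends on variable 2
    X = rng.normal(size=(200, 5))
    y = X[:, 2] > 0.3
    forest = train_forest(X, y, n_trees=10,
                          frac_random_events=0.5,
                          num_random_variables=2)
    for err, var, thresh in forest:
        print(f"cut on variable {var} at {thresh:+.2f} "
              f"(training error {err:.2f})")

In a full-depth tree, a fresh random subset of variables would be
drawn at *every* node, not just once per tree as in the single-node
stumps above.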