.. _man_training: DataSets and Training ===================== Working with machine learning can be broadly divided into 3 phases: training, testing, and application. On this page we discuss training with pybdt. DataSet objects --------------- pybdt uses :class:`pybdt.ml.DataSet` objects to access your data. Generating a DataSet object is very straightforward. The data should be placed into `numpy `_ arrays with one element per event. Then, you use a python dict to construct the DataSet. For example,:: # first, load variables a, b, and c, and event weights, from # wherever they are stored from pybdt import ml sim_data = ml.DataSet (dict (a=a, b=b, c=c, weight=weight)) Data columns can be accessed using the ``__getitem__`` operator, i.e.:: a = sim_data['a'] weight = sim_data['weight'] It can be convenient to save a livetime for a DataSet, which may be passed in as a single ``float`` like so: :: ... livetime = 3.15e7 data = ml.DataSet (dict (a=a, b=b, c=c, livetime=livetime)) The livetime may be read or changed using the :attr:`pybdt.ml.DataSet.livetime` property. Training -------- The simplest way to train is to use the included ``pybdt/resources/scripts/train.py`` script. Let's have a look at the help output. :: Usage: train.py {options} [comma-sep'd variables] \ [training signal] [training background] [bdt filename] Options: -h, --help show this help message and exit Decision Tree options: -d N, --depth=N make trees N levels deep -L, --nonlinear-cuts use nonlinear cut spacing -c N, --num-cuts=N try N cuts per var per node -s N, --min-split=N do not split if a node contains fewer than N events -p STRENGTH, --prune-strength=STRENGTH use STRENGTH prune strength -v N, --num-random-variables=N use N randomly selected variables at each node (0 to use every var at every node) Forest options: -t N, --num-trees=N use N trees -b BETA, --beta=BETA use BETA boost parameter -e FRAC, --frac-random-events=FRAC use FRAC randomly selected fraction of events in each tree (1, the default, to use every event in every tree) Data options: --sig-weight=COLNAME the name of the variable in which the signal weights are stored --bg-weight=COLNAME the name of the variable in which the bg weights are stored An example of the usage of this script can be found in pybdt/resources/examples/train_sample_bdt.sh. For a real-world example, see the the `ml_train() `_ function from the IC79 northern :math:`\nu_\mu` analysis.