pybdt.ml module

Wrapper for the C++ backend.

class pybdt.ml.BDTLearner(feature_names=[], weight_name='', bg_weight_name='', _=None)

Bases: Learner

Train boosted decision trees.
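
A minimal training sketch. It assumes the DataSet initializer accepts a dict mapping feature names to NumPy arrays; the toy data and parameter values are illustrative only.

    import numpy as np
    from pybdt import ml

    # Toy data; the dict-of-arrays DataSet initializer layout is an assumption.
    rng = np.random.default_rng(0)
    sig = ml.DataSet(dict(x=rng.normal(+1.0, 1.0, 1000), y=rng.normal(+1.0, 1.0, 1000)))
    bg = ml.DataSet(dict(x=rng.normal(-1.0, 1.0, 1000), y=rng.normal(-1.0, 1.0, 1000)))

    learner = ml.BDTLearner(['x', 'y'])   # feature names
    learner.num_trees = 300               # number of trees to boost
    learner.beta = 0.5                    # AdaBoost scaling factor
    learner.frac_random_events = 1.0      # use every event for every tree

    model = learner.train(sig, bg)        # training returns a BDTModel
    scores = model.score(sig)             # numpy.ndarray of per-event scores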

add_after_pruner(pruner)

Add a Pruner for after boosting.

add_before_pruner(pruner)

Add a Pruner for before boosting.

property after_pruners

List of Pruners which are applied after boosting.

property before_pruners

List of Pruners which are applied before boosting.

property beta

The AdaBoost scaling factor.

clear_after_pruners()

Clear the set of Pruners used after boosting.

clear_before_pruners()

Clear the set of Pruners used before boosting.

property dtlearner

The DTLearner used to train individual trees.

property frac_random_events

The fraction of events to use for training each tree.

Set to 1.0 to use every event for every tree.

property num_trees

The number of individual decision trees to train.

property quiet

Whether to silence the training progress bar.

set_defaults()

Reset default BDTLearner (and internal DTLearner) properties.

property use_purity

Whether to use decision tree leaf purity information.

If this option is set to True, purity information will be used during training as described in J. Zhu, H. Zou, S. Rosset, T. Hastie, “Multi-class AdaBoost”, 2009.

class pybdt.ml.BDTModel(feature_names=[], dtmodels=[], alphas=[], _=None)

Bases: Model

Represent a boosted decision tree classifier.

property alphas

The alphas, or weights, for each decision tree.

property dtmodels

The DTModels that make up this BDTModel.

event_variable_importance(event, sep_weight=True, tree_weight=True)

Get a dictionary of variable importance values.

Parameters:
  • event (dict) – A mapping from variable names to float values.

  • sep_weight (bool) – Whether to weight nodes where a variable is used by separation gain achieved rather than weighting all nodes equally.

  • tree_weight (bool) – Whether to weight trees according to their performance on the training set rather than weighting all trees equally.

Returns:

dict with variable name keys and float values from 0 to 1.

get_subset_bdtmodel(n_i, n_f)

Get a BDTModel built from DTModels number n_i through n_f.

Parameters:
  • n_i (int) – The number of the first DTModel to include.

  • n_f (int) – The number of the last DTModel to include plus one.

The parameters of this method follow the indexing convention of Python’s built-in range(i, j).

get_subset_bdtmodel_list(dtmodel_indices)

Get a BDTModel using the DTModels with the given indices.

Parameters:

dtmodel_indices (list of int) – The numbers of the DTModels to include.
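
A short usage sketch for the two subset methods above; model is a hypothetical trained BDTModel.

    # First 100 trees, following the range(i, j) convention:
    sub_model = model.get_subset_bdtmodel(0, 100)

    # The same selection via an explicit list of tree indices:
    sub_model = model.get_subset_bdtmodel_list(list(range(100)))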

get_trimmed_bdtmodel(threshold)

Get a BDTModel using only DTModels that differ enough from the preceding one.

Parameters:

threshold (float) – The minimum percent change in alpha values of consecutive trees required in order to keep a given tree.

Warning

This method may not be useful, and should be considered experimental.

property n_dtmodels

The number of DTModels in this BDTModel.

variable_importance(sep_weight=True, tree_weight=True)

Get a dictionary of variable importance values.

Parameters:
  • sep_weight (bool) – Whether to weight nodes where a variable is used by separation gain achieved rather than weighting all nodes equally.

  • tree_weight (bool) – Whether to weight trees according to their performance on the training set rather than weighting all trees equally.

Returns:

dict with variable name keys and float values from 0 to 1.
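
A usage sketch, assuming model is a trained BDTModel; the feature names and values are illustrative.

    # Global importance over the whole model (dict of name -> value in [0, 1]):
    importance = model.variable_importance(sep_weight=True, tree_weight=True)
    for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
        print(f'{name}: {value:.3f}')

    # Importance of each variable for a single event:
    event = {'x': 0.7, 'y': -1.2}   # illustrative feature values
    event_importance = model.event_variable_importance(event)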

class pybdt.ml.CostComplexityPruner(strength=None, _=None)

Bases: Pruner

Prune trees by eliminating nodes with the worst information-added to complexity-added ratio.

static gain(node)

The weighted Gini separation gain of this node.

Parameters:

node (DTNode) – The node.

static rho(node)

The relative cost of pruning this node.

Parameters:

node (DTNode) – The node.

property strength

The pruning strength.

Once the pruning sequence is computed, this is the percentage (0-100) of the prune operations which are actually executed.
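
A sketch of attaching a cost-complexity pruner to a BDTLearner; the strength value is illustrative.

    from pybdt import ml

    learner = ml.BDTLearner(['x', 'y'])

    # Prune each tree after boosting; strength is the percentage (0-100)
    # of the computed prune operations that are actually executed.
    learner.add_after_pruner(ml.CostComplexityPruner(30.0))

    # Pruners can be removed again with learner.clear_after_pruners().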

class pybdt.ml.DTLearner(feature_names=[], weight_name='', bg_weight_name='', _=None)

Bases: Learner

Train single decision trees.

property linear_cuts

Space cuts linearly (default: True).

property max_depth

The maximum depth to which to train each individual tree.

property min_split

The minimum number of entries in a node which warrants further splitting.

property num_cuts

The number of cuts to try at each potential split.

property num_random_variables

The number of variables to consider using at each node.

Set to 0 to use every variable at every node.

property separation_type

The separation type to use (one of ‘cross_entropy’, ‘gini’, or ‘misclass_error’: default is ‘gini’).

Warning

As of this writing, only ‘gini’ is known to be well-tested.

set_defaults()

Reset default DTLearner properties.
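
A configuration sketch using the documented DTLearner properties, accessed through BDTLearner.dtlearner; the values shown are illustrative, not recommendations.

    from pybdt import ml

    learner = ml.BDTLearner(['x', 'y'])
    dtl = learner.dtlearner          # the DTLearner used for individual trees

    dtl.max_depth = 3                # maximum depth of each tree
    dtl.min_split = 50               # minimum entries needed to split a node
    dtl.num_cuts = 20                # cut values tried at each potential split
    dtl.num_random_variables = 0     # 0 = consider every variable at every node
    dtl.separation_type = 'gini'     # the well-tested separation type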

class pybdt.ml.DTModel(feature_names=[], root=None, _=None)

Bases: Model

Represent a decision tree.

event_variable_importance(event, sep_weight=True)

Get a dictionary of variable importance values.

Parameters:
  • event (dict) – A mapping from variable names to float values.

  • sep_weight (bool) – Whether to weight nodes where a variable is used by separation gain achieved rather than weighting all nodes equally.

Returns:

dict with variable name keys and float values from 0 to 1.

property root

Get the root DTNode.

variable_importance(sep_weight=True)

Get a dictionary of variable importance values.

Parameters:

sep_weight (bool) – Whether to weight nodes where a variable is used by separation gain achieved rather than weighting all nodes equally.

Returns:

dict with variable name keys and float values from 0 to 1.

class pybdt.ml.DTNode(w_sig, w_bg, n_sig, n_bg, sep_index, sep_gain=None, feature_id=None, feature_val=None, left=None, right=None, _=None)

Bases: object

Represent a node in a decision tree.

property feature_id

The id of the feature for this cut.

If this is a leaf, feature_id is +1 or -1 for signal or background, respectively.

property feature_name

The name of the feature for this cut.

property feature_val

The cut value for the feature specified by feature_id at this node.

property is_leaf

Whether this node is a leaf.

property left

The node for feature < feature_val.

property max_depth

The maximum depth of the tree below this node.

property n_bg

The number of training background events in this node.

property n_leaves

The number of leaves below (and including) this node.

property n_sig

The number of training signal events in this node.

property n_total

The number of training signal + background events in this node.

prune()

Prune the tree at this node.

After pruning, this node becomes a leaf. If its purity is greater than 50%, it is a signal leaf; otherwise it is a background leaf.

property purity

The purity of this node.

property right

The node for feature >= feature_val.

property sep_gain

The separation gain from this node.

property sep_index

The separation index at this node.

property tree_size

The size of the tree below (and including) this node.

property w_bg

The sum of background weight in this node.

property w_sig

The sum of signal weight in this node.

property w_total

The sum of signal and background weight in this node.
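
A recursive traversal sketch built on the DTNode properties above; dtmodel is a hypothetical DTModel.

    def print_tree(node, depth=0):
        """Print the cuts and leaves of a decision tree, one node per line."""
        indent = '  ' * depth
        if node.is_leaf:
            kind = 'signal' if node.feature_id > 0 else 'background'
            print(f'{indent}leaf: {kind} (purity = {node.purity:.2f})')
        else:
            print(f'{indent}cut: {node.feature_name} < {node.feature_val}')
            print_tree(node.left, depth + 1)    # feature <  feature_val branch
            print_tree(node.right, depth + 1)   # feature >= feature_val branch

    print_tree(dtmodel.root)   # dtmodel is a hypothetical DTModel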

class pybdt.ml.DataSet(data, subset='all', _=None)

Bases: object

A pybdt-friendly representation of a set of events.

eval(expr, names={})

Evaluate an expression in terms of variables in this DataSet.

Parameters:
  • expr (str) – The expression to evaluate.

  • names (dict) – Names to be passed into eval.

Returns:

numpy.ndarray — one element per event

When expr is evaluated, each variable stored in the dataset will be available. If the dataset has a livetime set, ‘livetime’ will also be available.

Other allowed identifiers are np (NumPy) and scipy, in addition to anything specified in the names parameter.

get_subset(idx)

Get a subset of this dataset.

Parameters:

idx (array of bools) – The subset of samples to keep.

Returns:

A pybdt.ml.DataSet containing only the selected events.
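
A sketch combining eval() and get_subset(); it assumes the DataSet initializer accepts a dict of NumPy arrays, and the variable names and cut are illustrative.

    import numpy as np
    from pybdt import ml

    rng = np.random.default_rng(1)
    ds = ml.DataSet(dict(energy=rng.exponential(10.0, 500),
                         zenith=rng.uniform(0.0, np.pi, 500)))

    # Every stored variable is available inside the expression, plus np and scipy.
    mask = ds.eval('(energy > 5) & (zenith < np.pi / 2)')

    # Keep only the events passing the cut.
    ds_cut = ds.get_subset(mask)
    print(ds_cut.n_events, ds_cut.names)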

property livetime

The livetime of this DataSet (or -1 if never specified).

property n_events

The number of rows in this DataSet.

property n_features

The number of columns in this DataSet.

property names

The names of the features stored by this DataSet.

to_dict()

Get a dictionary with all data from the DataSet.

class pybdt.ml.ErrorPruner(strength=None, _=None)

Bases: Pruner

Prune trees by eliminating the nodes which least improve the estimated error.

Warning

As of this writing, this pruning method is not yet well-tested and should be considered unsupported.

node_error(node)

The expected error of this node (affected by strength parameter).

Parameters:

node (DTNode) – The node.

property strength

The pruning strength.

Once the pruning sequence is computed, this is the percentage (0-100) of the prune operations which are actually executed.

subtree_error(node)

The expected error of the subtree below this node.

Parameters:

node (DTNode) – The node.

class pybdt.ml.Learner(_)

Bases: object

Train classification models.

train(signal_dataset, background_dataset)

Train using the given DataSets.

train_given_weights(signal_dataset, background_dataset, signal_weights, background_weights)

Train using the given DataSets and the given weights.

class pybdt.ml.Model(_)

Bases: object

Classify events based on some past training by a Learner.

Learners ultimately return Models upon training. The Model can be used to classify events using the score() methods.

property feature_names

The names of the event features used by this Model.

score(data, use_purity=False, quiet=False)

Obtain the score for a set of events.

Parameters:
  • data (DataSet or dict) – Either a DataSet or a DataSet initializer dict.

  • use_purity (bool) – Whether to use decision tree leaf purity information (as opposed to returning -1 or +1 for an individual decision tree).

  • quiet (bool) – Whether to suppress the progress bar.

This convenience method calls Model.score_DataSet(), Model.score_dict() or Model.score_event() as appropriate.

score_DataSet(ds, use_purity=False, quiet=False)

Obtain the score for a DataSet object.

Parameters:
  • ds (DataSet) – The dataset.

  • use_purity (bool) – Whether to use decision tree leaf purity information (as opposed to returning -1 or +1 for an individual decision tree).

  • quiet (bool) – Whether to suppress the progress bar.

Returns:

A numpy.ndarray of per-event scores.

score_dict(data, use_purity=False, quiet=False)

Obtain the score for a set of events.

Parameters:
  • data (dict) – A DataSet initializer dict.

  • use_purity (bool) – Whether to use decision tree leaf purity information (as opposed to returning -1 or +1 for an individual decision tree).

  • quiet (bool) – Whether to suppress the progress bar.

Returns:

A numpy.ndarray of per-event scores.

score_event(event, use_purity=False)

Obtain the score for a single event.

Parameters:
  • event (dict) – A mapping from variable names to float values.

  • use_purity (bool) – Whether to use decision tree leaf purity information (as opposed to returning -1 or +1 for an individual decision tree).

Returns:

A single float BDT score.
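
A scoring sketch, assuming model is a trained Model and sig is a DataSet (both hypothetical); the initializer dict is assumed to map feature names to sequences of values.

    # The score() convenience method dispatches on the input type:
    scores = model.score(sig)                                  # DataSet
    scores = model.score({'x': [0.1, 0.2], 'y': [1.0, -1.0]})  # initializer dict (assumed layout)
    single = model.score_event({'x': 0.1, 'y': 1.0})           # single float score

    # Purity-weighted scoring instead of hard -1/+1 per tree:
    soft_scores = model.score(sig, use_purity=True, quiet=True)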

class pybdt.ml.MultiModel1D(column, bins, bdts)

Bases: Model

A collection of BDTs, one for each bin along a single axis.

get_cut(cut_values)

Get a MultiModel1DCut for this MultiModel1D.

Parameters:

cut_values (array-like) – The per-bin cut values.

score_DataSet(ds)

Obtain the scores for a DataSet.

Parameters:

ds (DataSet) – The dataset.

score_dict(data)

Obtain the score for a set of events.

Parameters:

data (dict) – A DataSet initializer dict.

Returns:

A numpy.ndarray of per-event scores.

score_event(event)

Obtain the score for an event.

Parameters:

event (dict) – A mapping of BDT variable name -> value.

class pybdt.ml.MultiModel1DCut(multi_bdtmodel_1d, cut_values)

Bases: object

A cut which takes one value per MultiModel1D bin.

decision(thing, scores=None)

Return the cut decision for thing, possibly given scores.

Parameters:
  • thing (DataSet or dict, or bin column array) – If DataSet or dict, see Model.score(); otherwise, this is an array of values by which the MultiModel1D is binned.

  • scores (array-like) – The score or scores for the event or events (computed internally if not given).
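
A brief sketch, assuming mm is a MultiModel1D with three bins and ds is a DataSet (hypothetical names and values).

    cut = mm.get_cut([0.10, 0.15, 0.20])   # one cut value per bin (illustrative)
    passed = cut.decision(ds)              # scores are computed if not supplied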

class pybdt.ml.Pruner(_)

Bases: object

Prune DTModels.

prune(tree)

Prune a decision tree.

Parameters:

tree (DTModel) – The decision tree to prune.

class pybdt.ml.SameLeafPruner(_=None)

Bases: Pruner

Prune trees where adjacent leaves yield the same class.

class pybdt.ml.VineLearner(vine_feature, vine_feature_min, vine_feature_max, vine_feature_width, vine_feature_step, learner, _=None)

Bases: Learner

Train VineModels.

property learner

The underlying learner used for each vine bin.

property quiet

Whether to silence the training progress bar.

property vine_feature

The feature along which the vine is binned.

property vine_feature_max

The maximum value of the vine feature.

property vine_feature_min

The minimum value of the vine feature.

property vine_feature_step

The step size between consecutive vine bins.

property vine_feature_width

The width of each vine bin.
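
A construction sketch with an illustrative feature name and binning values; the per-bin learner is an ordinary BDTLearner.

    from pybdt import ml

    base = ml.BDTLearner(['x', 'y'])
    base.num_trees = 100

    vine = ml.VineLearner(
        'zenith',   # vine_feature: the feature along which to bin (illustrative)
        0.0,        # vine_feature_min
        3.2,        # vine_feature_max
        0.4,        # vine_feature_width
        0.2,        # vine_feature_step
        base,       # learner used for each vine bin
    )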

class pybdt.ml.VineModel(_)

Bases: Model

Represent a vine model trained by a VineLearner.

pybdt.ml.get_epsilon()

Get the global epsilon value.

Returns:

The value used to trim purity from [0, 1] to [epsilon, 1 - epsilon] when using purity for training or scoring.

pybdt.ml.set_epsilon(eps)

Set the global epsilon value.

Parameters:

eps (float) – The value to use to trim purity from [0, 1] to [epsilon, 1 - epsilon] when using purity for training or scoring.
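
A short sketch of inspecting and adjusting the global epsilon; the value shown is illustrative.

    from pybdt import ml

    print(ml.get_epsilon())   # current global epsilon
    ml.set_epsilon(1e-4)      # clips purity to [1e-4, 1 - 1e-4]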

pybdt.ml.unwrapped(py_object)

Get the underlying C++ instance from a pure-Python wrapper.

Parameters:

py_object (object) – A pure-Python pybdt class instance.

Returns:

A C++ pybdt class instance.

pybdt.ml.wrapped(cpp_object)

Get a pure-Python wrapped instance.

Parameters:

cpp_object (Boost.Python.instance) – A C++ pybdt class instance.

Returns:

A pure-Python class instance.