pybdt.validate module

Validation suite for PyBDT forests.

class pybdt.validate.Validator(bdt)

Bases: object

Test and validate BDTs.

class Proxy(validator, d, getfunc)

Bases: object

A class to map data set keys directly to some object, which may require an arbitrary extra dereferencing step.

add_data(key, data, label='', scores=True, pscores=False)

Add a data set to this Validator.

Parameters:
  • key (str) – The key for this data set.

  • data (str or DataSet) – The data set StorableObject initializer.

  • label (str) – A nice label for this data set in plots.

add_weighting(arg, key, wkey='default', label='', add_to_mc=False, use_as_data=False, **style_kwargs)

Add a weighting to a dataset.

Parameters:
  • arg (str, numpy.ndarray, or float) – The name of the weight column, the weights, or the livetime.

  • key (str) – The key for the desired data set.

  • wkey (str) – The key for this weighting of the data set.

  • label (str) – A nice label for this weighting.

  • add_to_mc (bool) – Whether to include this dataset weighting in “total monte carlo” calculations.

  • use_as_data (bool) – Whether to use this dataset weighting as the “data” sample in data/mc ratio calculations.

  • style_kwargs (dict) – Arguments to pass to the histlite.Style constructor.
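
The three accepted forms of arg can be illustrated with a small standalone sketch. It is not PyBDT's implementation: resolve_weights is a hypothetical helper, the data set is modelled as a plain dict of columns, and the livetime case assumes that each event receives the weight 1/livetime, which the description above does not state.

```python
def resolve_weights(arg, columns, n_events):
    """Return one weight per event (hypothetical helper, illustration only)."""
    if isinstance(arg, str):
        # arg names an existing weight column in the data set.
        return list(columns[arg])
    if isinstance(arg, (int, float)):
        # ASSUMPTION: a livetime converts event counts into rates.
        return [1.0 / arg] * n_events
    # Otherwise arg is taken to be the weights themselves.
    weights = list(arg)
    if len(weights) != n_events:
        raise ValueError("expected one weight per event")
    return weights


columns = {"w_sim": [0.5, 1.5, 2.0]}
by_name = resolve_weights("w_sim", columns, 3)
by_array = resolve_weights([1.0, 1.0, 1.0], columns, 3)
by_livetime = resolve_weights(4.0, columns, 3)
```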

property bdt

The BDTModel for this Validator.

clear_weightings()

Erase any stored weightings.

create_correlation_matrix_plot(set_spec, exprs=None, fignum=None, cut=None, eval_names={})

Create a correlation matrix plot.

Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • exprs (list) – List of expressions to put on the axes (default: self.bdt.feature_names)

  • fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.

Returns:

The new matplotlib.figure.Figure.

create_correlation_ratio_matrix_plot(set_spec1, set_spec2, exprs=None, fignum=None, clog=False, cut=None, eval_names={})

Create a correlation ratio matrix plot.

Parameters:
  • set_spec1 (str or tuple) – See Validator.get_key_wkey()

  • set_spec2 (str or tuple) – See Validator.get_key_wkey()

  • exprs (list) – List of expressions to put on the axes (default: self.bdt.feature_names)

  • fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.

  • clog (bool) – Whether to use a log color scale (absolute values of ratios will be shown)

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.

Returns:

The new matplotlib.figure.Figure.

create_overtrain_check_plot(sig_train_set_spec, sig_test_set_spec, bg_train_set_spec, bg_test_set_spec, legend={'loc': 'best'}, legend_side='right', fignum=None, expr='scores', **kwargs)

Create an overtraining check plot.

Parameters:
  • sig_train_set_spec (str or tuple) – The signal training set. (Each signal or background training or testing set is specified as with Validator.get_key_wkey())

  • sig_test_set_spec (str or tuple) – The signal testing set.

  • bg_train_set_spec (str or tuple) – The background training set.

  • bg_test_set_spec (str or tuple) – The background testing set.

  • legend (bool or dict) – If True or non-empty dict, draw a legend. If dict, use as keyword arguments for matplotlib.axes.Axes.legend.

  • legend_side (str) – Either ‘left’ or ‘right’; on which axes to draw the legend.

  • fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.

  • expr (str) – The expression to evaluate on each data set.

The following additional keyword arguments are allowed:

Parameters:
  • title (str) – The title of the plot.

  • xlabel (str) – The x axis label.

  • ylabel (str) – The y axis label.

  • left_ylabel (str) – The main y axis label.

  • right_ylabel (str) – The secondary y axis label.

  • margin_left (float) – Fraction of width to reserve as left margin.

  • margin_right (float) – Fraction of width to reserve as right margin.

  • margin_top (float) – Fraction of height to reserve as top margin.

  • margin_bottom (float) – Fraction of height to reserve as bottom margin.

  • bins (int) – Number of bins to use in histograms.

Returns:

A dict with string keys and values of the new matplotlib.figure.Figure and each set of axes used. Depending on the above arguments, some or all of the following keys will be available:

[‘fig’, ‘first_main_ax’, ‘twin_first_main_ax’, ‘first_ratio_ax’, ‘second_main_ax’, ‘twin_second_main_ax’, ‘second_ratio_ax’]

create_plot(expr, kind, left_set_specs, right_set_specs=[], fignum=None, **kwargs)

Create a BDT score distribution, rate plot, or efficiency plot.

Parameters:
  • expr (str) – An expression for Validator.eval() which returns a numerical array.

  • kind (str) – One of ‘dist’, ‘rate’ or ‘eff’.

  • left_set_specs (list) – What to plot on the main y axis (see Validator.get_key_wkey()).

  • right_set_specs (list) – What to plot on the secondary y axis (see Validator.get_key_wkey()).

  • fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.

A new figure is created using Validator.plot_variable(). The following keyword arguments can be used to create dual linear/log plots.

Parameters:
  • dual (bool) – Make a dual figure with a linear y scale on the left and a log y scale on the right.

  • data_mc (bool) – Include data/mc ratio plot(s).

  • linear_kwargs (dict) – If given, this dict of keyword arguments supersedes individually passed keyword arguments for the linear plot.

  • log_kwargs (dict) – If given, this dict of keyword arguments supersedes individually passed keyword arguments for the log plot.

The following keyword arguments determine the plot appearance.

Parameters:
  • title (str) – The title of the plot.

  • xlabel (str) – The x axis label (default: expr).

  • ylabel (str) – The y axis label.

  • left_ylabel (str) – The main y axis label.

  • right_ylabel (str) – The secondary y axis label.

  • data_mc_ylabel (str) – The data/mc ratio plot y axis label (default: “data/mc ratio”).

  • grid (bool) – Whether to include grids (default: True)

  • margin_left (float) – Fraction of width to reserve as left margin.

  • margin_right (float) – Fraction of width to reserve as right margin.

  • margin_top (float) – Fraction of height to reserve as top margin.

  • margin_bottom (float) – Fraction of height to reserve as bottom margin.

  • aspect (float) – Width / height ratio.

All other keyword arguments are passed through to Validator.plot_variable().

Returns:

A dict with string keys and values of the new matplotlib.figure.Figure and each set of axes used. Depending on the above arguments, some or all of the following keys will be available:

[‘fig’, ‘first_main_ax’, ‘twin_first_main_ax’, ‘first_dm_ax’, ‘second_main_ax’, ‘twin_second_main_ax’, ‘second_dm_ax’]

create_variable_pair_plot(set_spec, exprx, expry, bins=100, range=None, fignum=None, clog=False, cut=None, eval_names={})

Create a variable-variable 2D histogram.

Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • exprx (str) – Expression to put on the x axis

  • expry (str) – Expression to put on the y axis

  • bins (int) – The number of bins to create [default: 100].

  • range (tuple of tuples of floats) – If given, the x range and y range in the form ((xmin,xmax), (ymin,ymax))

  • fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.

  • clog (bool) – Whether to use a log color scale.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.

property data

Mapping of keys to DataSets.

eval(set_spec, expr, names={})

Evaluate an expression in terms of variables in a dataset.

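Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • expr (str) – The expression to evaluate.

  • names (dict) – Additional names to make available when evaluating expr.
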
Returns:

The result of the expression evaluation.

When expr is evaluated, each variable stored in the dataset will be available. The variables ‘scores’, ‘pscores’ and ‘weights’ will also be available. If the dataset has a livetime set, ‘livetime’ will also be available.

Other allowed identifiers are np (NumPy) and scipy, in addition to anything specified in the names parameter.

This method is implemented in terms of DataSet.eval().
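
The name lookup described above can be sketched without PyBDT. The following is an illustration only, not the DataSet.eval() implementation: eval_in_dataset is a hypothetical function, the data set is a plain dict of lists, and math stands in for np and scipy. Letting entries in names override data set variables is an assumption.

```python
import math


def eval_in_dataset(columns, expr, names=None):
    """Evaluate expr with data set variables and extra names in scope.

    Hypothetical stand-in for illustration; only pass trusted expressions,
    because Python's eval() runs arbitrary code.
    """
    namespace = {"math": math}
    namespace.update(columns)        # 'scores', 'weights', 'livetime', ...
    namespace.update(names or {})    # ASSUMPTION: user names win on conflict
    # A single dict is used as globals so that names stay visible inside
    # comprehensions and generator expressions.
    return eval(expr, namespace)


columns = {
    "scores": [-0.4, 0.1, 0.7],
    "weights": [1.0, 2.0, 0.5],
    "livetime": 10.0,
}
rate = eval_in_dataset(columns, "sum(weights) / livetime")
passing = eval_in_dataset(
    columns, "[s > threshold for s in scores]", names={"threshold": 0.0}
)
```

The second call has the shape of a cut expression: one bool per event, where True means "include this event".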

property full_label

Mapping of (key,wkey) to full labels.

get_Hist(set_spec, expr, bins=100, range=None, normed=False, cut=None, eval_names={})

Get a histlite.Hist for a variable for a given data set and weighting.

Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • expr (str) – An expression for Validator.eval() which returns a numerical array.

  • bins (int) – The number of bins to create [default: 100].

  • range (2-tuple) – The range over which to make the histogram [default: (min value, max value) found in all included data sets].

  • normed (bool) – Whether to normalize the y axis histograms.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval() for the main expression or the cut expression.

Returns:

An instance of histlite.Hist.

get_clone(bdt=None)

Construct a copy of this Validator.

Parameters:

bdt (str or BDTModel) – The BDT model instance StorableObject initializer.

get_correlation(set_spec, expr1, expr2, cut=None, eval_names={})

Get the correlation between two variables for a data set.

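Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • expr1 (str) – An expression for Validator.eval() giving the first variable.

  • expr2 (str) – An expression for Validator.eval() giving the second variable.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval().
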
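
For orientation, a weighted Pearson correlation between two variables can be computed as in the sketch below. Whether get_correlation() uses exactly this estimator, and whether it applies the event weights, is an assumption; weighted_correlation is a hypothetical name and the inputs are plain lists.

```python
import math


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation coefficient of x and y (illustration only)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 2.0, 2.0, 1.0]
r_same = weighted_correlation(x, [2.0 * xi + 1.0 for xi in x], w)
r_opposite = weighted_correlation(x, [-xi for xi in x], w)
```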
get_key_wkey(set_spec)

Get the key and weighting key for a given data set spec.

Parameters:

set_spec (str or tuple) – (key,wkey), or just key (in which case, wkey==’default’ is assumed)

Returns:

A (key,wkey) tuple.
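
The normalization described here is small enough to sketch in full. key_wkey below is a hypothetical standalone function that mirrors the documented behavior; it is not copied from PyBDT.

```python
def key_wkey(set_spec):
    """Return (key, wkey), filling in 'default' when only a key is given."""
    if isinstance(set_spec, str):
        return (set_spec, "default")
    key, wkey = set_spec
    return (key, wkey)


bare = key_wkey("signal")
explicit = key_wkey(("signal", "E-2"))
```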

get_kolmogorov_smirnov_probability(set_spec_1, set_spec_2, expr='scores', bins=1000)

Calculate the Kolmogorov-Smirnov p value for two distributions using kolmogorov_smirnov_probability().

Parameters:
  • set_spec_1 (str or tuple) – The first data and weighting.

  • set_spec_2 (str or tuple) – The second data and weighting.

  • expr (str) – An expression for Validator.eval() which returns a numerical array.

  • bins (int) – The number of bins to use.

Returns:

The Kolmogorov-Smirnov p value.
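
A binned, weighted two-sample Kolmogorov-Smirnov test can be sketched as follows. This is not the kolmogorov_smirnov_probability() routine itself. It assumes positive weights, cumulative distributions built on shared bins, an effective sample size of (sum w)^2 / sum w^2 per sample, and the standard asymptotic series for the p value; all function names are hypothetical.

```python
import math


def binned_cdf(values, weights, lo, hi, bins):
    """Weighted cumulative distribution of values on equal-width bins."""
    counts = [0.0] * bins
    width = (hi - lo) / bins
    for v, w in zip(values, weights):
        counts[min(int((v - lo) / width), bins - 1)] += w
    total = sum(counts)
    cdf, running = [], 0.0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


def effective_n(weights):
    """Effective number of events for a weighted sample."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def ks_probability(v1, w1, v2, w2, bins=1000):
    """Asymptotic two-sample KS p value from binned, weighted samples."""
    lo, hi = min(min(v1), min(v2)), max(max(v1), max(v2))
    if hi == lo:
        return 1.0
    c1 = binned_cdf(v1, w1, lo, hi, bins)
    c2 = binned_cdf(v2, w2, lo, hi, bins)
    d = max(abs(a - b) for a, b in zip(c1, c2))
    n1, n2 = effective_n(w1), effective_n(w2)
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        # The series converges slowly here and its value is 1 to high precision.
        return 1.0
    total = 0.0
    for j in range(1, 101):
        term = 2.0 * (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(max(total, 0.0), 1.0)


a = [i / 100 for i in range(100)]
b = [0.5 + i / 200 for i in range(100)]
ones = [1.0] * 100
p_same = ks_probability(a, ones, a, ones)
p_diff = ks_probability(a, ones, b, ones)
```

A p value near 1 for a training/testing pair suggests the two score distributions are compatible; a very small value points to a difference, such as overtraining.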

get_range(set_specs, expr, cut=None, eval_names={})

Get the range of values of a variable (after transform) for the given datasets and weightings.

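Parameters:
  • set_specs (list) – The data sets and weightings to include (see Validator.get_key_wkey()).

  • expr (str) – An expression for Validator.eval() which returns a numerical array.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval().
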
Returns:

A (min_val, max_val) tuple.
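
A common range across several data sets is what lets their histograms share bin edges. The sketch below shows the idea on plain lists; shared_range is a hypothetical function, and skipping empty inputs is an assumption rather than documented behavior.

```python
def shared_range(value_lists):
    """Return (min_val, max_val) over all non-empty value lists."""
    non_empty = [values for values in value_lists if values]
    lo = min(min(values) for values in non_empty)
    hi = max(max(values) for values in non_empty)
    return (lo, hi)


train_scores = [-0.8, 0.1, 0.4]
test_scores = [-0.3, 0.9]
common = shared_range([train_scores, test_scores, []])
```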

get_values_weights(set_spec, expr, cut=None, eval_names={})

Evaluate an expression, and get weights and scores.

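Parameters:
  • set_spec (str or tuple) – See Validator.get_key_wkey()

  • expr (str) – An expression for Validator.eval().

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval().
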
Returns:

A ([expression result], weights, scores) tuple
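
When a cut is given, the same event selection has to be applied to the expression result, the weights and the scores so that the three stay aligned. A minimal sketch on plain lists follows; select is a hypothetical helper, not part of PyBDT.

```python
def select(mask, values, weights, scores):
    """Keep only the events whose mask entry is True, in all three columns."""
    keep = [i for i, flag in enumerate(mask) if flag]
    return (
        [values[i] for i in keep],
        [weights[i] for i in keep],
        [scores[i] for i in keep],
    )


mask = [True, False, True]
values, weights, scores = select(
    mask, [10.0, 20.0, 30.0], [1.0, 2.0, 3.0], [-0.2, 0.5, 0.9]
)
```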

property label

Mapping of keys to labels.

load_all_data(dbg=False)

Load all data from disk into RAM.

plot_variable(axes, expr, kind, left_set_specs, right_set_specs=[], twin_axes=None, data_mc=False, **kwargs)

Create a BDT score distribution, rate plot, or efficiency plot.

Parameters:
  • axes (matplotlib.axes.Axes) – The Axes on which to draw the plot.

  • expr (str) – An expression for Validator.eval() which returns a numerical array.

  • kind (str) – One of ‘dist’, ‘rate’ or ‘eff’.

  • left_set_specs (list) – What to plot on the main y axis (see Validator.get_key_wkey()).

  • right_set_specs (list) – What to plot on the secondary y axis (see Validator.get_key_wkey()).

  • twin_axes (matplotlib.axes.Axes) – The secondary-y axes, if already created with axes.twinx().

  • data_mc (bool) – Plot ratio of given curves to total_mc.

If ‘total_mc’ is included in either left_set_specs or right_set_specs, then a total monte carlo line will be added.

The following additional kwargs are allowed.

Parameters:
  • legend (bool or dict) – If True or non-empty dict, draw a legend. If dict, use as keyword arguments for matplotlib.axes.Axes.legend.

  • cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.

  • eval_names (dict) – Names to be passed to Validator.eval().

  • log (bool) – Whether to use a log-y scale.

  • normed (bool) – Whether to normalize the y axis histograms.

  • left_log (bool) – Whether to use a log-y scale on the main y axis.

  • left_normed (bool) – Whether to normalize the main y axis histograms.

  • right_log (bool) – Whether to use a log-y scale on the secondary y axis.

  • right_normed (bool) – Whether to normalize the secondary y axis histograms.

  • dbg (bool) – Whether to print debugging/logging information while plotting.
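
The total Monte Carlo line and the data/mc ratio can be pictured as bin-by-bin operations on weighted histograms. The sketch below uses plain lists instead of histlite objects. The function names are hypothetical, and the choice to report None for bins with no Monte Carlo content is an assumption.

```python
def histogram(values, weights, lo, hi, bins):
    """Weighted histogram with equal-width bins on [lo, hi]."""
    counts = [0.0] * bins
    width = (hi - lo) / bins
    for v, w in zip(values, weights):
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), bins - 1)] += w
    return counts


def total_mc(histograms):
    """Sum the histograms of every weighting flagged for total Monte Carlo."""
    return [sum(column) for column in zip(*histograms)]


def data_mc_ratio(data, mc):
    """Bin-by-bin data/mc ratio; None where the Monte Carlo bin is empty."""
    return [d / m if m > 0 else None for d, m in zip(data, mc)]


mc_a = histogram([0.2, 0.7], [1.0, 1.0], 0.0, 1.0, 2)
mc_b = histogram([0.1, 0.3], [0.5, 0.5], 0.0, 1.0, 2)
data = histogram([0.1, 0.2, 0.4, 0.9], [1.0, 1.0, 1.0, 1.0], 0.0, 1.0, 2)
mc_sum = total_mc([mc_a, mc_b])
ratio = data_mc_ratio(data, mc_sum)
```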

property pscores

Mapping of keys to purity-based score arrays.

property scores

Mapping of keys to score arrays.

setup_total_mc(label='Total MC', **style_kwargs)

Set up total monte carlo plotting properties.

Parameters:

label (str) – The label for total monte carlo lines.

Any additional arguments are passed to the histlite.Style constructor.

property style

A mapping of set_specs to histlite.Style objects.

property weight_label

Mapping of keys to mappings of weight keys to weight labels.

property weights

Mapping of keys to mappings of weight keys to weight arrays.