.. _man_inspection:

Tree and Forest Inspection
==========================

One of the design goals of pybdt is to make it relatively straightforward for
users to inspect the classifier.  On this page we discuss some features that
make this easy.

Direct Access
-------------

The most basic way to learn about a trained classifier is to check its
properties in `IPython <http://ipython.org/>`_.  If a :class:`BDTModel` is
loaded as ``bdt``, then the number of constituent :class:`DTModel` s is
``bdt.n_dtmodels`` (an ``int``).  The per-tree relative weights are given by
``bdt.alphas`` (a numpy array with dtype ``float``).  The :class:`DTModel` s
themselves are stored in ``bdt.dtmodels`` (a numpy array with dtype
``object``).  The cut parameters used by the classifier are
``bdt.feature_names`` (a list of ``string`` s).

Now, suppose we store the first tree as ``dt = bdt.dtmodels[0]``.  The root
:class:`DTNode` is ``node = dt.root``; you can jump from a node to its
children with :attr:`DTNode.left` and :attr:`DTNode.right` (until you reach a
leaf node, for which both children are ``None``).  Split nodes apply a cut of
the form ``node.feature_name`` < ``node.feature_val``; passing events descend
left and failing events descend right.  Other details are also available --
see the :class:`DTNode` reference.

Visualization
-------------

pybdt includes a module :mod:`pybdt.viz` for easily visualizing trees.  For a
text-only printout, we can say::

    from pybdt import viz
    print viz.dtmodel_to_text (dt)

For a graphical visualization, we can instead say::

    pic = viz.dtmodel_to_graphviz (dt)
    pic.write_png (filename + '.png')
    pic.write_pdf (filename + '.pdf')

and so forth -- many ``write_*`` methods are provided by the pydot interface
to graphviz, which must be installed for this feature to work.  On Ubuntu,
the interface is provided by the ``python-pydot`` package.

Variable Importance
-------------------

The features described above make it reasonably easy to dig into the details
of individual trees, and the plotting facilities (:doc:`man_validator_usage`)
offer a number of options for quantifying the overall classifier performance.
However, one question remains to be answered: to what extent does each
feature participate in the classifier?  This can be addressed with the
following calls::

    print viz.variable_importance_to_text (bdt.variable_importance (True, True))    #1
    print viz.variable_importance_to_text (bdt.variable_importance (False, False))  #2
    print viz.variable_importance_to_text (bdt.variable_importance (True, False))   #3
    print viz.variable_importance_to_text (bdt.variable_importance (False, True))   #4

:meth:`pybdt.ml.BDTModel.variable_importance` produces a ``dict`` with string
keys (the variable names) and floating point values (the relative importance,
between 0 and 1).  :func:`pybdt.viz.variable_importance_to_text` converts
this ``dict`` to a string containing an easily-readable table, sorted by
descending variable importance.

``variable_importance()`` measures the importance of variables by counting
the split nodes that use each feature.  The first argument tells whether the
count should be weighted by the separation gain (``node.sep_gain``) achieved
by each split.  The second argument tells whether the count should be
weighted by the overall weight (``alpha``) of the tree in which each split
occurs.
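To make the unweighted count concrete, here is a rough sketch that tallies
splits per feature by walking the trees directly, using the :class:`DTNode`
attributes described under Direct Access above.  This is purely illustrative
-- it is not how pybdt computes the importance internally, and the
normalization used here (dividing by the total number of splits) is an
assumption rather than documented behavior::

    def count_splits (node, counts):
        """Tally the feature used at each split node (illustrative only)."""
        if node.left is None and node.right is None:
            return  # leaf node: no cut is applied here
        counts[node.feature_name] = counts.get (node.feature_name, 0) + 1
        count_splits (node.left, counts)
        count_splits (node.right, counts)

    counts = {}
    for dt in bdt.dtmodels:
        count_splits (dt.root, counts)

    # assumed normalization: fraction of all splits using each feature
    total = float (sum (counts.values ()))
    for name in sorted (counts, key=counts.get, reverse=True):
        print ('%-12s : %.6f' % (name, counts[name] / total))

Weighting each split by ``node.sep_gain``, and/or each tree by the
corresponding entry of ``bdt.alphas``, corresponds to enabling the first and
second arguments, respectively.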
Here is the variable importance for the classifier in the :ref:`ABC example `
-- first with both weightings enabled, and then with both disabled::

    print viz.variable_importance_to_text (bdt.variable_importance (True, True))

      1. b : 0.556785
      2. c : 0.259359
      3. a : 0.183856

    print viz.variable_importance_to_text (bdt.variable_importance (False, False))

      1. c : 0.387212
      2. b : 0.349272
      3. a : 0.263517

In general, enabling weighting causes splits in earlier trees and splits
closer to the roots of trees to be counted more strongly, relative to when
weighting is disabled.  Thus the weighted variable importance can be
interpreted as a measure that is biased in favor of those variables
responsible for the classification of the bulk of events.  The unweighted
variable importance is, by comparison, more fair to variables used in later
trees or closer to leaf nodes; it can be interpreted as a measure of which
variables are responsible for classifying the trickiest events.

It is less clear how the ``(True, False)`` and ``(False, True)`` variable
importance weightings should be interpreted, but the former will have more of
a bias towards the roots of trees and the latter will have more of a bias
towards earlier trees in the forest.
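To examine this difference for a particular classifier, it can be convenient
to place the weighted and unweighted values side by side.  The following
sketch works directly with the plain ``dict`` returned by
``variable_importance()``; the column formatting is arbitrary and not part of
pybdt::

    weighted = bdt.variable_importance (True, True)
    unweighted = bdt.variable_importance (False, False)

    # sort by the weighted ranking; both dicts share the same keys
    print ('%-12s %12s %12s' % ('variable', 'weighted', 'unweighted'))
    for name in sorted (weighted, key=weighted.get, reverse=True):
        print ('%-12s %12.6f %12.6f' % (name, weighted[name], unweighted[name]))

Variables that rank higher in the weighted column are doing more of their
work near the roots of early trees, while variables that gain rank in the
unweighted column contribute relatively more deep in the trees or late in the
forest.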