Under the hood
This page collects some useful information for debugging and developing bamboo.
Debugging problems
Despite a number of internal checks, bamboo may not work correctly in some cases. If you encounter a problem, the following list of tips may help to find some clues about what is going wrong:
- does the error message (Python exception or ROOT printout) provide any hint? (for batch jobs: check the logfile, its name should be printed)
- try to rerun on (one of) the offending sample(s) with debug printout turned on (by passing the -v or --verbose option; for failed batch jobs the main program prints the command to reproduce)
- if the problem occurs only for one or some samples: is there anything special in the analysis module for this sample, or in its tree format? The interactive mode to explore the decorated tree can be very useful to understand problems with expressions.
- in case of a segmentation violation while processing the events: check that you are not accessing any items from a container that are not guaranteed to exist (e.g. if you plot properties of the 2nd highest-pt jet in the event, the event selection should require at least two jets; with combinations or selections of containers this may not always be easy to find). The bamboo.analysisutils.addPrintout() function may help to insert printout statements in the RDataFrame graph, see its description for an example.
- check the open issues to see if your problem has already been reported, or is a known limitation, and, if not, ask for help on Mattermost or directly create a new issue
Different components and their interactions
Expressions: proxies and operations
The code to define expressions (cuts, weight factors, variables) in bamboo
is designed to look as if the per-event variables (or columns of the
RDataFrame) are manipulated directly, but what actually happens is that a
python object representation is constructed.
The classes used for this are defined in the bamboo.treeoperations
module, and inherit from TupleOp.
There are currently about 25 concrete implementations.
These classes contain the minimal information needed to obtain the value they represent (e.g. names of columns to retrieve, methods to call), but generally no complete type information or convenience methods to use them. They are used by almost all other bamboo code, but are not meant to be manipulated directly by user code; this is what the proxy classes are for.
The main restriction on TupleOp classes is that, once constructed, the
operation part of an expression should not be modified.
More specifically, an operation should not be modified after it has been
passed to any backend code: modifying it directly after construction (e.g.
after cloning) should be safe, but since subexpressions may be passed
on-demand, one should not make any assumptions in other cases.
This allows the hash of an operation to be cached, which makes the lookup of
expressions in sets and dictionaries very fast; the backend uses this
extensively.
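The reason for this restriction can be illustrated with a minimal sketch. The Op class below is a hypothetical stand-in, not bamboo's TupleOp; it only assumes that equality is structural and that the hash is computed once and then reused.

```python
class Op:
    """Hypothetical operation node (illustration only, not bamboo's TupleOp)."""
    __slots__ = ("name", "args", "_hash")

    def __init__(self, name, *args):
        self.name = name
        self.args = tuple(args)
        self._hash = None  # computed on first use, then reused

    def __eq__(self, other):
        return (isinstance(other, Op)
                and self.name == other.name and self.args == other.args)

    def __hash__(self):
        if self._hash is None:
            self._hash = hash((self.name, self.args))
        return self._hash


pt = Op("GetColumn", "MET_pt")
cut = Op("Greater", pt, 20.0)
defined = {cut: "col0"}  # e.g. operation -> name of an already defined column

# A structurally equal expression finds the existing entry.
found_before = defined.get(Op("Greater", Op("GetColumn", "MET_pt"), 20.0))

# Modifying the operation after it was hashed leaves a stale cached hash:
# the entry can no longer be found through an equal expression.
cut.args = (pt, 30.0)
found_after = defined.get(Op("Greater", Op("GetColumn", "MET_pt"), 20.0))
```

In this sketch found_before is the stored column name, while found_after is None, because the stored key no longer compares equal to the expression it was registered for.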
The proxy classes wrap one or more operations, and behave as the resulting
value.
In some cases the correspondence is trivial, e.g. a branch with a single
floating-point number is retrieved with a GetColumn operation and wrapped
with a FloatProxy, which overloads operators for basic math.
A proxy can also represent an object or concept that does not correspond to a
C++ type stored on the input tree, e.g. an electron (the collection of values
with the same index in all Electron_*[nElectron] branches), or a subset of the
collection of electrons, whose associated operation would be a list of
indices, with the proxy holding a reference to the original collection proxy.
All proxy classes (currently about 25) are defined in the bamboo.treeproxies
module, and inherit from the TupleBaseProxy base class, which means they
need to have an associated type, and hold a reference to a parent operation.
Operations only refer to other operations and constants, not to proxies, so
when an action (overloaded operator, member method, or a function from
bamboo.treefunctions) is performed on a proxy, a new proxy is
returned that wraps the resulting operation.
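As a rough illustration of this pattern, the sketch below uses a hypothetical Proxy class and represents operations as plain nested tuples; the class, the tuple layout and the type names are assumptions made for the example and do not reproduce the classes in bamboo.treeproxies.

```python
class Proxy:
    """Hypothetical proxy: wraps an operation and remembers a result type."""

    def __init__(self, op, type_name):
        self.op = op              # the wrapped operation (here: a nested tuple)
        self.type_name = type_name

    @staticmethod
    def _unwrap(value):
        # Operations refer to operations and constants only, never to proxies.
        return value.op if isinstance(value, Proxy) else value

    def __add__(self, other):
        return Proxy(("add", self.op, self._unwrap(other)), self.type_name)

    def __gt__(self, other):
        return Proxy(("greater", self.op, self._unwrap(other)), "bool")


met_pt = Proxy(("GetColumn", "MET_pt"), "float")
cut = (met_pt + 5.0) > 30.0  # each action returns a new proxy
```

Here cut.op is the nested tuple ("greater", ("add", ("GetColumn", "MET_pt"), 5.0), 30.0), which contains no proxy objects, and met_pt itself is left unchanged.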
In principle proxies are only there for the user code: starting from the input tree proxy, expressions are generated and passed to the backend, which strips off the proxy, and generates executable code from the operation (possibly retaining the result type from the proxy, if relevant for the produced output, e.g. when producing a skimmed tree). There are therefore few constraints on how the proxy classes work, as long as the result of any action on them produces a valid operation with the expected meaning.
Tree decorations
All user-defined expressions start from the decorated input tree, which can,
following the previous subsection, be seen as a tree proxy.
In fact, this is exactly what the tree decoration method does: it generates the
necessary ad-hoc types that inherit from the building block proxy classes from
bamboo.treeproxies, and also have all the attributes corresponding to
the branches of the input tree.
Technically, this is done with the type builtin, and a few descriptor
classes.
Much of the information needed for this can be obtained by introspecting the tree, but some details, e.g. about systematics to enable, may need to be supplied by the user.
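The combination of the type builtin and descriptors can be sketched as follows. All names here (BranchDescriptor, TreeBaseProxy, decorate) are hypothetical, and the branch list is passed in by hand rather than obtained by introspecting a tree.

```python
class BranchDescriptor:
    """Hypothetical descriptor: builds the value for one branch on access."""

    def __init__(self, branch_name, type_name):
        self.branch_name = branch_name
        self.type_name = type_name

    def __get__(self, instance, owner=None):
        if instance is None:  # accessed on the class itself
            return self
        # A real implementation would return a proxy; a tuple is enough here.
        return (("GetColumn", self.branch_name), self.type_name)


class TreeBaseProxy:
    """Hypothetical building-block class that the ad-hoc type inherits from."""

    def __init__(self, tree_name):
        self.tree_name = tree_name


def decorate(tree_name, branches):
    """Create an ad-hoc type with one attribute per branch, and instantiate it."""
    attributes = {name: BranchDescriptor(name, type_name)
                  for name, type_name in branches.items()}
    tree_type = type(f"{tree_name}Proxy", (TreeBaseProxy,), attributes)
    return tree_type(tree_name)


tree = decorate("Events", {"MET_pt": "float", "nJet": "unsigned int"})
met = tree.MET_pt  # evaluated by the descriptor at access time
```

Because the attributes are descriptors on a generated class, nothing is built for a branch until it is actually accessed.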
Selections, plots, and the RDataFrame
The main thing to know about the RDataFrame in bamboo is that partial results
are declared upon construction of Plot and Selection objects.
The backend keeps a shadow graph of selections (with their alternatives under
systematic variations, if needed), and, for each of these, a list of the
operations that have been defined as a new column.
When an operation is converted to a C++ expression string, a reference to the
selection node where it is needed is passed, such that subexpressions can be
defined on-demand (as explained in this section, if a
precalculated column is needed for a selection, it may be beneficial to declare
that earlier rather than later).
This makes the verbose output a bit harder to read (to avoid redeclaring the
same function, argument names are also replaced), but ensures the correct order
of definition and reasonable efficiency.
Currently, all operations that take range arguments, and those that are
explicitly marked, are precalculated.
Function calls, notably, are not, since most are cheap to evaluate; this is
why expensive function calls should sometimes be explicitly requested to be
precalculated for a specific selection with
bamboo.analysisutils.forceDefine().
Organisationally, the bookkeeping code, and all the code that accesses the
interpreter and RDataFrame directly, is kept in bamboo.dataframebackend,
while the conversion of a TupleOp to a C++ expression string is done by its
get_cppStr() method (many of these are trivial, but for range-based
operations, which define a helper function, they get a bit more involved).
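The on-demand definition of columns along the graph of selections can be pictured with the simplified sketch below. SelNode, its methods and the generated column names are hypothetical and much simpler than the bookkeeping in bamboo.dataframebackend; the sketch only assumes that a column defined at one selection is visible from the selections derived from it.

```python
import itertools

_column_ids = itertools.count()  # hypothetical source of unique column names


class SelNode:
    """Hypothetical node of the shadow graph: one per selection."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.columns = {}  # operation -> name of the column defined at this node

    def find(self, op):
        """Look for an existing column for op on this node or any ancestor."""
        node = self
        while node is not None:
            if op in node.columns:
                return node.columns[op]
            node = node.parent
        return None

    def define(self, op):
        """Return the column for op, defining it on this node if needed."""
        column = self.find(op)
        if column is None:
            column = f"col{next(_column_ids)}"
            self.columns[op] = column
        return column


no_sel = SelNode("noSel")
two_mu = SelNode("twoMu", parent=no_sel)
two_el = SelNode("twoEl", parent=no_sel)

late = ("sum", "Jet_pt")         # first needed in the two sibling selections
late_mu = two_mu.define(late)
late_el = two_el.define(late)    # not visible from twoMu, so defined again

early = ("sum", "Muon_pt")       # declared on the common parent first
early_root = no_sel.define(early)
early_mu = two_mu.define(early)  # found on the ancestor and reused
early_el = two_el.define(early)
```

In this picture the late operation ends up as two separate columns, one per sibling selection, whereas the early one is defined once and shared, which is the motivation for declaring an expensive precalculated column earlier rather than later.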
Running the tests, or adding test cases
The test suite consists of two parts: the standard tests, which are run for
every opened merge request and for every push to a pull request or the master
branch, and a set of regression tests that perform a bin-by-bin comparison of
the histograms produced with a simple plotter over a small dataset.
The former are closer to unit tests and limited integration tests, so they
test some components in isolation, and sequences of basic operations,
like constructing a few Selection and Plot objects.
All the tests can easily be run with pytest, the standard tests with
pytest <bamboo_clone>/tests
and the additional regression tests with
pytest <bamboo_clone>/tests/test_plotswithreference.py --plots-reference=/home/ucl/cp3/pdavid/scratch/bamboo_test_reference
where the directory above is one set of reference histograms at the UCLouvain T2 grid site; details on producing such a set are given below. These are not fully integrated with GitLab CI yet because they require access to CMS NanoAOD files. More generally, passing a specific file to pytest will make it run only the tests defined in that file.
Note
Tests are not only useful when developing new code.
They can also be very helpful in understanding some unexpected or buggy
behaviour, and pytest makes it very easy to run the tests, and add more:
just add a function with a name starting with test_ in one of the test files,
with an assertion to check if the test passes, or add a file with a name
starting with test_ to the tests directory and define your test
cases there.
Contributing tests is one of the easiest ways to get to know the internals
and help with bamboo development, so more tests are always welcome.
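As a minimal sketch of what such a test file can look like, the example below could be saved under a name like test_example.py in the tests directory. The file name and the delta_phi helper are hypothetical and only serve to have something to test; a real test would import and exercise bamboo code instead.

```python
import math


def delta_phi(phi1, phi2):
    """Hypothetical helper under test: angle difference wrapped to [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi


def test_delta_phi_wraps_around():
    # 3.0 - (-3.0) = 6.0 is larger than pi, so one full turn is subtracted
    assert math.isclose(delta_phi(3.0, -3.0), 6.0 - 2 * math.pi)


def test_delta_phi_same_angle():
    assert delta_phi(1.2, 1.2) == 0.0
```

pytest collects both functions because their names start with test_, and reports a failure for any function whose assertion does not hold.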
The regression tests will by default use a temporary directory, so the output
is automatically removed when the test run finishes.
This can be changed by passing a directory to the --plots-output argument.
To turn such an output directory into a new reference directory, two files
should be added, test_zmm_ondemand.yml and test_zmm_postproc.yml,
which are the configuration files for the on-demand and postprocessed runs,
respectively.
In fact the only output files that are used are the histogram files in
the respective results directories, so the rest of the output directories
can, but need not, be removed.
The T2_BE_UCL test configs use a single file of data, DoubleMuon for 2016 and DoubleEG for 2017, and 100k events from a Drell-Yan simulation sample for each of the two years, but any similar configuration should work. The postprocessing must add the full set of jet and MET kinematic variations.
The bin-by-bin comparison may also be useful in other contexts,
so it is made available as a command-line script in
<bamboo_clone>/tests/diffHistsAndFiles.py.
Full documentation is available through the --help option, but generally
it takes two directories with histograms, and will compare all histograms in
ROOT files present in both (if some ROOT files are present in one but not
the other directory, that will also be considered a failure).
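The idea of a bin-by-bin comparison can be sketched in plain Python, without ROOT. The function below is not the code of diffHistsAndFiles.py: it assumes that each set of histograms has already been read into a dictionary mapping a histogram name to a list of bin contents, and the relative tolerance and the treatment of a histogram missing from one set are assumptions made for the example.

```python
import math


def diff_hists(reference, new, rel_tol=1e-9):
    """Return a list of differences between two sets of histograms.

    Both arguments map a histogram name to its list of bin contents
    (hypothetical in-memory representation). An empty list means no difference.
    """
    problems = []
    for name in sorted(set(reference) | set(new)):
        if name not in reference or name not in new:
            problems.append(f"{name}: present in only one set")
            continue
        ref_bins, new_bins = reference[name], new[name]
        if len(ref_bins) != len(new_bins):
            problems.append(f"{name}: different number of bins")
            continue
        for i, (x, y) in enumerate(zip(ref_bins, new_bins)):
            if not math.isclose(x, y, rel_tol=rel_tol):
                problems.append(f"{name}: bin {i} differs ({x} vs {y})")
    return problems


reference = {"mll": [0.0, 12.0, 30.5], "njets": [5.0, 3.0]}
same = diff_hists(reference, {"mll": [0.0, 12.0, 30.5], "njets": [5.0, 3.0]})
changed = diff_hists(reference, {"mll": [0.0, 13.0, 30.5]})
```

With these inputs, same is an empty list, while changed reports bin 1 of mll and the missing njets histogram.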