Advanced topics
Loading (and using) C++ modules
The bamboo.root module defines a few thin wrappers to load additional
libraries or headers with the ROOT interpreter (PyROOT makes them directly
accessible through the global namespace, which can be imported as gbl
from bamboo.root).
The bamboo.root.loadDependency() method allows loading a combination of
headers and libraries; it is best called through the backend’s
addDependency() method, which will also correctly register all components
for the compilation of standalone executables when using the compiled
backend (see below).
As an example, the library that contains the dictionaries for the classes
used in Delphes output trees can be added as follows:
be.addDependency(libraries="Delphes")
In this specific case, bamboo.root.addLibrary("libDelphes") should also be
added to the analysis module’s initialize() method, to make sure the
library is loaded before any file is opened.
For a module that is used in calculating expressions, it is sufficient to load
it only from the prepareTree() or processTrees() method
(repeated loads should not cause problems).
C++ methods can be used directly from an expression.
Non-member methods that are known by the interpreter (e.g. because the
corresponding header has been added with bamboo.root.loadHeader())
can be retrieved with bamboo.treefunctions.extMethod(), which returns
a decorated version of the method.
It is often useful to define a class that stores some parameters, and then call
a member method with event quantities to obtain a derived quantity (this is
also the mechanism used for most of the builtin corrections).
In order to use such a class, its header (and shared library, if necessary)
should be loaded as above, and an instance defined with
bamboo.treefunctions.define(), e.g.
myCalc = op.define("MyCalculatorType", 'const auto <<name>> = MyCalculatorType("test");')
myCorrection = myCalc.evaluate(tree.Muon[0].pt, tree.Muon[1].pt)
Warning
With implicit multi-threading enabled, only thread-safe methods may be called in this way (e.g. const member methods, without global or member variables used for caching).
Note
The usual logic to avoid redefinition of these variables is applied.
In cases like the above, where all parameters are supplied at once, this will
work as expected.
If the calculator is further configured by calling member methods (it can
be accessed directly through PyROOT), it is safer to create a unique
instance for each sample, e.g. by adding a comment that contains the sample
name at the end of the declaration (an optional nameHint argument can
be given to make the generated code more readable, but it will be ignored
if the declaration string is the same).
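To make the instance-per-sample trick above concrete, here is a small plain-Python sketch (no bamboo required; MyCalculatorType and the declaration format are taken from the example above, and declaration_for is a hypothetical helper): appending a comment with the sample name yields a distinct declaration string, and therefore a distinct instance, for each sample.

```python
# Sketch: make the C++ declaration string unique per sample by appending
# a comment with the sample name, so bamboo does not merge the definitions.
def declaration_for(sample):
    # the "<<name>>" placeholder and MyCalculatorType come from the example above
    return 'const auto <<name>> = MyCalculatorType("test"); // sample: {0}'.format(sample)

declA = declaration_for("DY_M50")
declB = declaration_for("TTbar")
# different strings, so each sample gets its own calculator instance
assert declA != declB
```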
Distributed RDataFrame
This adds support for DistRDF, which distributes the computations dynamically
to a cluster without having to manage the jobs manually in bamboo.
The backend handling the distribution is either Dask (distributed),
typically with dask-jobqueue, or Spark.
Through dask-jobqueue, “regular” batch systems based on Slurm or HTCondor are supported.
The long-term goal is to deprecate the batch job management (driver
mode) in Bamboo entirely, and rely only on DistRDF.
This section describes how to configure bamboo to use these systems efficiently.
Warning
These features should still be considered as experimental. More feedback is welcome on how to fine-tune these systems in practice.
- To use DistRDF, call bambooRun with the additional --distrdf-be <BE> argument, where <BE> can be one of:
  - dask_local: local-machine multiprocessing, useful for testing but functionally equivalent to --threads
  - dask_scheduler: connect to an existing Dask scheduler for your cluster
  - dask_slurm or dask_condor: Dask handles job submission to a Slurm or HTCondor cluster
  - spark: connect to an existing Spark cluster (or spawn a locally running cluster)
DistRDF works both with the default (sequential) processing mode, where one
sample is processed after the other, and with --distributed parallel, where
the graphs for all samples are first built before processing all samples in
parallel.
Obviously, --distributed driver does not make sense here.
To configure the distributed backend, add a corresponding section in the environment file, e.g.:
[dask_slurm]
adapt_max = 100 ; submit max. 100 jobs
Have a look at the example at examples/distributed.ini for more configuration options.
For dask-jobqueue, you will in addition need to configure the submission to the Slurm/HTCondor job scheduler using a jobqueue.yml file placed into ~/.config/dask/. Have a look at the documentation of dask-jobqueue and at the example at examples/dask_jobqueue.yaml.
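As an illustration, a minimal ~/.config/dask/jobqueue.yaml for a Slurm cluster might look like the following sketch; the values are placeholders to adapt to your cluster, and the full list of keys is given in the dask-jobqueue documentation.

```yaml
# minimal dask-jobqueue configuration for Slurm (illustrative values)
jobqueue:
  slurm:
    cores: 1              # threads per batch job
    processes: 1          # dask worker processes per job
    memory: "2GB"         # memory requested per job
    walltime: "01:00:00"
    # queue: somepartition  # optionally select a partition
```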
Ordering selections and plots efficiently
Internally, Bamboo uses the RDataFrame class to process the input samples and
produce histograms or skimmed trees—in fact no python code is run while
looping over the events: Bamboo builds up a computation graph when
Selection and Plot objects are defined by the analysis module’s
definePlots() method, RDataFrame compiles the expressions for the cuts and
variables, and the input files and events are only looped over once, when
the histograms are retrieved and stored.
In practice, however, there are not only Filter (Selection) and Fill (Plot)
nodes in the computation graph, but also Define nodes that calculate a
quantity based on other columns and make the result available for downstream
nodes to use directly.
This is computationally more efficient if the calculation is complex enough.
Bamboo tries to make a good guess at which (sub-)expressions are worth
pre-calculating (typically operations that require looping over a collection),
but the order in which plots and selections are defined may still help to avoid
inserting the same operation twice in the computation graph.
The main feature to be aware of is that RDataFrame makes a node in the
computation graph for every Define operation, and the defined column can
only be used from nodes downstream of that.
Logically, however, all defined columns needed for plots or sub-selections of
one selection will need to be evaluated for all events passing this selection,
and the most efficient is to do this only once, so ideally all definitions
should be inserted right after the Filter operation of the selection, and
before any of the Fill and subsequent Filter nodes.
This is a bit of a simplification: it is possible to imagine cases where it can
be better to have a column only defined for the sub-nodes that actually use it,
but then it is hard to know in all possible cases where exactly to insert the
definitions, so the current implementation opts for a simple and predictable
solution: on-demand definitions of subexpressions are done when Plot and
Selection objects are constructed, and they update the computation-graph node
that other nodes deriving from the same selection will be based on.
A direct consequence of this is that it is usually more efficient to first
define plots for a stage of the selection, and only then define refined
selections based on it—otherwise the subselection will be based on the
node without the columns defined for the plots and, in the common case where
the same plots are made at different stages of the selection, recreate nodes
with the same definitions in its branch of the graph.
As an illustration, the pseudocode equivalent of these two cases is
## define first subselection then plots
## some_calculation(other_columns) is done twice
if selectionA:
    if selectionB:
        myColumn1 = some_calculation(other_columns)
        myPlot1B = makePlot(myColumn1)
    myColumn2 = some_calculation(other_columns)
    myPlot1A = makePlot(myColumn2)

## define first plots then subselection
## some_calculation(other_columns) is only done once
if selectionA:
    myColumn1 = some_calculation(other_columns)
    myPlot1A = makePlot(myColumn1)
    if selectionB:
        myPlot1B = makePlot(myColumn1)
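The effect can also be checked with a small self-contained sketch (plain Python, no bamboo) that counts how often the shared calculation runs in each ordering:

```python
calls = {"count": 0}

def some_calculation(x):
    # stands in for an expensive per-event computation
    calls["count"] += 1
    return 2 * x

# ordering 1: subselection first -- the column is (re)defined in each branch
calls["count"] = 0
myPlot1B = some_calculation(5)   # defined under selectionB
myPlot1A = some_calculation(5)   # defined again under selectionA
assert calls["count"] == 2

# ordering 2: plots first -- the column is defined once and reused downstream
calls["count"] = 0
myColumn1 = some_calculation(5)  # defined right after selectionA's filter
myPlot1A = myColumn1
myPlot1B = myColumn1             # selectionB reuses the same column
assert calls["count"] == 1
```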
This is why it is advisable to reserve the definePlots() method of the
analysis module for defining event and object container selections, and to
define helper methods that declare the plots for a given selection—with a
parameter that is inserted in the plot name to keep the names unique if the
helper is used to define the same plots for different selection stages, e.g.
def makeDileptonPlots(self, sel, leptons, uname):
    from bamboo.plots import Plot, EquidistantBinning
    from bamboo import treefunctions as op
    plots = [
        Plot.make1D("{0}_llM".format(uname),
            op.invariant_mass(leptons[0].p4, leptons[1].p4), sel,
            EquidistantBinning(100, 20., 120.),
            title="Dilepton invariant mass",
            plotopts={"show-overflow": False}
            )
        ]
    return plots

def definePlots(self, t, noSel, sample=None, sampleCfg=None):
    from bamboo import treefunctions as op
    plots = []
    muons = op.select(t.Muon, lambda mu : op.AND(mu.p4.Pt() > 20., op.abs(mu.p4.Eta()) < 2.4))
    twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
    plots += self.makeDileptonPlots(twoMuSel, muons, "DiMu")
    jets = op.select(t.Jet, lambda j : j.p4.Pt() > 30.)
    twoMuTwoJetSel = twoMuSel.refine("twoMuonsTwoJets", cut=[ op.rng_len(jets) > 1 ])
    plots += self.makeDileptonPlots(twoMuTwoJetSel, muons, "DiMu2j")
    return plots
Finally, there are some cases where it is safest to force the inclusion of a
calculation at a certain stage, for instance when performing expensive function
calls, since the default strategy is not to precalculate these (most function
calls are not costly).
A prime example of this is the calculation of modified jet collections, e.g.
with an alternative JEC applied, which is done in a separate C++ module (see
below), and is probably the slowest operation in most analysis tasks.
The definition can be added explicitly under a selection by calling the
bamboo.analysisutils.forceDefine() method, e.g. with
for calcProd in t._Jet.calcProds:
    forceDefine(calcProd, mySelection)
Different backends
Different approaches for converting plots and selections to an RDataFrame
graph have been implemented, each with some advantages and disadvantages.
The --backend option of the bambooRun command allows selecting one.
The default approach is to generate the necessary helper methods and define
JITted nodes whenever a plot or selection is added.
This has the advantage that if there is a problem that can be detected at this
stage, it is easy to trace back.
The lazy backend (--backend=lazy) instead waits for all plots and
selections to be defined before constructing the graph.
In principle, this ensures the ordering from the previous section, so
depending on the case it will generate a more efficient graph.
It uses the same JITted Define and Filter nodes as the default backend.
The experimental compiled backend takes things a step further, by generating the full C++ source for a standalone executable, essentially eliminating the need for any type inference at runtime. The structure of the graph will be identical to the one from the lazy backend, but with compiled instead of JITted nodes.
Note
Event processing with such an executable will be faster—in many cases by a significant amount—but the time needed for compilation, and especially the memory usage during compilation and linking, may be non-negligible. A few dynamic features (e.g. progress printing) are currently disabled as well. More efficiently using the compiled backend is closely tied to reusing the same executable across batch jobs and for different samples, and compiling different executables in parallel. These are planned, but they need some restructuring of the code, which will be done in steps as they will also allow for some other new features to be added, e.g. only producing a subset of the histograms, and using distributed RDF.
The compiled backend takes a few additional options, which can be passed by
setting its attributes (a reference to the backend should be returned by the
prepareTree() method).
These are:
- cmakeConfigOptions: options passed to CMake when configuring the build, in addition to the path and the variables needed to pick up the bamboo C++ extensions (default: ["-DCMAKE_BUILD_TYPE=Release"], which implies -O3)
- cmakeBuildOptions: options passed to cmake --build when compiling the executable
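As a sketch of how these attributes could be set, the snippet below uses types.SimpleNamespace as a stand-in for the backend object, since this cannot run outside bamboo; in a real analysis module the attributes would be set on the backend reference returned by prepareTree(). The --parallel build option is standard CMake, not bamboo-specific.

```python
from types import SimpleNamespace

# stand-in for the compiled backend; in an analysis module this would be
# the backend reference returned by prepareTree()
be = SimpleNamespace()
be.cmakeConfigOptions = ["-DCMAKE_BUILD_TYPE=Release"]  # the default, implies -O3
be.cmakeBuildOptions = ["--parallel", "4"]              # forwarded to `cmake --build`
```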
In case anything goes wrong during the configuration or build, the temporary directory will be saved, and the errors and path printed in the log file.