User guide

This section contains some more information on doing your analysis with bamboo. It assumes you have successfully installed it following the instructions in the previous section.

The first thing to make sure is that bamboo can work with your trees. With CMS NanoAOD, many analyses use the same (or a very similar) tree format, which is why a set of decorators for these is included (see bamboo.treedecorators.decorateNanoAOD()); to make stacked histogram plots from them it is sufficient to make your analysis module inherit from bamboo.analysismodules.NanoAODHistoModule (which calls this method from its prepareTrees() method). Other types of trees can be included in a similar way, but a bit of development is needed to provide a more convenient way to do so (help welcome).

Running bambooRun

The bambooRun executable script can be used to run over some samples and derive other samples or histograms from them. It needs at least two arguments: a python module, which will tell it what to do for each event, and a configuration file with a list of samples to process, plot settings etc.

Typically, bambooRun would be invoked with

bambooRun -m module-specification config-file-path -o my-output

where module-specification is of the format modulename:classname, and modulename can be either a file path like somedir/mymodule.py or an importable module name like myanalysispackage.mymodule. This will construct an instance of the specified module, passing it any command-line arguments that are not used directly by bambooRun, and run it.

The default base module (bamboo.analysismodules.AnalysisModule, see below) provides a number of convenient command-line options (and individual modules can add more by implementing the addArgs() method).

  • the -h (--help) switch prints the complete list of supported options and arguments, including those defined by the module if used with bambooRun -h -m mymodule

  • the -o (--output) option can be used to specify the base directory for output files

  • the -v (--verbose) switch will produce more output messages, and also print the full C++ code definitions that are passed to RDataFrame (which is very useful for debugging)

  • the -i (--interactive) switch will only load one file and launch an IPython terminal, where you can have a look at its structure and test expressions

  • the --maxFiles option can be used to specify a maximum number of files to process for each sample, e.g. --maxFiles=1 to check that the module runs correctly in all cases before submitting to a batch system

  • the --eras option specifies which of the eras from the configuration file to consider, and which type of plots to make. The format is [mode][:][era1,era2,...], where mode is one of split (plots for each of the eras separately), combined (only plots for all eras combined) or all (both of these, this is the default).

  • the --distributed option specifies how the processing should run (locally, on a cluster, …), see below.

  • the -t (--threads) option can be used to run in multi-threaded mode, both locally and in batch jobs
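
As mentioned above, individual modules can define extra options by implementing addArgs(). A minimal sketch could look as follows (the --channel option is purely illustrative; in bamboo the parsed values are available as self.args inside the module):

from bamboo.analysismodules import NanoAODHistoModule

class MyModule(NanoAODHistoModule):
    def addArgs(self, parser):
        super().addArgs(parser)  # keep the options defined by the base classes
        # purely illustrative option; the parsed value would be self.args.channel
        parser.add_argument("--channel", default="mumu",
                            help="channel to process (hypothetical example)")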

Computing environment configuration file

For some features such as automatically converting logical filenames from DAS to physical filenames at your local T2 storage (or falling back to xrootd), submitting to a batch cluster etc., some information about the computing resources and environment is needed. In order to avoid proliferating the command-line interface of bambooRun, these pieces of information are bundled in a file that can be passed in one go through the --envConfig option. If not specified, Bamboo will try to read bamboo.ini in the current directory, and then $XDG_CONFIG_HOME/bamboorc (which typically resolves to ~/.config/bamboorc). Since these settings are not expected to change often or much, it is advised to copy the closest example (e.g. examples/ingrid.ini or examples/lxplus.ini) to ~/.config/bamboorc and edit if necessary.

Analysis YAML file format

The analysis configuration file should be in the YAML format. This was chosen because it can easily be parsed while also being very readable (see the YAML Wikipedia page for some examples and context) - it essentially becomes a nested dictionary, which can also contain lists.

Three top-level keys are currently required: tree, with the name of the TTree inside the files (e.g. tree: Events for NanoAOD); samples, with a list of samples to consider; and eras, with a list of data-taking periods and their integrated luminosities. For stacked histogram plots, a plotIt section should also be specified (the bamboo.analysisutils.runPlotIt() method will insert the files and plots sections and run plotIt with the resulting configuration; depending on the --eras option passed, per-era plots, combined plots, or both (the default) will be produced). Each entry in the plots section combines the settings explicitly passed to make1D(), those present in plotDefaults, and those specified under the plotdefaults block in the plotIt section of the analysis configuration file (in this order of precedence); if a value is callable, the result of calling it on the Plot is used (which may be useful to adjust e.g. the axis range to the binning; by default the binning range is used as the x-axis range). The full list of plotIt configuration options can be found on this page.
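
As an illustration, common defaults could be set under plotdefaults as below (y-axis, log-y, show-ratio, and save-extensions are standard plotIt plot options, but please check the plotIt documentation before relying on them):

plotIt:
  plotdefaults:
    y-axis: Events
    log-y: both
    show-ratio: true
    save-extensions: [pdf]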

Each entry in the samples dictionary (the keys are the names of the samples) is another dictionary. The files to be processed can be specified directly as a list under files, either with paths relative to the location of the config file (which is useful for testing) or with absolute paths/urls (e.g. xrootd). If files is a string, it is taken as the path of a text file containing a list of such paths/urls. For actual analyses, however, samples will usually be retrieved from a database, e.g. DAS or SAMADhi (support for the latter still needs to be implemented). In that case, the database path or query can be specified under db, e.g. db: das:/SingleMuon/Run2016E-Nano14Dec2018-v1/NANOAOD. The results of these queries can be cached locally by adding a dbcache top-level configuration entry with a directory where the text files can be stored. For each sample a file <sample_name>.txt will be created, containing the list of files and a comment with the db value used to create it, such that changes can be detected automatically and the query redone. To force rerunning some or all queries, the corresponding files or the whole cache directory can be moved or deleted.
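
For example, a small test sample with an explicit file list could be declared as follows (the paths and names are placeholders):

samples:
  my_test_sample:
    files:
      - testfiles/sample_part1.root
      - root://some.site//store/user/someone/sample_part2.root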

Which NanoAOD samples to use?

When analysing CMS NanoAOD samples there are two options: postprocessing the centrally produced NanoAOD samples with CRAB to add corrections and systematic variations as new branches, or calculating these on demand (see the corresponding recipes for more details). Which solution is optimal depends on the case (it is a trade-off between file size and the time spent on calculating the variations), but the latter is the easiest to get started with: just create some Rucio rules to make the samples available locally: the transfers are usually very fast—much faster than processing all the samples with CRAB. Tip: with Rucio containers you can group datasets and manage them together. Depending on the site policies you may need to ask for quota or approval of the rules.

Tip

Samples in DAS and SAMADhi rarely change, and reading a local file is almost always faster than doing queries (and does not require a grid proxy etc.), so especially when using many samples from these databases it is recommended to cache the file lists resulting from these queries, by specifying a path under dbcache at the top level of the configuration file (see below for an example).

For data, it is usually necessary to specify a json file to filter the good luminosity blocks (and a run range to consider from it, for efficiency). If an url is specified for the json file, the file will be downloaded automatically (and added to the input sandbox for the worker tasks, if needed).

For the formatting of the stack plots, each sample needs to be in a group (e.g. ‘data’ for data etc.), which will be taken together as one contribution. The era key specifies which era (one of those specified in the eras section, see above) the sample corresponds to, and which luminosity value should be used for the normalisation.

For the normalisation of simulated samples in the stacks, the number of generated events and the cross-section are also needed. The latter should be specified as cross-section with the sample (in the same units as the luminosity for the corresponding era); the former can be computed from the input files. For this, the bamboo.analysismodules.HistogramsModule base class will call the mergeCounters method when processing the samples, and the readCounters method to read the values from the results file - for NanoAOD the former merges the Runs trees and saves the result, while the latter performs the sum of the branch with the name specified under generated-events.
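
Concretely, the resulting per-sample scaling amounts to the following (an illustrative formula, not the actual implementation):

# each simulated sample is effectively scaled in the stacks as
scale = luminosity * cross_section / generated_events
# with the luminosity of the corresponding era, the cross-section from the
# sample entry, and generated_events the sum of e.g. the genEventSumw branch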

For large samples, a split property can be specified, such that the input files are spread over different batch jobs. A positive number is taken as the number of jobs to divide the inputs over, while a negative number gives the number of files per job (e.g. split: 3 will create three jobs to process the sample, while split: -3 will result in jobs that process three files each).

Altogether, a typical analysis YAML file would look like the following (but with many more sample blocks, and typically a few era blocks; the plotIt section is left out for brevity).

tree: Events
eras:
  '2016':
    luminosity: 12.34
dbcache: dascache
samples:
  SingleMuon_2016E:
    db: das:/SingleMuon/Run2016E-Nano14Dec2018-v1/NANOAOD
    run_range: [276831, 277420]
    certified_lumi_file: https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions16/13TeV/ReReco/Final/Cert_271036-284044_13TeV_23Sep2016ReReco_Collisions16_JSON.txt
    era: 2016
    group: data

  DY_high_2017:
    db: das:/DYJetsToLL_M-50_TuneCP5_13TeV-amcatnloFXFX-pythia8/RunIIFall17NanoAODv4-PU2017_12Apr2018_Nano14Dec2018_new_pmx_102X_mc2017_realistic_v6_ext1-v1/NANOAODSIM
    era: 2017
    group: DY
    cross-section: 5765.4
    generated-events: genEventSumw
    split: 3

Tip

It is possible to insert the content of one configuration file into another, e.g. to separate or reuse the plot- and sample-related settings: simply use the syntax !include file.yml in the exact place where you would like to insert the content of file.yml.
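
For instance, the sample definitions could be kept in a separate file (the file names are hypothetical, and samples.yml is assumed to contain the sample mapping):

# analysis.yml
tree: Events
samples: !include samples.yml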

Analysis module

For an analysis module to be run with bambooRun, it in principle only needs a constructor that takes the command-line arguments, and a run method. bamboo.analysismodules provides a more interesting base class, AnalysisModule, which implements a lot of common functionality (most notably: parsing the analysis configuration, and running sequentially or distributed, including running as a worker task in the latter case), and provides the addArgs(), initialize(), processTrees(), postProcess(), and interact() interface member methods that should be further specified by subclasses (see the reference documentation for more details).

HistogramsModule does this for stacked histogram plots, composing processTrees() from prepareTree() and definePlots(), while taking the JSON lumi block mask and counter merging into account. It also calls the plotIt executable from postProcess() (with the plots list and analysis configuration file, it has all the information required for that). NanoAODHistoModule supplements this with the decorations and counter merging and reading for NanoAOD, such that all a final module needs to do is define plots and selections, as in the examples.nanozmumu example. This layered structure allows code to be maximally reused for other types of trees.

For the code inside the module, the example is also very instructive:

def definePlots(self, t, noSel, sample=None, sampleCfg=None):
    from bamboo.plots import Plot, EquidistantBinning
    from bamboo import treefunctions as op

    plots = []

    twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(t.Muon) > 1 ])
    plots.append(Plot.make1D("dimu_M", op.invariant_mass(t.Muon[0].p4, t.Muon[1].p4), twoMuSel,
            EquidistantBinning(100, 20., 120.), title="Dimuon invariant mass", plotopts={"show-overflow":False}))

    return plots

The key classes are defined in bamboo.plots: Plot and Selection (see the reference documentation for details). The latter represents a consistent set of selection requirements (cuts) and weight factors (e.g. to apply corrections). Selections are defined by refining a “root selection” with additional cuts and weights, and each should have a unique name (an exception is raised at construction otherwise). The root selection allows for some customisation upfront, e.g. applying the JSON luminosity block mask for data. A plot object refers to a selection, and specifies which variable(s) to plot, with which binning(s), labels, options etc. (the plotopts dictionary is copied directly into the plot section of the plotIt configuration file).

Histograms corresponding to systematic variations (of scalefactors, collections etc., see below) are by default generated automatically alongside the nominal one. This can however easily be disabled at the level of a Selection (and, consequently, all Selection instances deriving from it, and all Plot instances using it) or a single plot, by passing autoSyst=False to the refine() or make1D() (or related) method, respectively, when constructing them; setting noSel.autoSyst = False right after retrieving the decorated tree and root selection would therefore disable all automatic systematic variations.
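
As a sketch of the refine() keyword arguments discussed above (muSF stands for a hypothetical per-event correction weight, and the opposite-sign cut is just an example, building on the twoMuSel selection from the code above):

osSel = twoMuSel.refine("osDiMu",
        cut=[ t.Muon[0].charge != t.Muon[1].charge ],  # additional cut(s)
        weight=[ muSF ],      # hypothetical per-event correction weight
        autoSyst=False)       # no automatic systematic variations from here on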

Specifying cuts, weights, and variables: expressions

The first argument to the definePlots() method is the “decorated” tree, a proxy object from which expressions can be derived. Sticking with the NanoAOD example, t.Muon is another proxy object for the muon collection (similarly for the other objects), t.Muon[0] retrieves the leading-pt muon proxy, and t.Muon[0].p4 its momentum four-vector. The proxies are designed to behave as much as possible like the value types they correspond to (you can get an item from a list, an attribute from an object, and work with numerical values, e.g. t.Muon[0].p4.Px()+t.Muon[1].p4.Px()), but for some more complex operations specific functions are needed. These are as much as possible defined in the bamboo.treefunctions module; see Building expressions for an overview of all the available methods.

Ideally, the decorated tree and the bamboo.treefunctions module are all you ever need to import and know about the decorations. Therefore the best way to proceed now is to get a decorated tree inside an IPython shell and play around. For bamboo.analysismodules.HistogramsModule this can always be done by passing the --interactive flag, with either one of the following commands (depending on whether you copied the NanoAOD test file above):

bambooRun -m bamboo/examples/nanozmumu.py:NanoZMuMu --interactive --distributed=worker bamboo/tests/data/DY_M50_2016.root
bambooRun -m bamboo/examples/nanozmumu.py:NanoZMuMu --interactive bamboo/examples/test_nanozmm.yml [ --envConfig=bamboo/examples/ingrid.ini ] -o int1

The decorated tree is in the tree variable (the original TChain is in tup) and the bamboo.treefunctions module is available as op. The c_... methods construct a constant, the rng_... methods work on a collection and return a single value, and the select() method returns a reduced collection (internally, only a list of indices to the passing objects is created, and the result is a proxy that uses this list). Some of the rng_... methods are extremely powerful, e.g. rng_find() and rng_max_element_by().
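
A few expressions that could be tried in such an interactive session (the cut values are arbitrary):

goodMuons = op.select(tree.Muon, lambda mu: op.AND(mu.pt > 20., op.abs(mu.eta) < 2.4))
nGood = op.rng_len(goodMuons)              # a single value from a collection
leadPt = goodMuons[0].pt                   # attribute of the leading selected muon
hardestJet = op.rng_max_element_by(tree.Jet, lambda j: j.pt)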

Tip

In addition to the branches read from the input tree, all elements of collections have an idx attribute which contains their index in the original collection (base), also in case they are obtained from a subset (with select() or a slice), differently ordered version (with sort()), or systematic variation (e.g. for jets) of the collection. This can be especially useful to ensure that two objects are (not) identical, or when directly comparing systematic variations. Similarly, all collections, selections, slices etc. have an idxs attribute, with the list of indices in the original collection.

This can also be exploited to precalculate an expensive quantity for a collection of objects (with map()), or even to evaluate a quantity for items passing different selections (e.g. the passing and failing selections), something like fun(passing.base[op.switch(op.rng_len(passing) > 0, passing[0].idx, failing[0].idx)]).
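
For instance, the first part could look like this (expensive_quantity is a hypothetical helper function):

# evaluated once per jet; the result is a list aligned with the jet collection
jetScores = op.map(tree.Jet, lambda j: expensive_quantity(j))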

The proxy classes are generated on the fly with all branches as attributes, so tab-completion can be used to have a look at what’s there:

In [1]: tree.<TAB>
  tree.CaloMET                           tree.SoftActivityJetHT10
  tree.Electron                          tree.SoftActivityJetHT2
  tree.FatJet                            tree.SoftActivityJetHT5
  tree.Flag                              tree.SoftActivityJetNjets10
  tree.HLT                               tree.SoftActivityJetNjets2
  tree.HLTriggerFinalPath                tree.SoftActivityJetNjets5
  tree.HLTriggerFirstPath                tree.SubJet
  tree.Jet                               tree.Tau
  tree.L1Reco_step                       tree.TkMET
  tree.MET                               tree.TrigObj
  tree.Muon                              tree.deps
  tree.OtherPV                           tree.event
  tree.PV                                tree.fixedGridRhoFastjetAll
  tree.Photon                            tree.fixedGridRhoFastjetCentralCalo
  tree.PuppiMET                          tree.fixedGridRhoFastjetCentralNeutral
  tree.RawMET                            tree.luminosityBlock
  tree.SV                                tree.op
  tree.SoftActivityJet                   tree.run
  tree.SoftActivityJetHT

In [1]: anElectron = tree.Electron[0]

In [2]: anElectron.<TAB>
   anElectron.charge                   anElectron.eInvMinusPInv            anElectron.mvaSpring16HZZ_WPL
   anElectron.cleanmask                anElectron.energyErr                anElectron.mvaTTH
   anElectron.convVeto                 anElectron.eta                      anElectron.op
   anElectron.cutBased                 anElectron.hoe                      anElectron.p4
   anElectron.cutBased_HEEP            anElectron.ip3d                     anElectron.pdgId
   anElectron.cutBased_HLTPreSel       anElectron.isPFcand                 anElectron.pfRelIso03_all
   anElectron.deltaEtaSC               anElectron.jet                      anElectron.pfRelIso03_chg
   anElectron.dr03EcalRecHitSumEt      anElectron.lostHits                 anElectron.phi
   anElectron.dr03HcalDepth1TowerSumEt anElectron.mass                     anElectron.photon
   anElectron.dr03TkSumPt              anElectron.miniPFRelIso_all         anElectron.pt
   anElectron.dxy                      anElectron.miniPFRelIso_chg         anElectron.r9
   anElectron.dxyErr                   anElectron.mvaSpring16GP            anElectron.sieie
   anElectron.dz                       anElectron.mvaSpring16GP_WP80       anElectron.sip3d
   anElectron.dzErr                    anElectron.mvaSpring16GP_WP90       anElectron.tightCharge
   anElectron.eCorr                    anElectron.mvaSpring16HZZ           anElectron.vidNestedWPBitmap

For NanoAOD the content of the branches is documented here. More information about the central NanoAOD production campaigns is provided here.

In addition to the branches present in the NanoAOD, the following attributes are added for convenience:

  • p4 if pt, eta, phi, and mass attributes are defined. pt and mass are optional, such that this also works for TrigObj and various kinds of MET.

  • idx for elements of containers

  • for GenPart: parent, which refers to the parent or mother particle (its presence can be tested by comparing its idx to -1), and ancestors, the range of all ancestors (the validity is checked here, so the range may be empty).
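
As an illustration of these generator-level helpers (assuming a simulated NanoAOD sample with a GenPart collection; the top-quark check is just an example):

gp = tree.GenPart[0]
hasParent = gp.parent.idx != -1            # test whether a mother particle exists
fromTop = op.rng_any(gp.ancestors, lambda a: op.abs(a.pdgId) == 6)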

Processing modes

The usual mode of operation is to 1. parse the analysis configuration file, 2. execute some code for every entry in each of the samples, and then 3. perform some actions on the aggregated results (e.g. produce nice-looking plots from the raw histograms). Since the second step is by far the most time-consuming, but can be performed independently for different samples (and even entries), it is modeled as a list of tasks (which may be run in parallel), after which a postprocessing step takes the results and combines them. The latter step can also be run separately, using the results of previously run tasks, assuming these did not change.

More concretely, for e.g. histogram stack plots, the tasks produce histograms while the postprocessing step runs plotIt, so with the --onlypost option the normalization, colors, labels etc. can be changed without reprocessing the samples (some tips on additional postprocessing can be found in this recipe).

The task processing itself can be run in three different modes, depending on the option passed to --distributed:

By default (--distributed=sequential), every sample is processed sequentially. This is useful for short tests or for small tasks that don’t take very long. The execution can be made quicker by running in multithreaded mode using -t.

Alternatively, it is possible to process multiple tasks in parallel using --distributed=parallel. This will first create the processing configuration for every sample, before starting the actual processing. Again, the execution of these tasks can be made to run on multiple threads using -t.

In practice, you will most likely want to submit independent tasks to a computing cluster using a batch scheduler (currently HTCondor and Slurm are supported; this mode is selected with --distributed=driver). Bamboo will submit the jobs, monitor them, and combine the results when they are finished. More information about monitoring and recovering failed batch jobs is given in the corresponding recipe. By default one batch job is submitted for each input sample, unless the sample has a split entry different from one (see above for its precise meaning).

Finally, it is possible to offload the computations to a computing cluster using modern distributed computing tools such as Dask or Spark. This means that Dask/Spark will take care of splitting the input data, launching and monitoring jobs, and retrieving the results. This mode can be activated using the --distrdf-be argument, and can work both with --distributed=sequential (in which case every sample will be processed sequentially by the whole cluster) or with --distributed=parallel (in which case the processing of all the input samples will happen in parallel). More information about how to configure bamboo for Dask/Spark can be found here.
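
Putting the above together, a typical progression could look like the following (module, configuration file, and output names are placeholders; the first command is a quick local check, the second submits to the batch system, and the third redoes only the postprocessing):

bambooRun -m mymodule.py:MyModule analysis.yml -o test1 --maxFiles=1 -t 4
bambooRun -m mymodule.py:MyModule analysis.yml -o prod1 --distributed=driver
bambooRun -m mymodule.py:MyModule analysis.yml -o prod1 --onlypost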

Examples

Some more complete examples, based on open data RDataFrame tutorials, are available in this repository (they can be run on binder without installing anything locally).

The recipes page has a collection of common analysis tasks, with a recommended implementation, and pointers to the relevant helper functions; it may be good to skim through to get an idea of what a typical analysis implementation will look like.