`professor.data`¶

This sub-package contains the classes used to load data from disk and to restructure it for tuning.

The loading of files from disk is done in DataProxy (or RivetDataProxy). Files are read only when necessary and are then cached. A TuneData object holds all information necessary for a single tune or for comparing MC data (held in MCData) with reference data. In essence, TuneData is a collection of BinProps objects, one for each bin of the requested observables.

The data flow looks like this:

+-----+   +------+  +----+
| ref |   | ipol |  | mc |  (files on disk)
+-----+   +------+  +----+
   |       |          |
   |    +---------+ +--------+
   |    | IpolSet | | MCData |
   |    +---------+ +--------+
   |       |          |
+-----------+         |
| DataProxy |---------+
+-----------+
   |
+----------+
| TuneData |      (list of BinProps)
+----------+
   |
+------------+
| GoF object |    (this compares MC/interpolation with reference)
+------------+

class professor.data.DataProxy¶

Bases: object

Central object for loading data from the file system.

Three types of data are handled:

Reference data :: TODO
MC data :: Different types of MC data can be stored. The MC data is stored in a dict {type-ID => MCData}. Type-IDs are, for example, ‘sample’ or ‘scan’.
Interpolations :: TODO

See also

MCData: Abstraction of a MC data subdirectory.

Methods

addMCData(mcdata, datatype)¶

Add MC data of given data type.

Add a MC data interface to the internal storage dictionary. If an entry for datatype already exists it will be overwritten!

Parameters :

mcdata : MCData (or subclass)

The MC data to add.

datatype : str

The MC data type, e.g. ‘sample’ or ‘scan’ or ‘tunes’.

Raises :

TypeError :

If mcdata has wrong type.

static getBinID(histo, ibin)¶

Get a canonical bin ID of the form Analysis/HistoID:BinIdx.

Parameters :

histo : Histo

Histogram.

ibin : int

Bin index.

static getBinIndex(binid)¶

Get the bin index from a canonical bin ID.

Parameters :

binid : str

The bin ID.

Returns :

index : int

getInterpolationSet(ipolcls, runs)¶

Get an InterpolationSet.

This is loaded from disk on-the-fly.

Parameters :

ipolcls : class

The interpolation method class.

runs : list, str

The runs that are used as anchor points for the interpolation. Can be a list of strings or a single string of colon-separated run keys.
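
The two accepted forms of runs can be reduced to one list before further use. The following is a minimal sketch of that idea; normalize_runs is a hypothetical helper written for illustration and is not part of the professor API.

```python
def normalize_runs(runs):
    """Return a list of run keys from a list or a colon-separated string.

    Hypothetical helper for illustration; not part of professor.data.
    """
    if isinstance(runs, str):
        runs = runs.split(":")
    # Drop empty keys, e.g. those produced by a trailing colon.
    return [run for run in runs if run]


print(normalize_runs("000:001:002"))
print(normalize_runs(["000", "001"]))
```

Both calls yield a plain list of strings, so downstream code only has to handle one form.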

getIpolFilePath(ipolcls, runs, output=False)¶

Return the canonical path for an interpolation pickle.

Parameters :

ipolcls : class

The interpolation method class. Must have a ‘method’ attribute.

runs : list, str

The runs that are used as anchor points for the interpolation. Can be a list of strings or a single string of colon-separated run keys.

getIpolPath()¶

getMCData(datatype='sample', checkAIDA=True)¶

Get MC data of the given type.

Parameters :

datatype : str, optional

The MC data type, e.g. ‘sample’ or ‘scan’ (default is ‘sample’).

Returns :

mcdata : MCData

The MC data of type datatype.

Raises :

DataProxyError :

If no MC data of type datatype is available.
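
The behaviour described for addMCData and getMCData (a dict keyed by type-ID, overwrite on add, error on a missing type) can be modelled in a few lines. This is only a sketch under assumptions: MCStore is a hypothetical stand-in for the relevant part of DataProxy, and DataProxyError is redefined locally so that the example is self-contained.

```python
class DataProxyError(Exception):
    """Local stand-in for the error type named above (sketch only)."""


class MCStore:
    """Hypothetical minimal model of the {type-ID => MCData} storage."""

    def __init__(self):
        self._mcdata = {}

    def add(self, mcdata, datatype):
        # An existing entry for `datatype` is replaced without warning.
        self._mcdata[datatype] = mcdata

    def get(self, datatype="sample"):
        try:
            return self._mcdata[datatype]
        except KeyError:
            raise DataProxyError("no MC data of type %r available" % datatype)


store = MCStore()
store.add("first", "sample")
store.add("second", "sample")  # overwrites the previous 'sample' entry
print(store.get())
```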

getOutputPath()¶

static getPathsFromCLOptions(opts)¶

Return a dict with the data paths specified on command line.

The dictionary has the following 4 keys:

mc

ref

scan

ipol

Each value can be None, meaning that the respective command-line option is not available or that a value could not be constructed, e.g. from DATADIR/mc.
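
A dictionary of this shape could be assembled as sketched below. The function name, its arguments and the fallback rule (an explicit path wins, otherwise DATADIR/key is used when a data directory is known) are assumptions made for illustration; the actual option handling in Professor may differ.

```python
import os


def paths_from_options(datadir=None, mc=None, ref=None, scan=None, ipol=None):
    """Build the four-key path dict described above.

    Hypothetical sketch: argument names and the fallback rule are assumptions.
    """
    explicit = {"mc": mc, "ref": ref, "scan": scan, "ipol": ipol}
    paths = {}
    for key, value in explicit.items():
        if value is None and datadir is not None:
            # Fall back to a sub-directory of the data directory.
            value = os.path.join(datadir, key)
        paths[key] = value  # stays None if nothing could be constructed
    return paths


print(paths_from_options(datadir="data", ref="/elsewhere/ref"))
```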

getRefData()¶

Get the dictionary of all loaded reference histograms, indexed by histogram path.

Returns :

refhistos : dict(path => histo.Histo)

The reference histograms.

getRefHisto(histopath)¶

Get a reference histogram.

Parameters :

histopath : str

A histogram path of the form ‘/Analysis/HistoID’.

Returns :

histogram : histo.Histo

The reference histogram.

Raises :

DataProxyError :

If self.refpath is not set.

KeyError :

If histopath is not available.

getRefPath()¶

getTuneData(withref=True, withmc=None, useipol=None, useruns=None, useobs=None)¶

Return a TuneData object with the desired data.

The kind of data that is given to TuneData can be steered via the (optional) flags. Depending on the kind of computation (calculating interpolation coefficients/minimising/...) different kinds of data must be turned on.

This is the central data preparation function.

Parameters :

withref : bool, optional

Equip TuneData with reference data (the default is True).

withmc : {str, None}, optional

If not None, the type of MC data that is stored in the TuneData, e.g. ‘sample’. The default is None.

useipol : {interpolation_class, None}, optional

If not None, the interpolation method class used for the per-bin interpolations. Only the method attribute is important because this is used to construct the file name of the pickle file.

useruns : {list of str, None}, optional

The run numbers used for interpolation. Can be None if withmc is given. In this case, all available MC runs are used.

useobs : {list of str, None}, optional

The observables to use. Can be None if withmc is given. In this case, all available observables in the MC data are used.

ipolpath¶: Base directory for interpolation set files

listInterpolationSets()¶

Return a list of all InterpolationSets in the ipol directory.

Raises :

DataProxyError :

If self.ipolpath is not set.

classmethod mkFromCLOptions(opts, checkAIDA=True)¶

Build DataProxy from CL options that were prepared with addDataCLOptions.

Only those paths are set in the returned DataProxy for which the parser has a corresponding option.

See also

addDataCLOptions: Add a data location command-line option group to an OptionParser.
getPathsFromCLOptions: Get a dict of data-location paths from command line options.

outdir¶: Base directory for output

refpath¶: Base directory for reference data files

setDataPath(base)¶

Set data location paths rooted at base.

Sets the data location paths for reference data (base/ref), MC sample (base/mc) and interpolation storage (base/ipol/).

Parameters :

base : str

Base path for data locations.

setIpolPath(path)¶

setMCPath(path, datatype='sample', checkAIDA=True)¶

Add MC data of given type rooted at path.

Parameters :

path : str

Base directory of the MC data.

datatype : str, optional

The type identifier of the MC data, e.g. ‘sample’ or ‘linescan’. The default is ‘sample’.

Raises :

IOTestFailed :

If path is not a readable directory.

setOutputPath(path)¶

setRefPath(path)¶

static splitBinID(binid)¶

Split a bin ID into observable and bin index.

Parameters :

binid : str

The bin ID.

Returns :

observable : str

index : int
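
The bin ID convention used by getBinID, getBinIndex and splitBinID (observable path, a colon, then the bin index) can be illustrated with two small functions. These are hypothetical helpers that only mirror the documented format; they are not the library's implementation.

```python
def make_bin_id(histopath, ibin):
    """Join a histogram path and a bin index as 'path:index'."""
    return "%s:%d" % (histopath, ibin)


def split_bin_id(binid):
    """Split 'path:index' into (path, integer index)."""
    # Split on the last colon so the observable path is kept whole.
    observable, index = binid.rsplit(":", 1)
    return observable, int(index)


binid = make_bin_id("/Analysis/HistoID", 3)
print(binid)
print(split_bin_id(binid))
```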

class professor.data.proxy.RivetDataProxy¶

Bases: professor.data.proxy.DataProxy

Data proxy that loads the reference data from the files distributed with Rivet.

Methods

getRefData()¶

getRefHisto(histopath)¶

Get a reference histogram.

Parameters :

histopath : str

A histogram path of the form ‘/Analysis/HistoID’.

Returns :

histogram : histo.Histo

The reference histogram.

Raises :

KeyError :

If histopath is not available.

classmethod mkFromCLOptions(opts, checkAIDA=True)¶

Build DataProxy from CL options that were prepared with addDataCLOptions.

Only those paths are set in the returned DataProxy for which the parser has a corresponding option.

See also

addDataCLOptions: Add a data location command-line option group to an OptionParser.
getPathsFromCLOptions: Get a dict of data-location paths from command line options.

class professor.data.MCData(base, checkAIDA=True)¶

Bases: object

Interface for a directory with MC generated data.

MCData abstracts a directory with MC generated data with a layout following:

base/run1
     run2
     ...

Data is read from the filesystem only if necessary.

Variables:

basepath – Directory path within which all runs are located (typically basepath/mc).

availableruns – List of valid run names, based on a scan of valid run dirs found in basepath.

Methods

availablehistos¶: The available histogram names (sorted).

getAvailableObservables(filtered=True)¶

Get a sorted list with the available observables.

The observables are taken from the first available MC run data.

By default only the observables containing valid numerical data (i.e. no NaN’s) are returned.

Parameters :

filtered : bool, optional

Return only histograms that contain valid (i.e. not NaN) data (default). If set to False all available observables are returned.
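
The effect of the filtered flag can be sketched as follows. The data layout (a dict mapping observable names to lists of bin values) is an assumption chosen to keep the example self-contained; the real method works on Histo objects.

```python
import math


def available_observables(histos, filtered=True):
    """Return a sorted list of observable names.

    `histos` is assumed to map observable names to lists of bin values.
    With filtered=True, observables containing any NaN are left out.
    """
    names = []
    for name, values in histos.items():
        if filtered and any(math.isnan(v) for v in values):
            continue
        names.append(name)
    return sorted(names)


histos = {"/A/pt": [1.0, 2.0], "/A/eta": [float("nan"), 0.5]}
print(available_observables(histos))
print(available_observables(histos, filtered=False))
```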

getParameterBounds(runs=None)¶

Get the extremal parameter bounds of runs.

Returns :	bounds : ParameterRange
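
Extremal bounds of this kind are the per-parameter minimum and maximum over the selected runs. A minimal sketch, assuming the run parameters are available as plain dicts and using a dict of (low, high) tuples in place of ParameterRange:

```python
def parameter_bounds(run_params):
    """Return {name: (low, high)} for a {run => {name => value}} mapping.

    Plain dicts and tuples stand in for the ParameterPoint and
    ParameterRange types, which are not reproduced here.
    """
    bounds = {}
    for params in run_params.values():
        for name, value in params.items():
            low, high = bounds.get(name, (value, value))
            bounds[name] = (min(low, value), max(high, value))
    return bounds


runs = {
    "000": {"PARJ(21)": 0.3, "PARJ(41)": 0.8},
    "001": {"PARJ(21)": 0.5, "PARJ(41)": 0.2},
}
print(parameter_bounds(runs))
```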

getParameterCmp(run=None)¶

getParameterNames()¶

getRunHistos(run, filtered=False)¶

Return the {obsname => Histo} dict for given run.

Parameters :

run : str

Run ID.

filtered : bool, optional

Return only histograms that contain valid (i.e. not NaN) data. By default all histograms are returned (for the sake of speed).

Returns :

histograms : dict

Dictionary that maps histogram paths to Histo instances.

getRunParams(run, retall=False)¶

Get the run parameters.

Parameters :

run : str

Run ID.

Returns :

parameters : ParameterPoint

The parameter values.

getScanParam(run)¶

isValidRunDir(runid=None, runpath=None, checkAIDA=True)¶

Check that the run directory is valid.

Checks for an out.aida file and a used_params file.

The run can be specified by runid or runpath.

Parameters :

runid : str

The ID of the run, i.e. the subdirectory name.

runpath : str

The full path to the run directory. This is used in the ManualMCData class.
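
The validity check and the resulting scan for available runs could look like the sketch below. The function names are hypothetical, and it is an assumption that checkAIDA=False simply skips the test for out.aida.

```python
import os


def is_valid_run_dir(runpath, check_aida=True):
    """Return True if `runpath` looks like a usable run directory."""
    if not os.path.isdir(runpath):
        return False
    if not os.path.isfile(os.path.join(runpath, "used_params")):
        return False
    # Assumption: the out.aida test is skipped when check_aida is False.
    if check_aida and not os.path.isfile(os.path.join(runpath, "out.aida")):
        return False
    return True


def available_runs(base, check_aida=True):
    """Return the sorted names of all valid run sub-directories of `base`."""
    return sorted(
        name
        for name in os.listdir(base)
        if is_valid_run_dir(os.path.join(base, name), check_aida)
    )
```

Calling available_runs on a directory laid out as base/run1, base/run2, ... returns only those runs that pass the check.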

loadAllRuns(loadhistos=True)¶

Load the data for all available runs.

Parameters :

loadhistos : bool, optional

Turn loading histogram data on (default) or off.

See also

loadRun: Load a single run.
loadAllThreaded: Load all runs threaded, useful if IO lags are huge, e.g. with network file storage.

loadAllThreaded(loadhistos=True, numthreads=8)¶

Load the data for all available runs (multi-threaded).

This is only useful if IO lags are huge. Otherwise the Python thread overhead makes this more time-consuming than loadAllRuns.

Parameters :

loadhistos : bool, optional

Turn loading histogram data on (default) or off.

numthreads : int, optional

Number of threads (default: 8).

See also

loadRun: Load a single run.
loadAllRuns: Load all runs sequentially.
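
The idea behind threaded loading, overlapping the IO waits of many runs, can be sketched with the standard library thread pool. This is not the implementation used by MCData; load_all_threaded and the load_run callable are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all_threaded(runs, load_run, numthreads=8):
    """Call load_run(run) for every run using a pool of worker threads.

    Returns a {run => result} dict. Threads only help when load_run
    spends most of its time waiting on IO.
    """
    with ThreadPoolExecutor(max_workers=numthreads) as pool:
        # map() keeps the input order; list() waits for all results
        # and re-raises the first exception, if any.
        results = list(pool.map(load_run, runs))
    return dict(zip(runs, results))


print(load_all_threaded(["000", "001"], lambda run: "loaded " + run, numthreads=2))
```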

loadRun(run, loadhistos=True)¶

Load data for a run.

Parameters :

run : str

The run identifier to load.

loadhistos : bool, optional

Turn loading histogram data on (default) or off.

loadedruns¶: The currently loaded run numbers (sorted).

class professor.data.ManualMCData(runpathmap=None)¶

Bases: professor.data.mcdata.MCData

addRunPath(runid, path)¶

availableruns¶

getParameterCmp(runid=None)¶

loadRun(runid, loadhistos=True)¶: Load data for run.

class professor.data.TuneData(dataproxy, withref=True, withmc=None, useipol=None, useruns=None, useobs=None)¶

Bases: dict

Container for data for one choice of runs.

The bin IDs (e.g. /Path/To/Obs:index) are mapped to BinProps instances.

Attributes

runnums : list

Sorted list of run identifiers.

hasref, hasmc, hasipol : bool

Flags that are True if the object contains that type of data.

paramranges : ParameterRange, None

The range of parameters spanned by the used MC runs. Only available if MC or ipol data was included.

Make a TuneData object with the desired data.

The kind of data that is given to TuneData can be steered via the (optional) flags. Depending on the kind of computation (calculating interpolation coefficients/minimising/...) different kinds of data must be turned on.

This is the central data preparation function.

Parameters :

withref : bool, optional

Equip TuneData with reference data (the default is True).

withmc : str, optional

Equip TuneData with MC data of the given type, e.g. ‘sample’. Use None to disable storing MC data. This is the default.

useipol : class, optional

The interpolation method given by the class or None (=> no interpolation data is loaded). None is the default.

useruns : list of str

List of MC run numbers to use or None (=> use all runs from the MC data given with withmc).

useobs : list of str

List of observables to use or None (=> use all observables from the MC data given with withmc).

Raises :

ArgumentError :

If run numbers (if needed) or observables are not specified and cannot be guessed.

Methods

applyObservableWeights(weightmanager)¶

Set the bin weights.

Parameters :	weightmanager : WeightManager

filteredValues()¶: Return an iterator over the bin properties, excluding vetoed and zero-weighted bins.

getBinIDs(obs)¶: List of all bin IDs for observable ‘obs’.

getBinProps(obs)¶: List of all BinProps for observable ‘obs’.

getInterpolationHisto(observable, params)¶

Interpolation-prediction for observable at params.

Parameters :

observable : str

Path of the observable.

params : MinimizationResult, ParameterPoint, dict

The values of MC model parameters where the interpolation is evaluated.

Returns :

histogram : lighthisto.Histo

Interpolated histogram.

getObservables()¶

ipolmethod¶

The interpolation method.

Returns the interpolation method of the first bin. It is assumed that all bin properties use the same interpolation method.

Raises :

DataProxyError :

If no interpolations were stored.

numParams()¶

observables¶

vetoEmptyErrors()¶

Veto bins with zero reference error.

TODO: This is a nasty hack for identifying bins that are broken for some reason. We should get rid of it!
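
Because TuneData is a dict from bin IDs to BinProps, methods such as filteredValues and getBinIDs amount to simple filters over that dict. The sketch below shows the idea with stand-in objects; the function names are hypothetical and only the weight and veto attributes are modelled.

```python
from types import SimpleNamespace


def filtered_values(tunedata):
    """Yield the bin properties that are neither vetoed nor zero-weighted."""
    for binprops in tunedata.values():
        if not binprops.veto and binprops.weight != 0:
            yield binprops


def bin_ids(tunedata, obs):
    """Return the bin IDs of observable `obs`, ordered by bin index."""
    matching = [b for b in tunedata if b.rsplit(":", 1)[0] == obs]
    return sorted(matching, key=lambda b: int(b.rsplit(":", 1)[1]))


# Stand-in objects carrying only the two attributes the filter needs.
tunedata = {
    "/A/obs:0": SimpleNamespace(weight=1.0, veto=False),
    "/A/obs:1": SimpleNamespace(weight=0.0, veto=False),
    "/A/obs:2": SimpleNamespace(weight=2.0, veto=True),
    "/A/other:0": SimpleNamespace(weight=1.0, veto=False),
}
kept = list(filtered_values(tunedata))
print(len(kept), bin_ids(tunedata, "/A/obs"))
```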

class professor.data.BinProps(refbin, mcdict, ipol, **kwargs)¶

Bases: object

Container for all data related to a bin needed to do a minimisation.

A container for all the variants on a distribution bin: its weight, its reference value and errors, a collection of its simulated equivalents from a set of MC runs, and an interpolation function for that bin, based on optimising the fit to a sampling of MC points in the parameter space.

At the moment the following is stored:

Attributes

refbin : lighthisto.Bin

The reference bin.

mcdict : dict {str => lighthisto.Bin}

Map of run numbers to MC bins.

ipol

The interpolation for this bin.

veto : bool

Flag for vetoing this bin in the GoF calculation.

weight : float, sqrtweight : float

The weight of this bin in the GoF calculation.

binid : str

The bin ID of this bin, of the form ‘/Analysis/Observable:BinIndex’.

Methods

binid¶

getBinCenter()¶

getProperty(propname)¶: Get a generic name=value bin property. Return None if not found.

ipol¶

mcdict¶

refbin¶

setProperties(propdict)¶: Set a dictionary of generic name=value bin properties.

setProperty(propname, propvalue)¶: Set a generic name=value bin property.

setSqrtWeight(sw)¶

setWeight(w)¶

sqrtweight¶

veto¶

weight¶

class professor.data.WeightManager¶

Bases: object

This simple object loads observable weight/property files and stores a dictionary of observable:Weight pairs.

Methods

addBinRangeWeight(observable, binrange=(-inf, inf), weight=1.0, **kwargs)¶

Set the weights for bins of ‘observable’ in ‘binrange’.

Parameters :

observable : str

Path of the observable.

binrange : tuple of floats

The x-value bin range.

weight : float

Weight for the bins.

kwargs : dict

Extra named arguments, stored as Weight properties with those names.

getWeight(obs, obsvalue=None)¶

loadWeightsFile(wfile)¶

classmethod mkFromFile(path)¶

observables¶

Indexing operator for weight lookup. Also useable as wm[“obsname”].

If obsvalue is not given or is None, this function returns a Weight object, or None if no observable matches the obs string. If obsvalue is given, it returns the numerical weight for that observable value, obtained via the Weight.getWeight method.

class professor.data.Weight(obs)¶

Bases: object

A simple object that holds a dict with binrange:weight pairs.

The weights have been extended to be general bin properties since the first design, and a reworking of this class design is probably overdue.

Methods

binRanges()¶

getProperties(bincenter)¶: Get the properties for a given observable value, excluding the “weight” property.

getWeight(bincenter)¶: Evaluate the weight for bincenter by iterating over the binrange:weight definitions. If bincenter is outside all ranges, return 0.

setProperties(binrange, *args, **kwargs)¶: Set several properties at once, by supplying either a single dict object or via keyword arguments.

setProperty(binrange, propname, propvalue)¶: Set a property for the given bin range. The numerical weight is the most common property, for which the propname is “weight”.

setWeight(bincenter, weight)¶: Set the bin range weight property.
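
The lookup described for getWeight (walk the bin ranges, return the matching weight, fall back to 0) is sketched below. The list-of-tuples layout and the inclusive range boundaries are assumptions made for illustration; the class itself stores its ranges differently.

```python
def get_weight(range_weights, bincenter):
    """Return the weight of the first range that contains `bincenter`.

    `range_weights` is assumed to be a list of ((low, high), weight)
    entries; both boundaries are treated as inclusive here.
    """
    for (low, high), weight in range_weights:
        if low <= bincenter <= high:
            return weight
    # Outside all defined ranges.
    return 0.0


weights = [((0.0, 10.0), 2.0), ((10.0, float("inf")), 0.5)]
print(get_weight(weights, 5.0))
print(get_weight(weights, -1.0))
```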

professor.data.addDataCLOptions(parser, ref=False, mc=False, ipol=False, scan=False)¶

Add data location options to command-line option parser.

Use the flags ref, mc, ... to include data locations as needed. Set only those flags to True that are actually needed by a script to keep the CL interface clean.

See also

DataProxy.mkFromCLOptions: Build a DataProxy instance from command line options.
DataProxy.getPathsFromCLOptions: Get a dict of data-location paths from command line options.
addRunCombsCLOptions: Add the standard CL option for loading lists of run combinations.

professor.data.addRunCombsCLOptions(parser)¶

Add run combination options to command-line option parser.

See also

addDataCLOptions: Add standard CL options for data, MC, ipol, etc. directories.
