mva

Introduction

protopipe.mva contains utilities to build models for regression or classification problems. It is based on the machine learning methods available in scikit-learn. Internally, tables are handled with the pandas Python module.

A regressor/classifier should be trained for each type of camera. For both types of model, the image estimates are later averaged to determine a global output for the event (energy or score/gammaness).
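The averaging step described above can be sketched with a pandas group-by. This is a minimal illustration, not protopipe's implementation: the table layout and the column names (obs_id, evt_id, reco_energy) are assumptions, and a plain unweighted mean is used.

```python
import pandas as pd

# Hypothetical per-image estimates: one row per telescope image.
# Column names are assumptions made for this sketch.
images = pd.DataFrame(
    {
        "obs_id": [1, 1, 1, 2, 2],
        "evt_id": [10, 10, 11, 10, 10],
        "reco_energy": [1.0, 3.0, 5.0, 2.0, 4.0],
    }
)

# Average the image-level estimates over all images of the same event.
# An event is identified by the (obs_id, evt_id) pair.
event_energy = (
    images.groupby(["obs_id", "evt_id"])["reco_energy"].mean().reset_index()
)
```

The same pattern would apply to a classifier score by replacing the energy column with the score column.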

Details

The TrainModel class uses a training sample composed of gamma rays to build a regression model. To build a classifier, a sample of protons is used in addition to the gamma-ray sample. A model is trained via the GridSearchCV algorithm, which searches for the best hyper-parameters of the model.
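As an illustration of the hyper-parameter search, the following sketch calls scikit-learn's GridSearchCV directly on synthetic data. It does not reproduce TrainModel: the estimator, the parameter grid, the scoring metric and the data are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for image parameters (X) and true energy (y).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.05, size=200)

# Hypothetical grid of hyper-parameters to scan.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6]}

# Every combination in the grid is evaluated with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_    # best combination found in the grid
best_model = search.best_estimator_  # model refitted on the full sample
```

By default GridSearchCV refits the best combination on the whole training sample, so best_model can be used for prediction directly.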

The RegressorDiagnostic and ClassifierDiagnostic classes can be used to generate several diagnostic plots for regression and classification models, respectively.

Proposals for improvements and/or fixes

Note

This section has to be moved to the repository as a set of issues.

  • Improve the split of training/test data. For now the data are split according to the run number, i.e. the training data are the first N% of the runs (sorted by run number) and the test data are the remaining runs. This would be really easy to improve with scikit-learn, but I wanted to keep the information about the evt_id and the obs_id in order to combine the data and produce diagnostic plots at the event level (not implemented yet), which is more complex than what scikit-learn does.

  • Implement event-level diagnostics.

  • To train the energy estimator, the Boosted Decision Tree method is hard-coded.

  • For the diagnostics, in both cases we might want to implement diagnostics at the event level, but for this we need to link the event ID with the observation ID as well as the image parameters in order to split and combine the model output. This needs some thought…
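One possible direction for the first item is a group-aware split, which randomises the runs while keeping obs_id and evt_id in the table. The sketch below uses scikit-learn's GroupShuffleSplit. It is only a suggestion under assumed column names and a toy table, not the current behaviour of split_train_test.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical table: one row per image, tagged with run and event IDs.
data = pd.DataFrame(
    {
        "obs_id": [1, 1, 2, 2, 3, 3, 4, 4],
        "evt_id": [1, 2, 1, 2, 1, 2, 1, 2],
        "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    }
)

# Split whole runs between training and test samples, so that no run
# contributes to both, while every row keeps its obs_id and evt_id.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=0)
train_idx, test_idx = next(splitter.split(data, groups=data["obs_id"]))

train = data.iloc[train_idx]
test = data.iloc[test_idx]
```

Because the identifiers survive the split, the model output on the test sample could later be grouped per event for event-level diagnostics.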

Reference/API

protopipe.mva Package

Classes to build models based on machine learning methods.

Functions

get_evt_model_output(data_dict[, …])

Returns DataStore with reco energy + score/target columns of model at the event level.

get_evt_subarray_model_output(data[, …])

Returns DataStore with keepcols + score/target columns of model at the subarray-event level.

load_obj(name)

Load object in binary

make_cut_list(cuts)

plot_distributions(feature_list, data_list)

Plot feature distributions for several data sets.

plot_hist(ax, data, nbin, limit[, norm, …])

Utility function to plot histogram

plot_profile(ax, data, xcol, ycol, nbin, limit)

Plot profile of a distribution

plot_roc_curve(ax, model_output, y, **kwargs)

Plot ROC curve for a given set of model outputs and labels

prepare_data(ds, cuts[, label])

Add variables in data frame

save_obj(obj, name)

Save object in binary

split_train_test(ds, train_fraction, …)

Classes

BoostedDecisionTreeDiagnostic()

Class producing diagnostic plots for the BDT method

ClassifierDiagnostic(model, …[, …])

Class to plot several diagnostic plots for classification.

ModelDiagnostic(model, feature_name_list, …)

Base class for model diagnostics.

RegressorDiagnostic(model, …)

Class to plot several diagnostic plots for regression

TrainModel(case, feature_name_list[, …])

Train classification or regression model.