Data training

protopipe.scripts.data_training builds tables of image and shower parameters, which are then used to train the energy and particle classification estimators for each camera type.

Note

In the current version of the pipeline, the particle classification model uses the estimate of the particle’s energy as one of its parameters.
When building the training table for that model, you therefore need to set the boolean estimate_energy parameter and give the directory where the energy model is saved via the regressor_dir option.

Invoking the script with the help argument (-h or --help) prints its usage:

usage: data_training.py [-h] --config_file CONFIG_FILE -o OUTFILE
                    [-m MAX_EVENTS] [-i INDIR]
                    [-f [INFILE_LIST [INFILE_LIST ...]]]
                    [--cam_ids [CAM_IDS [CAM_IDS ...]]]
                    [--wave_dir WAVE_DIR] [--wave_temp_dir WAVE_TEMP_DIR]
                    [--wave | --tail] [--debug] [--save_images]
                    [--estimate_energy ESTIMATE_ENERGY]
                    [--regressor_dir REGRESSOR_DIR]

optional arguments:
-h, --help            show this help message and exit
--config_file CONFIG_FILE
-o OUTFILE, --outfile OUTFILE
-m MAX_EVENTS, --max_events MAX_EVENTS
                      maximum number of events considered per file
-i INDIR, --indir INDIR
-f [INFILE_LIST [INFILE_LIST ...]], --infile_list [INFILE_LIST [INFILE_LIST ...]]
                      give a specific list of files to run on
--cam_ids [CAM_IDS [CAM_IDS ...]]
                      give the specific list of camera types to run on
--wave_dir WAVE_DIR   directory where to find mr_filter. if not set look in
                      $PATH
--wave_temp_dir WAVE_TEMP_DIR
                      directory where mr_filter to store the temporary fits
                      files
--wave                if set, use wavelet cleaning -- default
--tail                if set, use tail cleaning, otherwise wavelets
--debug               Print debugging information
--save_images         Save also all images
--estimate_energy ESTIMATE_ENERGY
                      Estimate the events' energy with a regressor from
                      protopipe.scripts.build_model
--regressor_dir REGRESSOR_DIR
                      regressors directory
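
The two-step use described in the note can be sketched by assembling the command lines from the options listed above. This is only an illustration: the file names, the directories, the helper build_command and the value "True" passed to --estimate_energy are assumptions, not taken from the script itself.

```python
import shlex


def build_command(config_file, outfile, indir, cleaning="wave",
                  max_events=None, regressor_dir=None):
    """Assemble a data_training.py argument list from the documented options."""
    if cleaning not in ("wave", "tail"):
        # --wave and --tail are mutually exclusive in the usage string.
        raise ValueError("cleaning must be 'wave' or 'tail'")
    cmd = ["data_training.py", "--config_file", config_file,
           "-o", outfile, "-i", indir, f"--{cleaning}"]
    if max_events is not None:
        cmd += ["-m", str(max_events)]
    if regressor_dir is not None:
        # Assumption: "True" is an accepted value for ESTIMATE_ENERGY.
        cmd += ["--estimate_energy", "True", "--regressor_dir", regressor_dir]
    return cmd


# Step 1: table used to train the energy regressor (hypothetical paths).
energy_cmd = build_command("analysis.yaml", "energy_table.h5", "./gammas")

# Step 2: table used to train the classifier; the energy is estimated with
# the regressors saved in a hypothetical ./energy_models directory.
classifier_cmd = build_command("analysis.yaml", "classifier_table.h5",
                               "./gammas", regressor_dir="./energy_models")

print(shlex.join(energy_cmd))
print(shlex.join(classifier_cmd))
```

Each list can be handed to subprocess.run, or the printed line can be pasted into a shell. How data_training.py is launched on a given installation may differ.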

The configuration file used by this script is analysis.yaml:

# General information
# NOTE: only Prod3b simulations are currently supported.
General:
 config_name: 'v0.4.0_dev1'
 site: 'north'  # 'north' or 'south'
 # array can be either
 # - 'subarray_LSTs', 'subarray_MSTs', 'subarray_SSTs' or 'full_array'
 # - a custom list of telescope IDs
 # WARNING: for simulations containing multiple copies of the telescopes,
 # only 'full_array' or custom list are supported options!
 array: full_array
 cam_id_list : ['LSTCam', 'NectarCam'] # List of camera IDs to be used

# Cleaning for reconstruction
ImageCleaning:

 # Cleaning for reconstruction
 biggest:
  tail:  #
   thresholds:  # picture, boundary
    - LSTCam: [6.61, 3.30]  # TBC
    - NectarCam: [5.75, 2.88]  # TBC
    - FlashCam: [4,2] # dummy values for reliable unit-testing
    - ASTRICam: [4,2] # dummy values for reliable unit-testing
    - DigiCam: [0,0] # values left unset for future studies
    - CHEC: [0,0] # values left unset for future studies
    - SCTCam: [0,0] # values left unset for future studies
   keep_isolated_pixels: False
   min_number_picture_neighbors: 1

  wave:
   # Directory to write temporary files
   #tmp_files_directory: '/dev/shm/'
   tmp_files_directory: './'
   options:
    LSTCam:
     type_of_filtering: 'hard_filtering'
     filter_thresholds: [3, 0.2]
     last_scale_treatment: 'drop'
     kill_isolated_pixels: True
     detect_only_positive_structures: False
     clusters_threshold: 0
    NectarCam:  # TBC
     type_of_filtering: 'hard_filtering'
     filter_thresholds: [3, 0.2]
     last_scale_treatment: 'drop'
     kill_isolated_pixels: True
     detect_only_positive_structures: False
     clusters_threshold: 0

 # Cleaning for energy/score estimation
 extended:
  tail:  #
   thresholds:  # picture, boundary
    - LSTCam: [6.61, 3.30]  # TBC
    - NectarCam: [5.75, 2.88]  # TBC
    - FlashCam: [4,2] # dummy values for reliable unit-testing
    - ASTRICam: [4,2] # dummy values for reliable unit-testing
    - DigiCam: [0,0] # values left unset for future studies
    - CHEC: [0,0] # values left unset for future studies
    - SCTCam: [0,0] # values left unset for future studies
   keep_isolated_pixels: False
   min_number_picture_neighbors: 1

  wave:
   # Directory to write temporary files
   #tmp_files_directory: '/dev/shm/'
   tmp_files_directory: './'
   options:
    LSTCam:
     type_of_filtering: 'hard_filtering'
     filter_thresholds: [3, 0.2]
     last_scale_treatment: 'posmask'
     kill_isolated_pixels: True
     detect_only_positive_structures: False
     clusters_threshold: 0
    NectarCam:  # TBC
     type_of_filtering: 'hard_filtering'
     filter_thresholds: [3, 0.2]
     last_scale_treatment: 'posmask'
     kill_isolated_pixels: True
     detect_only_positive_structures: False
     clusters_threshold: 0

# Cut for image selection
ImageSelection:
 charge: [50., 1e10]
 pixel: [3, 1e10]
 ellipticity: [0.1, 0.6]
 nominal_distance: [0., 0.8]  # in camera radius

# Minimal number of telescopes to consider events
Reconstruction:
 min_tel: 2

# Parameters for energy estimation
EnergyRegressor:
 # Name of the regression method (e.g. AdaBoostRegressor, etc.)
 method_name: 'AdaBoostRegressor'

# Parameters for g/h separation
GammaHadronClassifier:
 # Name of the classification method (e.g. RandomForestClassifier, etc.)
 method_name: 'RandomForestClassifier'
 # Use probability output or score
 use_proba: True
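
To make the role of some of these settings more concrete, the sketch below shows one way the tail thresholds, the ImageSelection ranges and min_tel could be read and applied. It is not protopipe's implementation. The config dict mirrors what a YAML parser would return for part of the listing, assuming the nesting ImageCleaning / biggest / tail / thresholds. The helper names, the example images, the inclusive range bounds and the reading of pixel as a pixel count are all assumptions made for illustration.

```python
# Part of the analysis.yaml listing, written as the nested dict/list
# structure a YAML parser would produce (nesting assumed as described).
config = {
    "ImageCleaning": {"biggest": {"tail": {"thresholds": [
        {"LSTCam": [6.61, 3.30]},
        {"NectarCam": [5.75, 2.88]},
    ]}}},
    "ImageSelection": {
        "charge": [50.0, 1e10],
        "pixel": [3, 1e10],
        "ellipticity": [0.1, 0.6],
        "nominal_distance": [0.0, 0.8],
    },
    "Reconstruction": {"min_tel": 2},
}


def tail_thresholds(cfg, mode="biggest"):
    """Merge the list of one-key mappings into {cam_id: (picture, boundary)}."""
    merged = {}
    for entry in cfg["ImageCleaning"][mode]["tail"]["thresholds"]:
        for cam_id, (picture, boundary) in entry.items():
            merged[cam_id] = (picture, boundary)
    return merged


def passes_cuts(image, cuts):
    """True if every cut quantity lies within its [min, max] range.

    Bounds are treated as inclusive here, which is an assumption.
    """
    return all(low <= image[name] <= high
               for name, (low, high) in cuts.items())


def select_event(images, cfg):
    """Keep the images passing the cuts; drop the event below min_tel."""
    good = [img for img in images if passes_cuts(img, cfg["ImageSelection"])]
    return good if len(good) >= cfg["Reconstruction"]["min_tel"] else []


# Hypothetical per-telescope image parameters for one event.
images = [
    {"charge": 120.0, "pixel": 8, "ellipticity": 0.3, "nominal_distance": 0.4},
    {"charge": 30.0, "pixel": 5, "ellipticity": 0.2, "nominal_distance": 0.1},
    {"charge": 900.0, "pixel": 20, "ellipticity": 0.5, "nominal_distance": 0.7},
]

selected = select_event(images, config)
print(tail_thresholds(config)["LSTCam"])  # (picture, boundary) for LSTCam
print(len(selected))                      # images kept for this event
```

In this example the second image fails the charge cut, so two images remain and the event still satisfies min_tel: 2.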