Grid job configuration, submission and control tool (Rome-centric)
Tools to configure, submit and monitor Grid jobs for CW analysis
Introduction
This tutorial deals with the configuration and running of Grid jobs for the "new" (2013) Frequency Hough (HFDF) analysis and for the targeted searches with the Mock Data Challenge (MDC) data. The "old" (2012) Hough case is not covered here.
The Grid tool is in SVN: vgridtools. On the Rome farm it has been downloaded to /opt/exp_soft/virgo/vgridtools.
The package consists of 4 files:
- vgridtools/trunk/tools/jdl_config_tools.py: JDL configuration tool
- vgridtools/trunk/tools/systools.py: system tools used by jdl_config_tools.py
- vgridtools/trunk/tools/submit_jobs.sh: job submission tool
- vgridtools/trunk/tools/check-jobs: tool to check jobs and retrieve the output of completed jobs
The path to the package installed in Rome is already included in the bash PATH (in /etc/bashrc) of virgo-wn1,2,3. Therefore the commands are accessible from any folder by simply typing, e.g., submit_jobs.sh.
JDL Configuration tool
The functions to build the job JDLs are in jdl_config_tools.py. The tool basically consists of two kinds of functions:
- create_<analysisname>_jdl: creates the corresponding set of JDLs from a list of elements.
- check_missing_<analysisname>_output: based on the list of created jobs and by parsing the output files, checks for missing output and re-creates the list.
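As a rough illustration of the first kind of function: it loops over the input list and writes one JDL per element. The sketch below is hypothetical (the function name, the executable and the JDL attributes are assumptions; only the cwd/<DATE> output folder reflects the tool's behaviour):

import os
import datetime

# Hypothetical JDL template using standard gLite JDL attributes;
# the attributes the real tool writes may differ.
JDL_TEMPLATE = """Executable = "{executable}";
Arguments = "{args}";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {{"std.out", "std.err"}};
"""

def create_example_jdl(elements, executable="run_analysis.sh"):
    """Sketch: write one JDL per element of the input list into cwd/<DATE>."""
    date = datetime.date.today().strftime("%Y%m%d")
    outdir = os.path.join(os.getcwd(), date)
    os.makedirs(outdir, exist_ok=True)
    for name in elements:
        with open(os.path.join(outdir, "example_%s.jdl" % name), "w") as f:
            f.write(JDL_TEMPLATE.format(executable=executable, args=name))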
The simplest example is create_mdc_analysis_jdl (and create_mdc_band_extraction_jdl), in which the list is the injected pulsars (the list is available in /opt/exp_soft/virgo/CW/MDC/sources/mdc_infopulsar.txt).
To run the function:

ipython
import jdl_config_tools
jdl_config_tools.create_mdc_analysis_jdl()

(see the source file for the function arguments).
It creates "normal" jobs. For the analysis jobs, the JDL name has the form:
MDC_analysis_<pulsar_name>.jdl
and the output is currently stored (in Rome) in:
/storage/pss/ligo_h/mdc/output/gd_<pulsar_name>_date.mat
By default the JDLs are stored in the folder:
cwd/<DATE>
To check the missing data for the current analysis (MDC LIGO Hanford) do:

ipython
import jdl_config_tools
missingData = jdl_config_tools.check_missing_mdc_analysis_output(retry=1, outputdir="/storage/pss/ligo_h/mdc/output/", date=<date_of_analysis>)
The function creates the list of missing JDLs (MDC_analysis_list_1.txt), to be used with submit_jobs.sh. The file name contains the retry count.
Note that the date must be the date of the "original" analysis, otherwise the listed JDLs won't be found!
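Schematically, the check can be pictured as follows (a simplified hypothetical sketch, not the actual code; the output-file pattern, list-file name and retry count follow the conventions described above):

import glob
import os

def check_missing_example(pulsars, outputdir, date, retry=1):
    """Sketch: find pulsars with no gd_<pulsar>_*.mat output and list
    the corresponding JDLs in MDC_analysis_list_<retry>.txt."""
    missing = []
    for pulsar in pulsars:
        if not glob.glob(os.path.join(outputdir, "gd_%s_*.mat" % pulsar)):
            missing.append(pulsar)
    with open("MDC_analysis_list_%d.txt" % retry, "w") as f:
        for pulsar in missing:
            # the JDLs live in the folder of the *original* analysis date
            f.write(os.path.join(date, "MDC_analysis_%s.jdl" % pulsar) + "\n")
    return missing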
The Hough job functions (e.g. create_hfdf_jdl) instead require a list of (frequency_band, (list_of_sky_bands)) pairs. The function creates one "parametric" job per frequency band, in which the parameter array contains the sky bands for that frequency. The input list is filled automatically by reading the input data set, which is created by the Snag function HFDF_PREPJOB.m.
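The input list can be pictured as follows (a hypothetical illustration; the frequency values are made up):

# One entry per frequency band, each carrying the sky bands to be
# analysed at that frequency; one parametric job is created per entry.
input_list = [
    (64.0,  [0, 1, 2, 3]),   # (frequency_band, list_of_sky_bands)
    (64.25, [0, 1, 2]),
]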
To run the function, e.g. for VSR2:

ipython
import jdl_config_tools
jdl_config_tools.create_hfdf_jdl(run="VSR2")

(see the source file for the function arguments).
The JDL set is created by default in the folder (for VSR2, 256Hz analysis):
cwd/VSR2_256Hz-hp20/<DATE>/
and the file name is:
hfdf_<retry_count>_VSR2_01_<frequency>_<parameter_count>.jdl
where retry_count is one of the function arguments and parameter_count is used to split very long parameter arrays into bunches of 500 elements.
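The splitting can be sketched as follows (hypothetical code; only the 500-element limit and the file-name pattern come from the tool as described above):

def split_parameters(frequency, sky_bands, run="VSR2", retry=0, bunch=500):
    """Sketch: split a long sky-band array into bunches of at most 500
    elements, yielding one (jdl_name, parameters) pair per parametric job."""
    for count, start in enumerate(range(0, len(sky_bands), bunch)):
        name = "hfdf_%d_%s_01_%s_%d.jdl" % (retry, run, frequency, count)
        yield name, sky_bands[start:start + bunch]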
The output data is written by default in the folder:
/storage/pss/virgo/hfdf/VSR2/v3/256Hz-hp20/output/<DATE>/
To check failed jobs:

ipython
import jdl_config_tools
missingData = jdl_config_tools.check_missing_hfdf_output(inputdir="/storage/pss/virgo/hfdf/VSR4/v0/256Hz-hp20/input/", outputdir="/storage/pss/virgo/hfdf/VSR4/v0/256Hz-hp20/output/<DATE>/", run="VSR2", sampling=256, retry=1, make_jdl=True, date="<DATE>")
Note that by default the JDLs are not re-created (make_jdl=False).
The function creates new JDLs containing only the missing (frequency, sky_bands), and lists them in the file:
hfdf_list_1.txt
where the number is the retry count; this file can be used as a parameter of submit_jobs.sh.
Job submission tool
submit_jobs.sh submits to the Grid a set of jobs listed in the file given with the --file argument. In this tutorial the list is provided by the JDL configuration tools, but of course it can be any list of JDL files!
Options are:
- --file <file list> (compulsory)
- --resubmit (boolean, optional): resubmit all jobs
- --try: number of attempts to submit a job before it is declared failed (optional, default: 3)
submit_jobs.sh tries to submit a JDL a given number of times. If submission fails the JDL is listed in the file failed_<DATE_TIME>.txt, otherwise it goes into submitted_<DATE_TIME>.txt. Therefore it is possible to resubmit failed JDLs by simply re-running the command on failed_<DATE_TIME>.txt.
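In Python-flavoured pseudo-code the retry logic is roughly the following (glite-wms-job-submit is the standard gLite submission command; the rest is an illustrative sketch, the real tool being a bash script):

import subprocess

def submit_with_retries(jdl, tries=3):
    """Sketch: attempt to submit one JDL up to `tries` times.
    True  -> listed in submitted_<DATE_TIME>.txt,
    False -> listed in failed_<DATE_TIME>.txt."""
    for _ in range(tries):
        if subprocess.run(["glite-wms-job-submit", "-a", jdl]).returncode == 0:
            return True
    return False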
Usage example (initialize proxies first! myproxy-init -d -n, voms-proxy-init --voms virgo):

submit_jobs.sh --file failed_20131212-0008.txt --resubmit
Job monitor tool
check_jobs.py checks the status of submitted jobs. Arguments:
- -f SUBMITFILE, --file=SUBMITFILE: JDL list file
- -g, --get-output: get the output of DONE jobs (optional)
- -v, --verbose: verbose mode (boolean, optional)
If verbose mode is OFF, the command gives the summary status of the job collection, such as:
Summary: 44 Cleared - 18 Done(Success) - 18 Running - 28 Waiting
If verbose is ON the status of each job is displayed.
Status is written on the screen and in the log file check_jobs.log.
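The one-line summary boils down to counting statuses, e.g. (illustrative sketch only; the real tool queries the WMS for the status of each job, and its ordering of the counts may differ):

from collections import Counter

def summarize(statuses):
    # Turn a list of per-job statuses into the one-line summary.
    counts = Counter(statuses)
    return "Summary: " + " - ".join(
        "%d %s" % (n, s) for s, n in sorted(counts.items()))

# summarize(["Done(Success)", "Running", "Running", "Waiting"])
# -> 'Summary: 1 Done(Success) - 2 Running - 1 Waiting'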
--get-output retrieves the output of the jobs in the folder jobOut/jobFileName.
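Retrieval maps onto the standard gLite command glite-wms-job-output; a minimal hypothetical sketch:

import os
import subprocess

def get_job_output(job_id, job_file_name):
    # Retrieve the output sandbox of a Done job into jobOut/<jobFileName>,
    # mirroring the folder layout described above.
    subprocess.run(["glite-wms-job-output",
                    "--dir", os.path.join("jobOut", job_file_name),
                    job_id])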
Usage example:
check_jobs.py --file hfdf_list_0.txt --verbose --get-output
Please send corrections and updates to the administrator.
-- AlbeColla - 12 Dec 2013