Grid job configuration, submission and control tool (Rome-centric)
Tools to configure, submit and monitor Grid jobs for CW analysis
Introduction
This tutorial deals with the configuration and running of Grid jobs for the "new" (2013) Frequency Hough (HFDF) analysis and for the targeted searches with the Mock Data Challenge (MDC) data. The "old" (2012) Hough case is not covered here.
The Grid tool is in SVN: vgridtools. On the Rome farm it has been downloaded to /opt/exp_soft/virgo/vgridtools.
The package consists of 4 files:
- vgridtools/trunk/tools/jdl_config_tools.py: JDL configuration tool
- vgridtools/trunk/tools/systools.py: system tools used by jdl_config_tools.py
- vgridtools/trunk/tools/submit_jobs.sh: job submission tool
- vgridtools/trunk/tools/check-jobs: tool to check jobs and retrieve the output of completed jobs
The path to the package installed in Rome is already included in the bash PATH (in /etc/bashrc) of virgo-wn1,2,3. Therefore the commands are accessible from any folder by simply typing, e.g., submit_jobs.sh.
JDL Configuration tool
The functions to build the job JDLs are in jdl_config_tools.py. The tool basically consists of two kinds of functions:
- create_<analysisname>_jdl: creates the corresponding set of JDLs from a list of elements.
- check_missing_<analysisname>_output: based on the list of created jobs and by parsing the output files, checks for missing output and re-creates the list.
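As a rough illustration of the first kind of function: it loops over the input list and writes one JDL per element. The sketch below is hypothetical (the function name, the executable and the JDL attributes are assumptions; only the cwd/<DATE> output folder reflects the tool's behaviour):

import os
import datetime

# Hypothetical JDL template using standard gLite JDL attributes;
# the attributes the real tool writes may differ.
JDL_TEMPLATE = """Executable = "{executable}";
Arguments = "{args}";
StdOutput = "std.out";
StdError = "std.err";
OutputSandbox = {{"std.out", "std.err"}};
"""

def create_example_jdl(elements, executable="run_analysis.sh"):
    """Sketch: write one JDL per element of the input list into cwd/<DATE>."""
    date = datetime.date.today().strftime("%Y%m%d")
    outdir = os.path.join(os.getcwd(), date)
    os.makedirs(outdir, exist_ok=True)
    for name in elements:
        with open(os.path.join(outdir, "example_%s.jdl" % name), "w") as f:
            f.write(JDL_TEMPLATE.format(executable=executable, args=name))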
The simplest example is create_mdc_analysis_jdl (and create_mdc_band_extraction_jdl), in which the list is the injected pulsars (the list is available in /opt/exp_soft/virgo/CW/MDC/sources/mdc_infopulsar.txt).
To run the function:

ipython
import jdl_config_tools
jdl_config_tools.create_mdc_analysis_jdl()

(see the source file for the function arguments).
It creates "normal" jobs. For the analysis jobs, the JDL name has the form:
MDC_analysis_<pulsar_name>.jdl
and the output is currently stored (in Rome) in:
/storage/pss/ligo_h/mdc/output/gd_<pulsar_name>_date.mat
By default the JDLs are stored in the folder:
cwd/<DATE>
To check the missing data for the current analysis (MDC LIGO Hanford) do:

ipython
import jdl_config_tools
missingData = jdl_config_tools.check_missing_mdc_analysis_output(retry=1, outputdir="/storage/pss/ligo_h/mdc/output/", date=<date_of_analysis>)
The function creates the list of missing JDLs (MDC_analysis_list_1.txt), to be used with submit_jobs.sh. The file name contains the retry count.
Note that the date must be the date of the "original" analysis, otherwise the listed JDLs won't be found!
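Schematically, the check can be pictured as follows (a simplified hypothetical sketch, not the actual code; the output-file pattern, list-file name and retry count follow the conventions described above):

import glob
import os

def check_missing_example(pulsars, outputdir, date, retry=1):
    """Sketch: find pulsars with no gd_<pulsar>_*.mat output and list
    the corresponding JDLs in MDC_analysis_list_<retry>.txt."""
    missing = []
    for pulsar in pulsars:
        if not glob.glob(os.path.join(outputdir, "gd_%s_*.mat" % pulsar)):
            missing.append(pulsar)
    with open("MDC_analysis_list_%d.txt" % retry, "w") as f:
        for pulsar in missing:
            # the JDLs live in the folder of the *original* analysis date
            f.write(os.path.join(date, "MDC_analysis_%s.jdl" % pulsar) + "\n")
    return missing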
The Hough job functions (e.g. create_hfdf_jdl) instead require a list of (frequency_band, (list_of_sky_bands)) pairs. The function creates one "parametric" job per frequency band, in which the parameter array contains the sky bands for that frequency. The input list is filled automatically by reading the input data set, which is created by the Snag function HFDF_PREPJOB.m.
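The input list can be pictured as follows (a hypothetical illustration; the frequency values are made up):

# One entry per frequency band, each carrying the sky bands to be
# analysed at that frequency; one parametric job is created per entry.
input_list = [
    (64.0,  [0, 1, 2, 3]),   # (frequency_band, list_of_sky_bands)
    (64.25, [0, 1, 2]),
]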
To run the function, e.g. for VSR2:

ipython
import jdl_config_tools
jdl_config_tools.create_hfdf_jdl(run="VSR2")

(see the source file for the function arguments).
The JDL set is created by default in the folder (for VSR2, 256Hz analysis):
cwd/VSR2_256Hz-hp20/<DATE>/
and the file name is:
hfdf_<retry_count>_VSR2_01_<frequency>_<parameter_count>.jdl
where retry_count is one of the function arguments and parameter_count is used to split very long parameter arrays into bunches of 500 elements.
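The splitting can be sketched as follows (hypothetical code; only the 500-element limit and the file-name pattern come from the tool as described above):

def split_parameters(frequency, sky_bands, run="VSR2", retry=0, bunch=500):
    """Sketch: split a long sky-band array into bunches of at most 500
    elements, yielding one (jdl_name, parameters) pair per parametric job."""
    for count, start in enumerate(range(0, len(sky_bands), bunch)):
        name = "hfdf_%d_%s_01_%s_%d.jdl" % (retry, run, frequency, count)
        yield name, sky_bands[start:start + bunch]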
The output data is written by default in the folder:
/storage/pss/virgo/hfdf/VSR2/v3/256Hz-hp20/output/<DATE>/
To check failed jobs:

ipython
import jdl_config_tools
missingData = jdl_config_tools.check_missing_hfdf_output(inputdir="/storage/pss/virgo/hfdf/VSR4/v0/256Hz-hp20/input/", outputdir="/storage/pss/virgo/hfdf/VSR4/v0/256Hz-hp20/output/<DATE>/", run="VSR2", sampling=256, retry=1, make_jdl=True, date="<DATE>")
Note that by default the JDLs are not re-created (make_jdl=False).
The function creates new JDLs containing only the missing (frequency, sky_bands), and lists them in the file:
hfdf_list_1.txt
where the number is the retry count; this file can be used as a parameter of submit_jobs.sh.
Job submission tool
submit_jobs.sh submits to the Grid a set of jobs listed in the file given with the --file argument. In this tutorial the list is provided by the JDL configuration tools, but of course it can be any list of JDL files!
Options are:
- --file <file list> (compulsory)
- --resubmit (boolean, optional): resubmit all jobs
- --try: number of attempts to submit a job before it is declared failed (optional, default: 3)
submit_jobs.sh tries to submit a JDL a given number of times. If submission fails the JDL is listed in the file failed_<DATE_TIME>.txt, otherwise it goes into submitted_<DATE_TIME>.txt. Therefore it is possible to resubmit failed JDLs by simply re-running the command on failed_<DATE_TIME>.txt.
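In Python-flavoured pseudo-code the retry logic is roughly the following (glite-wms-job-submit is the standard gLite submission command; the rest is an illustrative sketch, the real tool being a bash script):

import subprocess

def submit_with_retries(jdl, tries=3):
    """Sketch: attempt to submit one JDL up to `tries` times.
    True  -> listed in submitted_<DATE_TIME>.txt,
    False -> listed in failed_<DATE_TIME>.txt."""
    for _ in range(tries):
        if subprocess.run(["glite-wms-job-submit", "-a", jdl]).returncode == 0:
            return True
    return False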
Usage example (initialize proxies first! myproxy-init -d -n, voms-proxy-init --voms virgo):

submit_jobs.sh --file failed_20131212-0008.txt --resubmit
Job monitor tool
check_jobs.py checks the status of submitted jobs. Arguments:
- -f SUBMITFILE, --file=SUBMITFILE: JDL list file
- -g, --get-output: get the output of DONE jobs (optional)
- -v, --verbose: verbose mode (boolean, optional)
If verbose mode is OFF, the command gives the summary status of the job collection, such as:
Summary: 44 Cleared - 18 Done(Success) - 18 Running - 28 Waiting
If verbose is ON the status of each job is displayed.
Status is written on the screen and in the log file check_jobs.log.
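The one-line summary boils down to counting statuses, e.g. (illustrative sketch only; the real tool queries the WMS for the status of each job, and its ordering of the counts may differ):

from collections import Counter

def summarize(statuses):
    # Turn a list of per-job statuses into the one-line summary.
    counts = Counter(statuses)
    return "Summary: " + " - ".join(
        "%d %s" % (n, s) for s, n in sorted(counts.items()))

# summarize(["Done(Success)", "Running", "Running", "Waiting"])
# -> 'Summary: 1 Done(Success) - 2 Running - 1 Waiting'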
--get-output retrieves the output of the jobs in the folder jobOut/jobFileName.
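Retrieval maps onto the standard gLite command glite-wms-job-output; a minimal hypothetical sketch:

import os
import subprocess

def get_job_output(job_id, job_file_name):
    # Retrieve the output sandbox of a Done job into jobOut/<jobFileName>,
    # mirroring the folder layout described above.
    subprocess.run(["glite-wms-job-output",
                    "--dir", os.path.join("jobOut", job_file_name),
                    job_id])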
Usage example:
check_jobs.py --file hfdf_list_0.txt --verbose --get-output
Please send corrections and updates to the administrator.
-- AlbeColla - 12 Dec 2013