Data transfer with Grid tools: Basic User Guide

by Alberto Colla last modified 2013-04-09 11:30

This page collects information about the Grid Data Transfer (DT) framework. We start with a very basic how-to: start a transfer, monitor it, and so on.

DT software location

The DT runs on virgo-ui.virgo.infn.it, which hosts the LCG tools for data transfer and management in the Grid environment.

The DT software is in CVS, under the vgridtools root. A test instance is available in:

/virgoDev/vgridtools

An important requirement is that the files to be transferred are POSIX-accessible in the local storage.


DT components

The main DT files are:

  • vgridtools/TS/TS.py: the DT code;
  • vgridtools/TS/TS.ini: DT configuration file;
  • vgridtools/TS/TS.log: DT log file;
  • vgridtools/TS/socketServer.py: socket server implementing the communication protocol;
  • vgridtools/TS/ts_send_command.py: socket client.


DT features

Here are the main features of the transfer framework:

  • It implements a communication channel to the DAQ via a socket server/client: new files are directly added to the transfer queues.
  • The same channel also acts as a user interface (command queue) to the data transfer: the user can manually add or remove files from the transfer queues, open/close transfer flows to a transfer endpoint, etc.
  • Thanks to the Grid LCG tools, a single command (lcg-cp) handles both the initial local->remote transfer and third-party transfers (replicas between two remote storage servers).
  • It implements "load balancing" among the data flows: the system keeps track of the current number of incoming and outgoing flows at each endpoint, and chooses the next replica endpoint among the least "crowded" ones (a minimal sketch of this selection is given after this list). Furthermore, the configuration allows limits to be set on the number of concurrent incoming/outgoing flows at each endpoint.
  • It computes local and remote file checksums.
  • The remote file replicas are added to the Grid LCG File Catalogue (LFC), which provides a solution for data bookkeeping and a transparent way to access data stored in a distributed environment.
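
The endpoint selection can be pictured with a minimal Python sketch; the Endpoint class, field names and limits below are illustrative, not the actual TS.py code:

class Endpoint:
    def __init__(self, host, max_in, max_out):
        self.host = host
        self.max_in = max_in      # limit on concurrent incoming flows
        self.max_out = max_out    # limit on concurrent outgoing flows
        self.flows_in = 0         # current incoming transfers
        self.flows_out = 0        # current outgoing transfers
        self.enabled = True       # can be opened/closed by the operator

def choose_destination(endpoints):
    # Pick the least "crowded" endpoint that still accepts incoming flows;
    # if none is available the transfer task is delayed.
    candidates = [e for e in endpoints if e.enabled and e.flows_in < e.max_in]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.flows_in + e.flows_out)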


"Robustness" features:
  • Each transfer (lcg-cp) runs in a subprocess launched by the main thread. The main thread monitors the subprocess, kills it in case of timeout, traces transfer failures, etc. (see the sketch after this list).
  • The system keeps track of each file's transfer failures. After a given number of transfer retries the file is marked as "failed" and put in a dead queue. It can be put back into the transfer queue manually by the operator, through the socket command queue.
  • The system keeps track of the transfer failures from/to each endpoint. An endpoint is automatically closed if transfers from/to it consecutively fail more than a given (configurable) number of times. It can later be re-opened by the operator through the command queue.
  • The command queue is persistent, so if the DT process crashes the commands are not lost and are processed once the DT is back.
  • The DAQ infrastructure also keeps track of the messages in case they don't reach the socket server, so files are not lost even if the socket server crashes.
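
A minimal sketch of this subprocess monitoring, using Python 3's subprocess timeout support (the timeout value and the lcg-cp arguments are illustrative, not the actual TS.py code):

import subprocess

def run_transfer(src_url, dst_url, timeout=3600):
    # Launch lcg-cp in a subprocess and kill it if it exceeds the timeout.
    cmd = ["lcg-cp", "-v", src_url, dst_url]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()              # transfer hung: kill it and record a failure
        proc.communicate()
        return False
    return proc.returncode == 0  # a non-zero return code counts as a failure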


The DT was tested during VSR3. It transferred ~3 TB of raw data (5 days) to CNAF (disk/GPFS) and IN2P3 (tape/HPSS via dCache).
Although the test ran in parallel with the "official" data transfer, and therefore not in ideal conditions, no latency was observed and
the robustness of the framework was confirmed.


DT configuration: TS.ini


Work in progress


How to run the DT (admin only)

The DT needs a valid Grid certificate and proxy. Currently the DT uses Alberto Colla's certificate, with automatic proxy renewal. A Robot certificate is foreseen for the production environment.
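
For reference, a proxy for the Virgo VO is typically created with the standard VOMS client tools (assuming they are installed and the user certificate is in place):

voms-proxy-init --voms virgo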


start the DT process: ./vgridtools/TS/startTS.sh

start the socket server: ./vgridtools/TS/startSocket.sh


DT web interface

The DT provides a dynamic web interface. In the available test configuration the content is served on port 8564.

The easiest way to display the web interface is to open an SSH tunnel to virgo-ui.virgo.infn.it:8564:

ssh <username>@virgo-ui.virgo.infn.it -L 8564:virgo-ui.virgo.infn.it:8564 -N

then open a browser and type:

http://localhost:8564/status.esp

Below is a screenshot of the display:

[DT Display screenshot]

The tables displayed are:

1. SE Status: displays the status of the various transfer endpoints enabled in the configuration.
  • SE hostname: the endpoint host name (the first row is the local storage).
  • Status: True if the storage is ON, False otherwise.
  • Data IN/OUT: the number of data streams currently incoming to/outgoing from the storage.
  • Consecutive failures: number of consecutive transfer failures to/from this storage. If the maximum number of allowed failures is exceeded, the storage endpoint is automatically set OFF; it can later be turned ON by the user (see the Communication Protocol chapter).
  • Next status switch: time of the next scheduled status switch (see the Communication Protocol chapter).
  • Comment: comment associated with the user's status switch (see the Communication Protocol chapter).

2. Queue status: displays the number of transfer tasks in each queue (Checksum/Transfer/Register/Done/Delayed/Dead).
  • Checksum: local checksum
  • Transfer: transfer from local to remote endpoint, or between two remote endpoints
  • Register: registration of the transferred files in the Grid LCG File Catalogue (LFC)
  • Done: number of done tasks
  • Delayed: Number of delayed tasks (a task is delayed if, for instance, an endpoint is OFF, or if the maximum number of concurrent transfers is reached)
  • Dead: Number of failed transfers
3. File status: the status of each transfer task
  • File name: the file name
  • Size: the file size
  • Status: Current status
  • Local checksum: local checksum
  • Remote checksums: an array containing the checksums of the remote files (one for each endpoint, in the same order as the SE Status table). Clearly remote and local checksums must coincide!
  • Transfer status: array containing the transfer status to each endpoint: None|Error|Done|Failed
  • Transfer failures: array containing the number of transfer failures to each endpoint. If the transfer fails more than the maximum allowed number of times it is considered Failed.
  • Register status: array containing the registering status of each replica.
  • Register failures: array containing the number of register failures of each replica.
  • File is valid: True if the file is valid. The user can set the file as "invalid" via a socket server command; in this case the task is removed from the queue.
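
For reference, the per-file record implied by the File status table can be sketched in Python as follows (the class and field names are illustrative, not the actual TS.py attributes):

class TransferTask:
    def __init__(self, file_name, size, n_endpoints):
        self.file_name = file_name
        self.size = size                                # file size
        self.status = "Checksum"                        # current queue/status of the task
        self.local_checksum = None
        self.remote_checksums = [None] * n_endpoints    # one per endpoint, same order as SE Status
        self.transfer_status = ["None"] * n_endpoints   # None|Error|Done|Failed
        self.transfer_failures = [0] * n_endpoints
        self.register_status = ["None"] * n_endpoints
        self.register_failures = [0] * n_endpoints
        self.valid = True                               # set False by the operator to drop the task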

DT-User communication protocol

The DT communicates with the Virgo FrameBuilder and with the administrators over TCP, using a client-server framework.

The server is implemented in the socketServer.py script, and the client in the ts_send_command.py script, which can be downloaded here.
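
In essence, the client sends the command string to the server over a TCP socket. A minimal Python sketch of such a client is shown below; the host, port and message format are assumptions for illustration, the actual values live in the DT configuration and in ts_send_command.py:

import socket

def send_command(command, args, host="virgo-ui.virgo.infn.it", port=50000):
    # Send "<command> arg1:arg2:...:argN" to the DT socket server (illustrative).
    message = command + " " + ":".join(args)
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(message.encode())
        reply = sock.recv(4096)      # read the server acknowledgement, if any
    finally:
        sock.close()
    return reply

# Example (hypothetical port): queue a raw file for transfer
# send_command("fileadd", ["/data/rawdata/v102/V-raw-1040911950-150.gwf", "raw", "VSR4", "test"])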


DT user commands


To send a command to the DT:

./ts_send_command.py <command> arg1:arg2:...:argN

Here is the list of available commands and their corresponding arguments:

  • fileadd source_file:type:stream:comment
adds source_file to the transfer queue. source_file must be POSIX-accessible to the DT server. type and stream are converted into a path in the remote storages and in the LFC (a path-mapping sketch follows the command list). For instance, with the arguments:
fileadd /src_path_prefix/filename:raw:1:test
the file will be stored as follows in the remote storages and the LFC:
/dest_path_prefix/raw/1/filename
where dest_path_prefix is a root path defined in the configuration.
Slashes (/) are allowed in the type and stream arguments, which will result in a "deeper" file path.
  • filedel source_file:stream:comment
removes source_file from the transfer queue. source_file and stream must be the same as in the fileadd case. Warning: existing remote replicas will not be deleted automatically!
  • filemv old_source_file:old_stream:new_source_file:new_stream:comment
"renames" a file in the transfer queue. Warning: existing remote replicas with the "old" args will not be deleted automatically!
  • filereprocess source_file:stream:comment
reprocesses a file in the dead queue (whose transfer failed)
  • reprocess_all
reprocesses all files in the dead queue (failed files)
  • start channel:delay:comment

"opens" a remote endpoint (channel). Example:

start storm-fe-archive.cr.cnaf.infn.it:3600:test

opens transfers to the CNAF endpoint with a delay of 1 hour (3600 s)

  • stop channel:delay:comment

"closes" a remote endpoint (channel). Example:

stop storm-fe-archive.cr.cnaf.infn.it:3600:test

stops transfers to the CNAF endpoint with a delay of 1 hour (3600 s)

  • shutdown
terminates the DT server. Do not try it if you want to keep playing with the test setup!
  • shutdown_socket
terminates the socket server. Do not try it if you want to keep playing with the test setup!
  • qreset
resets all transfer queues.
  • reloadConfig
reloads the configuration (issue it after changing TS.ini)
  • clean_done
removes DONE files from the Done queue
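
As a reference for the fileadd path mapping described above, here is a minimal Python sketch (dest_path_prefix is a configuration value; the function name is illustrative):

import os

def destination_path(source_file, ftype, stream, dest_path_prefix):
    # Map (source_file, type, stream) to the path used in the remote storages
    # and in the LFC, e.g. /src_path_prefix/filename + raw + 1 -> /dest_path_prefix/raw/1/filename
    filename = os.path.basename(source_file)
    return os.path.join(dest_path_prefix, ftype, stream, filename)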

User tutorial (try it!)

The test instance of the DT in virgo-ui.virgo.infn.it:/virgoDev/vgridtools is already running, so you don't have to start it.

Below are the minimal instructions to see the DT working, from a "user" perspective:


1. Open port 8564 via an SSH tunnel to access the web interface:

ssh <username>@virgo-ui.virgo.infn.it -L 8564:virgo-ui.virgo.infn.it:8564 -N

then open a browser and type:

http://localhost:8564/status.esp


2. You may also want to log in to virgo-ui.virgo.infn.it, or to another machine in the Cascina domain that shares /virgoDev, and follow the DT log:

ssh <username>@virgo-ui.virgo.infn.it

cd /virgoDev/vgridtools/TS

tail -f TS.log


3. Now, from another shell on virgo-ui.virgo.infn.it or on another machine in the Cascina domain, add a file (e.g. a raw file) to the transfer queue using the attached ts_send_command.py:

./ts_send_command.py fileadd /data/rawdata/v102/V-raw-1040911950-150.gwf:raw:VSR4:test

With the current configuration the file will be replicated to CNAF and IN2P3, and registered in the LFC.

You may also try adding the same file to the queue again, adding a non-existing file, closing and reopening a stream... and see what happens! A few example commands are given below.
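
For instance (the delay value is illustrative; the endpoint is the CNAF one used in the examples above):

./ts_send_command.py fileadd /data/rawdata/v102/V-raw-1040911950-150.gwf:raw:VSR4:test
./ts_send_command.py stop storm-fe-archive.cr.cnaf.infn.it:60:test
./ts_send_command.py start storm-fe-archive.cr.cnaf.infn.it:60:test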

4. At CNAF and Lyon the file will be accessible via Grid tools. To check it you need a valid Grid proxy, and you must define the following environment variables:

export LFC_HOST=lfcserver.cnaf.infn.it
export LFC_HOME=/grid/virgo/
export LCG_GFAL_VO=virgo

In the LFC the files will be registered under /grid/virgo/TransferTest2013. To query them (from virgo-ui.virgo.infn.it):

$ lfc-ls TransferTest2013/raw/VSR4
V-raw-1040911950-150.gwf


To list the file replicas:

$ lcg-lr lfn:/grid/virgo/TransferTest2013/raw/VSR4/V-raw-1040911950-150.gwf
srm://ccsrm02.in2p3.fr/pnfs/in2p3.fr/data/virgo/tape/TransferTest2013/raw/VSR4/V-raw-1040911950-150.gwf
srm://storm-fe-archive.cr.cnaf.infn.it/virgo3/TransferTest2013/raw/VSR4/V-raw-1040911950-150.gwf

To download the file:

$ lcg-cp -v lfn:/grid/virgo/TransferTest2013/raw/VSR4/V-raw-1040911950-150.gwf file:`pwd`/myfile.gwf

Follow this link for a simple tutorial on these and other LFC/LCG tools:

http://www-numi.fnal.gov/computing/minossoft/releases/R2.0/GridTools/docs/data_lfc_lcg.html


At CNAF the transferred files will also be accessible locally from the Virgo user interfaces (ui01-virgo.cnaf.infn.it and ui02-virgo.cnaf.infn.it), under the folder /storage/gpfs_virgo3/virgo/TransferTest2013/:

/storage/gpfs_virgo3/virgo/TransferTest2013/raw/VSR4/V-raw-1040911950-150.gwf