
Using Stork
Barcelona, 2006
Condor Project
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
Meet Friedrich*
Friedrich is a scientist with a BIG problem.
*Frieda’s twin brother
“I have a lot of data to process.”
Friedrich's problem …
Friedrich has many large data sets to
process. For each data set:
1. stage the data in from a remote server
2. run a job to process the data
3. stage the data out to a remote server
The Classic Data Transfer Job
#!/bin/sh
globus-url-copy source dest
Scripts often work fine for short, simple data transfers, but… (a hand-rolled fault-tolerant version is sketched below)
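Hand-rolling even basic fault tolerance means growing the script yourself. A minimal sketch in plain sh (the URLs, retry count, and delay are illustrative, and note that it still cannot detect corrupt data):

#!/bin/sh
# Hand-rolled retry loop around globus-url-copy; illustrative values only.
src="http://server/path"
dest="file:///dir/file"
tries=0
until globus-url-copy "$src" "$dest"; do
  tries=`expr $tries + 1`
  if [ "$tries" -ge 3 ]; then
    echo "transfer failed after $tries attempts" >&2
    exit 1
  fi
  sleep 60  # back off before retrying
done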
Many things can go wrong!
These errors are more likely with large data sets:
• The network is down.
• The data server is unavailable.
• The transferred data is corrupt.
• The workflow does not know that the data was bad.
Stork Solves Problems
• Introduces the concept of the data placement job
• Data placement jobs are managed and scheduled the same as any Condor job
• Friedrich’s jobs benefit from built-in fault tolerance
Supported Data Transfer Protocols
• local file system
• GridFTP
• FTP
• HTTP
• SRB
• NeST
• SRM
• and it is extensible to other protocols
Fault Tolerance
• Retries failed jobs
• Can also retry a failed data transfer job using an alternate protocol; for example, first try GridFTP, then try FTP (see the sketch below)
• Retries stuck jobs
• Configurable fault responses
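As a sketch, the alternate-protocol fallback can be expressed in the submit file. The alt_protocols attribute name and its value syntax are assumptions here; verify the exact spelling in the Stork section of the Condor Manual:

// Sketch: first try GridFTP, then fall back to FTP.
// The alt_protocols attribute name is an assumption.
[
  dap_type = "transfer";
  src_url  = "gsiftp://server/path";
  dest_url = "file:///dir/file";
  alt_protocols = "ftp-file";
  log      = "transfer.log";
]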
Getting Stork
• Stork is part of Condor, so get Condor…
• Available as a free download from http://www.cs.wisc.edu/condor
• Currently available for Linux platforms
Personal Condor works well with Stork
• This is Condor/Stork on your own workstation: no root access required, no system administrator intervention needed
• After installation, Friedrich submits his jobs to his Personal Stork… (a configuration sketch follows below)
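One common way to run the Stork server alongside a Personal Condor is to enable it in the local Condor configuration. This is a sketch only: the STORK entry in DAEMON_LIST is an assumption to verify against the Stork section of the Condor Manual for your version:

# Hypothetical condor_config.local fragment for a Personal Condor + Stork.
# The STORK daemon name is an assumption; check the manual.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD, STORK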
Friedrich’s Personal Condor
[Diagram: Friedrich's workstation runs a Personal Condor (Central Manager, Master, SchedD, StartD) plus the Stork server. A DAG routes CPU jobs through the SchedD to N compute elements, and data placement jobs through Stork to external data servers.]
Stork will…
• Keep an eye on your data placement jobs and keep you posted on their progress
• Throttle the maximum number of running jobs
• Keep a log of job activities
• Add fault tolerance to all jobs
• Detect and retry failed data placement jobs
The Submit Description File
• Just like the rest of Condor, a plain ASCII text file, but with a different format
• Written in the new ClassAd language
• Neither Stork nor Condor cares about file name extensions
• The contents of the file tell Stork about the job:
  • data placement type, source/destination location/protocol, proxy location, alternate protocols to try
Simple Submit File
// C++ style comment lines
// file name is stage-in.stork
[
  dap_type = "transfer";
  src_url  = "http://server/path";
  dest_url = "file:///dir/file";
  log      = "stage-in.log";
]
Note: different format from Condor submit files
Another Simple Submit File
// C++ style comment lines
// file name is stage-in.stork
[
  dap_type = "transfer";
  src_url  = "gsiftp://server/path";
  dest_url = "file:///dir/file";
  x509proxy = "default";
  log      = "stage-in.log";
]
Note: different format from Condor submit files
Running stork_submit
• Give stork_submit the name of the submit file:
% stork_submit stage-in.stork
• stork_submit parses the submit file, checks it for errors, and sends the job to the Stork server
• stork_submit returns the id of the created job (a job handle)
Sample stork_submit
% stork_submit stage-in.stork
================
Sending request:
[
dest_url = "file:///dir/file";
src_url = "http://server/path";
dap_type = "transfer";
log = "path/stage-in.log";
]
================
Request assigned id: 1
("1" is the job id returned by stork_submit)
The Job Queue
• stork_submit sends the job to the Stork server
• The Stork server manages the local job queue
• View the queue with stork_q or stork_status
Job Status
• stork_q queries all active jobs:
% stork_q
• stork_status queries the given job id, which may be active or complete:
% stork_status 12
Removing jobs
• To remove a data placement job from the queue, use stork_rm
• You may only remove jobs that you own (Unix root may remove anyone’s jobs)
• Give a specific job id to remove a single job:
% stork_rm 21
Use Log Files
// C++ style comment lines
[
  dap_type = "transfer";
  src_url  = "gsiftp://server/path";
  dest_url = "file:///dir/file";
  x509proxy = "default";
  log      = "stage-in.log";
]
Sample Stork User Log
000 (001.-01.-01) 04/17 19:30:00 Job submitted from host: <128.105.121.
...
001 (001.-01.-01) 04/17 19:30:01 Job executing on host: <128.105.121.53
...
008 (001.-01.-01) 04/17 19:30:01 job type: transfer
...
008 (001.-01.-01) 04/17 19:30:01 src_url: gsiftp://server/path
...
008 (001.-01.-01) 04/17 19:30:01 dest_url: file:///dir/file
...
005 (001.-01.-01) 04/17 19:30:02 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
...
Stork and DAGMan
Data placement jobs are integrated with Condor’s DAGMan, and Friedrich benefits.
Defining Friedrich's DAG
1. data to stage in
2. crunch the data
3. data to stage out
Friedrich’s DAG
[DAG diagram: input1 and input2 are parents of crunch; crunch is the parent of result]
The DAG Input File
# file name is friedrich.dag
DATA input1 input1.stork
DATA input2 input2.stork
JOB crunch process.submit
DATA result result.stork
PARENT input1 input2 CHILD crunch
PARENT crunch CHILD result
One of the Stork Submit Files
// file name is input1.stork
[
  dap_type = "transfer";
  src_url  = "http://north.cs.wisc.edu/~friedrich/data1";
  dest_url = "file:///home/friedrich/in1";
  log      = "in1.log";
]
Condor Submit Description File
# file name is process.submit
universe   = vanilla
executable = process
input      = in1
output     = crunch.result
error      = crunch.err
log        = crunch.log
queue
Stork Submit File
// file name is result.stork
[
  dap_type = "transfer";
  src_url  = "file:///home/friedrich/crunch.result";
  dest_url = "http://north.cs.wisc.edu/~friedrich/final.results";
  log      = "result.log";
]
Friedrich Submits the DAG
• While Friedrich’s current working directory is /home/friedrich:
% condor_submit_dag friedrich.dag
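Once the DAG is submitted, Friedrich can watch both halves of the work from the command line; a small usage sketch:

% condor_q   # shows DAGMan itself and the crunch job
% stork_q    # shows the active data placement jobs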
In Review
With Stork, Friedrich now can…
• Submit data processing jobs and go home! Because:
  • Stork manages the data transfers, including fault detection and retry
  • Condor DAGMan manages the dependencies
Additional Resources
• http://www.cs.wisc.edu/condor/stork/
• Condor Manual, Stork section
• [email protected] list
• [email protected] list
Additional Slides
Important Parameters
• STORK_MAX_NUM_JOBS limits the number of active jobs
• STORK_MAX_RETRY limits the attempts for a job before it is marked as failed
• STORK_MAXDELAY_INMINUTES specifies the “hung job” threshold
(a configuration sketch follows below)
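These are set in the Condor configuration files. A sketch with purely illustrative values; check the Stork section of the Condor Manual for defaults and exact semantics:

# Sketch of condor_config.local entries; values are illustrative.
STORK_MAX_NUM_JOBS = 10        # run at most 10 data placement jobs at once
STORK_MAX_RETRY = 5            # mark a job failed after 5 attempts
STORK_MAXDELAY_INMINUTES = 10  # treat a job as hung after 10 minutes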
Current Restrictions
• Currently best suited for “Personal Stork” mode
• Local file paths must be valid on the Stork server, including the submit directory
• To share data, successive jobs in a DAG must use a shared filesystem
Future Work
• Enhance multi-user fair share
• Enhance support for DAGs without a shared file system
• Enhance scheduling with configurable job requirements and rank
• Add DAP job matchmaking
• Additional platform ports