Using and Administering
Condor
Alain Roy
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
24-June-2002
Good evening!
› Thank you for having me!
› I am:
Alain Roy
Computer Science Ph.D. in Quality of Service, with the Globus Project
Working with the Condor Project
www.cs.wisc.edu/condor
Condor Tutorials Remaining
› Monday (Today), 17:00-19:00
Using and administering Condor
› Tuesday, 17:00-19:00
Using Condor on the Grid
Review: What is Condor?
› Condor converts collections of
distributively owned workstations and
dedicated clusters into a distributed
high-throughput computing facility.
Run lots of jobs over a long period of time,
Not a short burst of “high-performance”
› Condor manages both machines and jobs
with ClassAd Matchmaking to keep
everyone happy
Condor Takes Care of You
› Condor does whatever it takes to run
your jobs, even if some machines…
Crash (or are disconnected)
Run out of disk space
Don’t have your software installed
Are frequently needed by others
Are far away & managed by someone
else
What is Unique about Condor?
› ClassAds
› Transparent checkpoint/restart
› Remote system calls
› Works in heterogeneous clusters
› Clusters can be:
Dedicated
Opportunistic
What’s Condor Good For?
› Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
• Mechanisms to help you manage huge numbers of jobs (thousands), all the data, etc.
• Condor can handle inter-job dependencies (DAGMan)
What’s Condor Good For?
(cont’d)
› Robustness
• Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion
• If an execute machine crashes, you only lose work done since the last checkpoint
• Condor maintains a persistent job queue: if the submit machine crashes, Condor will recover
• (Story)
What’s Condor Good For?
(cont’d)
› Giving your job the agility to access more computing resources
• Checkpointing allows your job to run on “opportunistic resources” (not dedicated)
• Checkpointing also provides “migration”: if a machine is no longer available, move!
• With remote system calls, run on systems which do not share a filesystem. You don’t even need an account on a machine where your job executes
Other Condor features
› Implement your policy on when the
jobs can run on your workstation
› Implement your policy on the
execution order of the jobs
› Keep a log of your job activities
A Condor Pool In Action
A Bit of Condor Philosophy
› Condor brings more computing to
everyone
A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.
A large collaboration can use Condor to control its dedicated pool with hundreds of machines.
The Idea
Computing power
is everywhere,
we try to make it usable by
anyone.
Remember Frieda?
Today we’ll revisit
Frieda’s Condor
explorations in
more depth
I have 600
simulations to run.
Where can I get
help?
Install a Personal
Condor!
Installing Condor
› Download Condor for your operating system
› Available as a free download from http://www.cs.wisc.edu/condor
› Available for most Unix platforms and Windows NT
So Frieda Installs Personal
Condor on her machine…
› What do we mean by a “Personal”
Condor?
Condor on your own workstation, no root
access required, no system administrator
intervention needed—easy to set up.
Personal Condor?!
What’s the benefit of a
Condor “Pool” with just one
user and one machine?
Your Personal Condor will ...
› Keep an eye on your jobs and keep you posted on their progress
› Keep a log of your job activities
› Add fault tolerance to your jobs
› Implement your policy on when the jobs can run on your workstation
What’s in a Personal
Condor?
› Everything that is in Condor, just on one machine.
› Condor daemons:
• condor_master
• condor_collector: stores ClassAds for jobs and machines
• condor_negotiator: matchmaking
• condor_schedd: submits and monitors jobs
• condor_startd: starts jobs
• condor_starter: launches a job
• condor_shadow: monitors a remote job
A Condor Pool of One
[Figure: a single machine running condor_master, condor_negotiator, condor_schedd, condor_collector, condor_shadow, condor_startd, and condor_starter, together with the Condor job.]
condor_master
› Starts up all other Condor daemons
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
condor_master (cont’d)
› Acts as the server for many Condor
remote administration commands:
condor_reconfig, condor_restart,
condor_off, condor_on,
condor_config_val, etc.
condor_startd
› Represents a machine to the Condor
system
› Responsible for starting, suspending,
and stopping jobs
› Enforces the wishes of the machine
owner (the owner’s “policy”… more on
this soon)
condor_schedd
› Represents users to the Condor system
› Maintains the persistent queue of jobs
› Responsible for contacting available machines and sending them jobs
› Services user commands which manipulate the job queue:
• condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …
condor_collector
› Collects information from all other Condor daemons in the pool
• “Directory Service” / Database for a Condor pool
› Each daemon sends a periodic update called a “ClassAd” to the collector
› Services queries for information:
• Queries from other Condor daemons
• Queries from users (condor_status)
condor_negotiator
› Performs “matchmaking” in Condor
› Gets information from the collector about all available machines and all idle jobs
› Tries to match jobs with machines that will serve them
› Both the job and the machine must satisfy each other’s requirements
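The two-sided match described above can be sketched in Python. This is only an illustration under simplifying assumptions: jobs and machines are plain dictionaries, requirements are Python functions rather than ClassAd expressions, and the names `matches` and `matchmake` are hypothetical, not part of Condor.

```python
# Hypothetical, simplified model of two-sided matchmaking.
# Real Condor evaluates ClassAd expressions; here requirements are Python callables.

def matches(job, machine):
    """A match requires that each side's requirements accept the other side."""
    return job["requirements"](machine) and machine["requirements"](job)

def matchmake(jobs, machines):
    """Greedily pair each idle job with the first free machine that matches."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)  # a machine serves one job at a time
                break
    return pairs

jobs = [
    {"name": "sim-1", "owner": "frieda", "requirements": lambda m: m["memory"] > 512},
    {"name": "sim-2", "owner": "frieda", "requirements": lambda m: m["memory"] > 64},
]
machines = [
    {"name": "small", "memory": 128, "requirements": lambda j: True},
    {"name": "big", "memory": 1024, "requirements": lambda j: j["owner"] == "frieda"},
]

print(matchmake(jobs, machines))  # [('sim-1', 'big'), ('sim-2', 'small')]
```

The job "sim-1" is rejected by "small" because the job's own requirement fails, even though the machine would accept it; both sides must agree.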
Frieda wants more…
› She decides to use the graduate students’ computers when they aren’t in use, and get done sooner.
› In exchange, they can use
the Condor pool too.
Frieda’s Condor pool…
Frieda’s Computer:
Central Manager
Graduate Students’ Desktop Computers
A larger Condor pool
[Figure: four kinds of machines in a larger pool]
Submitter: condor_master, condor_schedd, condor_shadow
Collector: condor_master, condor_negotiator, condor_collector
Executor: condor_master, condor_startd, condor_starter, Condor job
Submitter/Executor: condor_master, condor_schedd, condor_startd, condor_shadow, condor_starter, Condor job
Happy Day! Frieda’s
organization purchased a
Beowulf Cluster!
› Other scientists in her department have realized the power of Condor and want to share it.
› The Beowulf cluster and
the graduate student
computers can be part of
a single Condor pool.
Frieda’s Condor pool…
Graduate Students’ Desktop Computers
Central Manager
Beowulf Cluster
How would you set it up?
› Grad student machines:
Submitters
Executors
› Beowulf cluster machines
Executors only
› Independent machine for collector/negotiator
Big job: take it away from Frieda’s computer
Could split collector and negotiator
Frieda collaborates…
› She wants to share her
Condor pool with
scientists from another
lab.
Condor Flocking
› Condor pools can work cooperatively
How would you set it up?
› Two independent pools
Each has its own collector/negotiator
› Set up flocking from one pool to
another: by machine, or by pool.
FLOCK_TO <machine>
FLOCK_FROM <machine>
› Can be uni- or bi-directional
Questions So Far?
How do you run a job?
› It doesn’t matter if you have:
 Personal Condor
 Large Condor pool
 Condor pool with flocking
› Four steps
1. Write program
2. Write submit file
3. Give it to Condor
4. Condor gives you the results
Step 1: Writing a program
› Condor has universes
Vanilla Universe:
• Run anything
• Less capable
Java Universe: Works better for Java
Standard Universe:
• Checkpointing
• Remote I/O
• Can’t work with all programs
Step 1: Vanilla Universe
› You can run any program
C/C++/Perl/Python/Fortran/Java/Lisp…
No checkpointing: if your job is
interrupted or the machine crashes,
Condor has to restart it from the
beginning.
Can do anything you could do if you were
logged in.
Step 1: Java Universe
› Works better for Java programs
› Checks for valid Java environment
› Distinguishes Java environment
exceptions from program exceptions
(wrapper program)
› No checkpointing (though it could be added)
› Remote I/O
Step 1: Standard Universe
› Requires re-linking your program
• condor_compile gcc -o simple simple.o
› Allows checkpointing and remote I/O
› Restrictions on behavior
No threading
Limited networking
Restrictions on compiler used
Step 2: Write submit file
Executable   = simple
Universe     = vanilla
Arguments    = First
Log          = simple.log
Output       = simple.output
Error        = simple.error
Requirements = Memory > 512
Queue
Note: This assumes a shared filesystem
Step 2: Write submit file
Executable   = simple
Universe     = vanilla
Arguments    = First
Log          = simple.log
Output       = simple.output
Error        = simple.error
Transfer_input_files  = data.in
Transfer_output_files = data.out
Requirements = Memory > 512
Queue
Note: This does not assume a shared filesystem
Step 2: Write submit file
Executable   = simple
Universe     = standard
Arguments    = First
Log          = simple.log
Output       = simple.output
Error        = simple.error
Requirements = Memory > 512
Queue
Note: This does not assume a shared filesystem; it relies on remote I/O
Step 2: Submit Files
› Condor is helpful: it expands what you wrote into the real requirements expression:
Requirements = memory > 512 becomes…
Requirements = (OpSys == "Linux") && (memory > 512) && <shared-filesystem>…
› Queue can take a parameter (more later)
› A single file can submit many jobs
Step 3: Give it to Condor
› condor_submit submit.desc
› condor_q
-- Submitter: dsonokwa.cs.wisc.edu : <128.105.175.130:36280> : dsonokwa.cs.wisc.edu
 ID      OWNER   SUBMITTED     RUN_TIME ST PRI SIZE CMD
 5.0     roy     6/15 20:51  0+00:00:02 R  0   0.0  simple First
1 jobs; 0 idle, 1 running, 0 held
Step 4: Condor gives it
back
› The program’s output is where you
asked it to be.
› Condor left a log file documenting
what it did.
› Condor optionally sends you an email
telling you it’s done.
Step 4: Condor gives it back
000 (34364.000.000) 06/15 21:00:01 Job submitted from host: <128.105.146.14:34918>
001 (34364.000.000) 06/15 21:00:01 Job executing on host: <128.105.146.36:34918>
005 (34364.000.000) 06/15 21:00:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
Step 4: Condor gives it
back
Date: Sat, 15 Jun 2002 21:00:06 -0500 (CDT)
From: Condor Project <[email protected]>
Message-Id: <[email protected]>
To: [email protected]
Subject: [Condor] Condor Job 34364.0
This is an automated email from the Condor system
on machine "beak.cs.wisc.edu". Do not reply.
Your condor job exited with status 0.
Job: /scratch/roy/condor/simple/simple First
Clusters and Processes
› If your submit file describes multiple jobs, we call this a “cluster”.
› Each job within a cluster is called a “process” or “proc”.
› If you only specify one job, you still get a cluster, but it has only one process.
› A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)
› Process numbers always start at 0.
Example Submit Description
File for a Cluster
# Example condor_submit input file that defines
# a whole cluster of jobs at once
Universe   = standard
Executable = simple
Output     = my_job.stdout
Error      = my_job.stderr
Log        = my_job.log
Arguments  = -arg1 -arg2
InitialDir = /home/roy/condor/run.$(Process)
Queue 500
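To make the effect of `Queue 500` and `$(Process)` concrete, here is a small Python sketch that lists the job IDs and working directories such a submit file would describe. The function name `expand_cluster` is hypothetical, and the cluster number 23 is an assumption borrowed from the "23.5" example; Condor assigns the real cluster number at submit time.

```python
def expand_cluster(cluster_id, initial_dir_template, count):
    """Return (job_id, initial_dir) for each process in a cluster.

    Process numbers start at 0, and $(Process) is replaced by the
    process number, mirroring the submit file macro.
    """
    jobs = []
    for proc in range(count):
        job_id = f"{cluster_id}.{proc}"
        initial_dir = initial_dir_template.replace("$(Process)", str(proc))
        jobs.append((job_id, initial_dir))
    return jobs

# Assumed cluster number 23; the template is the InitialDir line from the slide.
jobs = expand_cluster(23, "/home/roy/condor/run.$(Process)", 500)

print(len(jobs))   # 500
print(jobs[0])     # ('23.0', '/home/roy/condor/run.0')
print(jobs[-1])    # ('23.499', '/home/roy/condor/run.499')
```

Each process therefore reads and writes in its own directory, so the 500 jobs can share the same file names (my_job.stdout and so on) without overwriting each other.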
Questions So Far?
condor_q
› Find out the status of your jobs from your condor_schedd.
› condor_q cluster: all jobs in a cluster
› condor_q cluster.proc: particular job
› condor_q -sub name: jobs for a particular user
Temporarily halt a Job
› Use condor_hold to place a job on
hold
Kills job if currently running
Will not attempt to restart job until
released
› Use condor_release to remove a hold
and permit job to be scheduled again
condor_rm
› You submitted a job, but you want to
cancel it
› condor_rm clusterid: all jobs in cluster
condor_rm 6
› condor_rm clusterid.procid: specific job
condor_rm 6.3
› condor_rm -all: all of your jobs
› Can only remove your jobs
› Reflected in job log
condor_status
› Find status of pool from condor_collector (simplified view here)
Name          OpSys  Arch   State      Activity
carmi.cs.wisc LINUX  INTEL  Unclaimed  Idle
coral.cs.wisc LINUX  INTEL  Unclaimed  Idle
doc.cs.wisc.e LINUX  INTEL  Unclaimed  Idle
dsonokwa.cs.w LINUX  INTEL  Unclaimed  Idle
...
           Machines  Owner  Claimed  Unclaimed
LINUX            12      2        0         10
SOLARIS28         5      0        0          5
Total            17      2        0         15
condor_status
› condor_status -run: which machines are running jobs
› condor_status -sub: whose jobs are running?
› condor_status -constraint: restrict the output to a subset defined by the user
DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the
dependencies between your Condor jobs, so
it can manage them automatically for you.
› (e.g., “Don’t run job “B” until job “A” has
completed successfully.”)
What is a DAG?
› A DAG is the data structure
used by DAGMan to represent
these dependencies.
› Each job is a “node” in the
DAG.
› Each node can have any
number of “parent” or
“children” nodes – as long as
there are no loops!
[Figure: a DAG with four nodes, Job A, Job B, Job C, and Job D.]
Defining a DAG
› A DAG is defined by a .dag file, listing each of its
nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
[Figure: the diamond DAG, with Job A at the top, Jobs B and C in the middle, and Job D at the bottom.]
› Each node will run the Condor job specified by its accompanying Condor submit file
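The dependency rules in diamond.dag can be explored with a short Python sketch. It parses only the `Job` and `Parent ... Child ...` lines shown on the slide and groups the nodes into waves that could run once their parents have finished. The functions `parse_dag` and `run_order` are hypothetical helpers for illustration; they are not DAGMan code and ignore every other DAGMan keyword.

```python
DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return (submit_files, parents) from Job and Parent/Child lines."""
    submit_files = {}   # node name -> submit file
    parents = {}        # node name -> set of parent node names
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].lower()
        if keyword == "job":
            submit_files[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return submit_files, parents

def run_order(parents):
    """Group nodes into waves; a node is ready when all its parents are done."""
    done, waves = set(), []
    remaining = set(parents)
    while remaining:
        ready = sorted(n for n in remaining if parents[n] <= done)
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves

submit_files, parents = parse_dag(DIAMOND_DAG)
print(run_order(parents))  # [['A'], ['B', 'C'], ['D']]
```

B and C appear in the same wave because neither depends on the other, which is why they can sit in the Condor queue at the same time.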
Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
% condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe
Job with DAGMan as the executable.
› Thus the DAGMan daemon itself runs as a Condor
job, so you don’t have to baby-sit it.
Running a DAG
› DAGMan acts as a “meta-scheduler”,
managing the submission of your jobs to
Condor based on the DAG dependencies.
[Figure: DAGMan reads the .dag file and submits job A to the Condor job queue; B, C, and D wait in DAGMan.]
Running a DAG (cont’d)
› DAGMan holds & submits jobs to the
Condor queue at the appropriate times.
[Figure: with A complete, jobs B and C are in the Condor job queue; D waits in DAGMan.]
Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it
can no longer make progress, and then creates a
“rescue” file with the current state of the DAG.
[Figure: one job has failed (marked X); the Condor job queue is empty and DAGMan writes a rescue file.]
Recovering a DAG
› Once the failed job is ready to be re-run,
the rescue file can be used to restore the
prior state of the DAG.
[Figure: DAGMan reads the rescue file and resubmits job C to the Condor job queue.]
Recovering a DAG (cont’d)
› Once that job completes, DAGMan will
continue the DAG as if the failure never
happened.
[Figure: job D is now in the Condor job queue.]
Finishing a DAG
› Once the DAG is complete, the DAGMan
job itself is finished, and exits.
[Figure: all four jobs are complete and the Condor job queue is empty.]
Additional DAGMan
Features
› Provides other handy features for job management…
Nodes can have PRE & POST scripts
Failed nodes can be automatically retried a configurable number of times
Job submission can be “throttled”
Questions So Far?
What if each job
needed to run for 20
days?
What if I wanted to
interrupt a job with a
higher priority job?
Condor’s Standard Universe
to the rescue!
› Condor can support various combinations of features/environments in different “Universes”
› Different Universes provide different functionality for your job:
• Vanilla: runs any serial job
• Java: well suited for Java programs
• Standard: support for transparent process checkpoint and restart
Process Checkpointing
› Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file
• Memory, CPU, I/O, etc.
› The process can then be restarted from right where it left off
› Typically no changes to your job’s source code are needed; however, your job must be relinked with Condor’s Standard Universe support library
Linking for Standard
Universe
To do this, just place “condor_compile”
in front of the command you normally
use to link your job:
condor_compile gcc -o myjob myjob.c
OR
condor_compile f77 -o myjob filea.f fileb.f
Limitations in the
Standard Universe
› Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not:
Call fork()
Use kernel threads
Use some forms of IPC, such as pipes and shared memory
› Many typical scientific jobs are OK
When will Condor
checkpoint your job?
› Periodically, if desired
• For fault tolerance
› To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority)
• Preemptive-resume scheduling
› When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command
Administering Condor
› Condor provides extensive
configuration files
One per pool, one per machine, or
anything in between
› Extensive documentation
Online manual
Heavily commented sample configuration
file
Policy Configuration
(Boss Fat Cat)
I am adding nodes to
the Cluster… but the
Chemistry Department
has priority on these
nodes.
The Machine (Startd)
Policy Expressions
START - When this machine is willing to start a job
RANK - Job preferences
SUSPEND - When to suspend a job
CONTINUE - When to continue a suspended job
PREEMPT - When to nicely stop running a job
KILL - When to immediately kill a preempting job
Frieda’s Current Settings
START = True
RANK =
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Frieda’s New Settings for the Chemistry nodes
START = True
RANK = Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Submit file with Custom
Attribute
Executable = chem-job
Universe = standard
+Department = "Chemistry"
queue
What if “Department” not
specified?
START = True
RANK = Department =!= UNDEFINED && Department == "Chemistry"
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
Another example
START = True
RANK = Department =!= UNDEFINED &&
       ((Department == "Chemistry")*2 + (Department == "Physics"))
SUSPEND = False
CONTINUE =
PREEMPT = False
KILL = False
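The intent of this RANK expression, as read from the slide, is that Chemistry jobs are preferred over Physics jobs, and both are preferred over everything else. The Python sketch below models only that intended ordering; it does not reproduce ClassAd evaluation rules, and the `rank` function and the sample job dictionaries are hypothetical.

```python
def rank(job_ad):
    """Model the intended preference: Chemistry = 2, Physics = 1, otherwise 0.

    A missing "Department" key stands in for an UNDEFINED attribute.
    """
    department = job_ad.get("Department")
    if department is None:
        return 0
    # Python booleans act as 0 or 1 in arithmetic, like the slide's expression.
    return (department == "Chemistry") * 2 + (department == "Physics")

# Hypothetical jobs competing for the same machine.
candidates = [
    {"name": "no-dept"},
    {"name": "phys-job", "Department": "Physics"},
    {"name": "chem-job", "Department": "Chemistry"},
]

for job in candidates:
    print(job["name"], rank(job))

preferred = max(candidates, key=rank)
print(preferred["name"])  # chem-job
```

Because START is still True, jobs with no Department can run on these nodes; RANK only decides which job the machine prefers when several are eligible.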
Policy Configuration, cont
(Boss Fat Cat)
The Cluster is fine.
But not the desktop
machines. Condor can
only use the desktops
when they would
otherwise be idle.
So Frieda decides she
wants the desktops to:
› START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low
› SUSPEND jobs as soon as activity is detected
› PREEMPT jobs if the activity continues for 5 minutes or more
› KILL jobs if they take more than 5 minutes to preempt
Macros in the Config File
NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
BackgroundLoad   = 0.3
HighLoad         = 0.5
KeyboardBusy     = (KeyboardIdle < 10)
CPU_Busy         = ($(NonCondorLoadAvg) >= $(HighLoad))
CPU_Idle         = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
MachineBusy      = ($(CPU_Busy) || $(KeyboardBusy))
ActivityTimer    = (CurrentTime - EnteredCurrentActivity)
Desktop Machine Policy
START    = $(CPU_Idle) && KeyboardIdle > 300
SUSPEND  = $(MachineBusy)
CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
PREEMPT  = (Activity == "Suspended") && $(ActivityTimer) > 300
KILL     = $(ActivityTimer) > 300
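To see how these expressions behave, the sketch below evaluates them in Python against a few made-up machine snapshots. It is a rough model, not Condor: the attribute values are invented, and the CPU_Busy test (non-Condor load at or above HighLoad) is an assumption, since the slides use that macro without defining it.

```python
# Thresholds copied from the config macros on the slide.
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5

def non_condor_load(m):
    return m["LoadAvg"] - m["CondorLoadAvg"]

def cpu_idle(m):
    return non_condor_load(m) <= BACKGROUND_LOAD

def cpu_busy(m):
    # Assumption: CPU_Busy is not defined on the slides; HighLoad is used here.
    return non_condor_load(m) >= HIGH_LOAD

def machine_busy(m):
    return cpu_busy(m) or m["KeyboardIdle"] < 10

def activity_timer(m):
    return m["CurrentTime"] - m["EnteredCurrentActivity"]

POLICY = {
    "START": lambda m: cpu_idle(m) and m["KeyboardIdle"] > 300,
    "SUSPEND": machine_busy,
    "CONTINUE": lambda m: cpu_idle(m) and m["KeyboardIdle"] > 120,
    "PREEMPT": lambda m: m["Activity"] == "Suspended" and activity_timer(m) > 300,
    "KILL": lambda m: activity_timer(m) > 300,
}

def evaluate(machine):
    """Evaluate every policy expression against one machine snapshot."""
    return {name: bool(expr(machine)) for name, expr in POLICY.items()}

# Invented snapshots: times are in seconds.
idle_desktop = {"LoadAvg": 0.05, "CondorLoadAvg": 0.0, "KeyboardIdle": 900,
                "Activity": "Idle", "CurrentTime": 1000, "EnteredCurrentActivity": 990}
owner_returns = {"LoadAvg": 1.2, "CondorLoadAvg": 1.0, "KeyboardIdle": 2,
                 "Activity": "Busy", "CurrentTime": 2000, "EnteredCurrentActivity": 1990}
suspended_long = {"LoadAvg": 0.9, "CondorLoadAvg": 0.0, "KeyboardIdle": 1,
                  "Activity": "Suspended", "CurrentTime": 3000, "EnteredCurrentActivity": 2600}

for name, snapshot in [("idle_desktop", idle_desktop),
                       ("owner_returns", owner_returns),
                       ("suspended_long", suspended_long)]:
    print(name, evaluate(snapshot))
```

The sketch evaluates all five expressions at once for simplicity, whereas Condor consults each one only in the state where it applies (for example, KILL matters only while a job is being preempted).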
Policy Review
› Users submitting jobs can specify Requirements and Rank expressions
› Administrators can specify Startd Policy expressions individually for each machine (Start, Suspend, etc.)
› Expressions can use any job or machine ClassAd attribute
› Custom attributes easily added
› Bottom Line: Enforce almost any policy!
Administrator Commands
› condor_vacate      Leave a machine now
› condor_on          Start Condor
› condor_off         Stop Condor
› condor_reconfig    Reconfig on-the-fly
› condor_config_val  View/set config
› condor_userprio    User Priorities
› condor_stats       View detailed usage accounting stats
Questions So Far?
Security in Condor
› Since version 6.3.3, Condor has greatly
improved security
› Multiple authentication methods:
X509 (Using GSI)
Kerberos
Filesystem (shared filesystem, known user)
› Encryption:
3DES
Blowfish
Security in Condor
› Authentication
Based on users, with optional wildcards
• [email protected]
• *@cs.wisc.edu
Users can be given different permissions:
• Read
• Write
• Administrator
• Config
Version Numbers in Condor
› Odd minor numbers are development
releases:
6.3.1, 6.3.2, 6.5.0…
Compatibility not guaranteed within a
series, like 6.3.x.
› Even minor numbers are stable
releases
6.2.2, 6.4.0, 6.4.1…
Compatibility guaranteed within a series,
like 6.4.x.
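The numbering rule can be checked mechanically. The following Python sketch classifies a version string using only the odd/even rule stated above; `is_stable` and `same_series` are hypothetical helper names, and the sketch assumes versions always have the three-part form major.minor.patch.

```python
def parse_version(version):
    """Split 'major.minor.patch' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def is_stable(version):
    """Even minor number means a stable release; odd means development."""
    _, minor, _ = parse_version(version)
    return minor % 2 == 0

def same_series(a, b):
    """Two versions are in the same series if major and minor agree (e.g. 6.4.x)."""
    return parse_version(a)[:2] == parse_version(b)[:2]

def compatibility_guaranteed(a, b):
    """Per the slide: guaranteed only within a stable series."""
    return same_series(a, b) and is_stable(a)

for v in ["6.2.2", "6.3.1", "6.4.0", "6.5.0"]:
    print(v, "stable" if is_stable(v) else "development")

print(compatibility_guaranteed("6.4.0", "6.4.1"))  # True
print(compatibility_guaranteed("6.3.1", "6.3.2"))  # False
```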
Questions?
Comments?
› Web: www.cs.wisc.edu/condor
› Email: [email protected]