
Installing and Managing a Large Condor Pool
Derek Wright
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
www.cs.wisc.edu/condor
Talk Outline
► What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
What is Condor?
► A system of daemons and tools that harness desktop machines and commodity computing resources for High Throughput Computing
• Large numbers of jobs over long periods of time
• Not High Performance Computing, which is short bursts of lots of compute power
What is Condor? (Cont’d)
► Condor matches jobs with available machines using “ClassAds”
• “Available machines” can be:
– Idle desktop workstations
– Dedicated clusters
– SMP machines
► Can also provide checkpointing and process migration (if you re-link your application against our library)
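The re-link is done with the condor_compile wrapper around your usual link command; jobs built this way run in Condor’s “standard” universe and can checkpoint and migrate. A minimal sketch (the program name is just an example):

   % condor_compile gcc -o myprog myprog.c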
What’s Condor Good For?
► Managing a large number of jobs
• You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
• Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
• Condor can handle inter-job dependencies (DAGMan)
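As a sketch, the submit description file for a big batch of jobs might look like the following (the names and paths are illustrative only):

   universe   = vanilla
   executable = my_analysis
   arguments  = -run $(Process)
   output     = out.$(Process)
   error      = err.$(Process)
   log        = jobs.log
   queue 1000

Running condor_submit on this file queues 1,000 jobs, numbered by $(Process), all tracked in a single log file.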
What’s Condor Good For? (cont’d)
► Managing a large number of machines
• Condor daemons run on all the machines in your pool and are constantly monitoring machine state
• You can query Condor for information about your machines
• Condor handles all background jobs in your pool with minimal impact on your machine owners
Why is Condor Good for Large Clusters?
► Fault-Tolerance at all levels of Condor
• Even “dedicated” resources should be treated like they might disappear at any minute (Condor has been doing this since 1985… we’ve got a lot of experience)
• Checkpointing jobs (when possible) makes scheduling a lot easier, and ensures forward progress
► Eases monitoring
Condor on Large Clusters (cont’d)
► Manages ALL your resources and jobs under one system
• Easier for users and administrators
► Easy to install and use
• No queues to configure or choose from
► It’s developed by former system administrators (all the full-time staff)
► It’s free (that scales really well)
What is a Condor Pool?
► A “pool” can be a single machine or a group of machines
► Determined by a “central manager” - the matchmaker and centralized information repository
► Each machine runs various daemons to provide different services to the users who submit jobs, the machine owners, or the pool itself
Talk Outline
• What is Condor and why is it good for large clusters?
► The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
The Condor Daemons
• condor_master – Administrator Agent
• condor_collector – Centralized Repository of ClassAds
• condor_negotiator – Performs Matchmaking
• condor_startd – Resource Agent (Machine)
• condor_schedd – User Agent (Jobs)
• condor_starter – Monitors/Manages a Job Process
• condor_shadow – Handles Remote System Calls, Intra-Job Resource Management
• condor_dagman – Manages Inter-Job Dependencies
• condor_eventd – Pool-Wide Events
Layout of a Personal Condor Pool
[Diagram: a Personal Condor pool is a single machine acting as its own Central Manager. The master spawns the startd, schedd, negotiator, and collector, which communicate by exchanging ClassAds.]
Layout of a General Condor Pool
[Diagram: a general Condor pool with several kinds of nodes. Every machine runs a master; Submit-Only machines add a schedd, Execute-Only machines add a startd, and Regular Nodes run both a schedd and a startd. The Central Manager runs the negotiator and collector. Arrows show processes spawned and ClassAd communication pathways.]
condor_master daemon
► Starts up all other Condor daemons
► If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
► Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
condor_master (cont’d)
► Provides access to many remote administration commands:
• condor_reconfig
• condor_restart, condor_off, condor_on
► Default server for many other commands:
• condor_config_val, etc.
► Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)
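For example, an administrator with the right access can control any machine in the pool remotely (the host name below is illustrative):

   % condor_reconfig -name node42    # re-read the config files on node42
   % condor_off -name node42         # shut down the Condor daemons on node42
   % condor_on -name node42          # start them back up
   % condor_restart -name node42     # restart the daemons, e.g. after a binary upgrade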
condor_collector
► Collects information from all other Condor daemons in the pool
► Each daemon sends a periodic update called a “ClassAd” to the collector
► Services queries for information:
• Queries from other Condor daemons
• Queries from users (condor_status)
► Can store historical pool data
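For example, condor_status queries the collector, and a ClassAd constraint can narrow the output to just the machines you care about (the host name and expression below are only examples):

   % condor_status                       # one line per machine in the pool
   % condor_status -long node42          # the full ClassAd for one machine
   % condor_status -constraint 'Arch == "INTEL" && OpSys == "LINUX" && State == "Unclaimed"'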
condor_eventd
► Administrators specify events in a config file (similar to a crontab, but not exactly):
• Date and time
• What kind of event (currently, only “shutdown” is supported)
• What machines the event affects (ClassAd constraint)
condor_eventd (cont’d)
► When an event is approaching, the EventD will wake up and query the condor_collector for all machines that match the constraint
► The EventD then knows how big all the jobs currently running on the affected nodes are, the network bandwidth to the nearest checkpoint servers, etc.
► The EventD plans evictions to allow the most computation without flooding the network
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
► A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
• Future work
Large Condor Pools in HEP and Government Research
► UW-Madison CS (~750 nodes)
► INFN (~270 nodes)
► CERN/Chorus (~100 nodes)
► NASA Ames (~330 nodes)
► NCSA (~200 nodes)
Layout of the UW-Madison Pool
[Diagram: the UW-Madison pool. A Central Manager and EventD sit at the center; a Dedicated Scheduler fronts a dedicated Linux cluster (~200 CPUs); desktop workstations (~325 CPUs) and instructional computer labs (~225 CPUs) make up the rest of the pool; two checkpoint servers store checkpoints; submit-only machines at other sites feed in jobs; and the pool flocks to other pools.]
Composition of the UW/CS Cluster
► Current cluster: 100 dual Xeon 550MHz with 1 gig of RAM (tower cases)
► New nodes being installed: 150 dual 933MHz Pentium III, 36 nodes w/ 2 gigs of RAM, the rest w/ 1 gig (2U racks)
► 100 Mbit switched Ethernet to the nodes
► Gigabit Ethernet to the file servers and checkpoint server
Composition of the rest of the UW/CS Pool
► Instructional Labs
• 60 Intel/Linux
• 60 Sparc/Solaris
• 105 Intel/NT
► “Desktop Workstations”
• Includes 12- and 8-way Ultra E6000s, other SMPs, and real desktops, etc.
► Central Manager - 600MHz Pentium III running Solaris, 512 Megs RAM
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
► Some other features of Condor that help for big pools
• Future work
Condor’s Configuration
► Condor’s configuration is a concatenation of multiple files, read in order - definitions in later files override earlier definitions
► Layout and purpose of the different files:
• Global config file
• Other shared files
• Local config file
Global Config File
► All shared settings across your entire pool
► Found either in the file pointed to by the CONDOR_CONFIG environment variable, in /etc/condor/condor_config, or in the home directory of the “condor” user
► Most settings can be in this file
► Only works as a “global” file if it is on a shared file system (HIGHLY recommended for large sites!)
Other shared files
► You can configure a number of other shared config files:
• files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later)
• platform-specific config files
Local config file
► Any machine-specific settings
• local policy settings for a given owner
• different daemons to run (for example, on the Central Manager)
► Can either be on the local disk of each machine, or in separate files in a shared directory, each named by hostname
► For large sites: keep them all on AFS or NFS, and in CVS, if possible
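A common arrangement for a large site is to keep everything on the shared file system and have the global file point each machine at its own per-host file. A sketch (the paths are illustrative; LOCAL_CONFIG_FILE, $(HOSTNAME), and DAEMON_LIST are standard config macros):

   ## in the shared global config file
   LOCAL_CONFIG_FILE = /afs/cs.wisc.edu/condor/etc/hosts/$(HOSTNAME).local

   ## in a per-host file, e.g. hosts/condor-cm.local for the Central Manager
   DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR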
Daemon-specific configuration
► You can also change daemon-specific settings with condor_config_val
► Use the “-set” option for persistent changes, or “-rset” for memory-resident-only changes
► Used by the EventD
► Can be used by other entities for various remote-administration tasks
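A sketch of the syntax (the host name and setting are examples only; the change also requires the right host-based permissions, and the daemon picks it up on the next condor_reconfig):

   % condor_config_val -name node42 -startd -set "START = False"     # survives a restart
   % condor_config_val -name node42 -startd -rset "START = False"    # memory-resident only
   % condor_reconfig -name node42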
Advertising Your Own Attributes in the Machine ClassAd
► Add new macro(s) to the config file
• This is usually done in the local config file
• Can name the macros anything, so long as the names don’t conflict with existing ones
► Tell the condor_startd to include these other macros in the ClassAd it sends out
• Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)
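For example, in a machine’s local config file (the attribute names here are made up; STARTD_EXPRS is the real setting):

   INSTRUCTIONAL_LAB = True
   NETWORK_SPEED_MBPS = 100
   STARTD_EXPRS = INSTRUCTIONAL_LAB, NETWORK_SPEED_MBPS

The new attributes then appear in that machine’s ClassAd and can be used in job requirements or in queries such as condor_status -constraint INSTRUCTIONAL_LAB.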
Host/IP Security in Condor
► You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
• “read” access - querying information
– condor_status, condor_q, etc.
• “write” access - updating information
– condor_submit, adding a node to the pool, etc.
• “administrator” access
– condor_on, off, reconfig, restart...
• “owner” access
– Things a machine owner can do (vacate)
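As a sketch, these access levels map to the HOSTALLOW_*/HOSTDENY_* macros in the config file (the host names and domains below are examples):

   HOSTALLOW_READ          = *.cs.wisc.edu
   HOSTALLOW_WRITE         = *.cs.wisc.edu
   HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
   HOSTALLOW_OWNER         = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)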
The Different Versions of Condor
► We distribute two versions of Condor:
• Stable Series
– Heavily tested, recommended for use
– 2nd number of version string is even (6.2.0)
• Development Series
– Latest features, not necessarily well-tested
– 2nd number of version string is odd (6.3.0)
– Not recommended unless you know what you are doing and/or need a new feature
Condor Versions (cont’d)
► All daemons advertise a CondorVersion attribute in the ClassAd they publish
► You can also view the version string by running ident on any Condor binary
► In general, all parts of Condor on a single machine should run the same version
► Machines in a pool can usually run different versions and communicate with each other
► It will be made very clear when a version is incompatible with older versions
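For example (the install path is illustrative):

   % ident /usr/local/condor/sbin/condor_startd              # version string embedded in the binary
   % condor_status -schedd -long | grep CondorVersion        # version advertised in the ClassAds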
Talk Outline
• What is Condor and why is it good for large clusters?
• The Condor Daemons (the sys admin view)
• A look at the UW-Madison Computer Science Condor Pool and Cluster
• Some other features of Condor that help for big pools
► Future work
Future Work
► User Authentication and Authorization
• Have Kerberos and X.509 authentication in beta mode already
• Will integrate w/ Condor tools to get rid of Host/IP authorization and move to user-based authorization
• Will enable encrypted channels to securely move data (including AFS tokens)
Future Work (cont’d)
► Digitally Signed Binaries
• The Condor Team will digitally sign the binaries we release
• condor_master will only spawn new daemons if they are properly signed
► More interesting dedicated scheduling
► Condor RPMs
► Addressing scalability
Obtaining Condor
► Condor can be downloaded from the Condor web site at: http://www.cs.wisc.edu/condor
► Complete Users and Administrators manual available at: http://www.cs.wisc.edu/condor/manual
► Contracted support is available
► Questions? Email: [email protected]