Using and Administering Condor
Alain Roy
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
24-June-2002

Good evening!
› Thank you for having me!
› I am:
    Alain Roy
    Computer Science Ph.D. in Quality of Service, with the Globus Project
    Working with the Condor Project

www.cs.wisc.edu/condor

Condor Tutorials Remaining
› Monday (Today) 17:00-19:00
    Using and administering Condor
› Tuesday 17:00-19:00
    Using Condor on the Grid

Review: What is Condor?
› Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility.
    Run lots of jobs over a long period of time,
    not a short burst of “high-performance”
› Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

Condor Takes Care of You
› Condor does whatever it takes to run your jobs, even if some machines…
    Crash (or are disconnected)
    Run out of disk space
    Don’t have your software installed
    Are frequently needed by others
    Are far away & managed by someone else

What is Unique about Condor?
› ClassAds
› Transparent checkpoint/restart
› Remote system calls
› Works in heterogeneous clusters
› Clusters can be:
    Dedicated
    Opportunistic

What’s Condor Good For?
› Managing a large number of jobs
    You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
    Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.
    Condor can handle inter-job dependencies (DAGMan)

What’s Condor Good For?
(cont’d)
› Robustness
    Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion
    If an execute machine crashes, you only lose work done since the last checkpoint
    Condor maintains a persistent job queue: if the submit machine crashes, Condor will recover
    (Story)

What’s Condor Good For? (cont’d)
› Giving your job the agility to access more computing resources
    Checkpointing allows your job to run on “opportunistic resources” (not dedicated)
    Checkpointing also provides “migration”: if a machine is no longer available, move!
    With remote system calls, run on systems which do not share a filesystem. You don’t even need an account on a machine where your job executes

Other Condor features
› Implement your policy on when the jobs can run on your workstation
› Implement your policy on the execution order of the jobs
› Keep a log of your job activities

A Condor Pool In Action

A Bit of Condor Philosophy
› Condor brings more computing to everyone
    A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done.
    A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

The Idea
Computing power is everywhere; we try to make it usable by anyone.

Remember Frieda?
Today we’ll revisit Frieda’s Condor explorations in more depth

I have 600 simulations to run. Where can I get help?

Install a Personal Condor!

Installing Condor
› Download Condor for your operating system
› Available as a free download from http://www.cs.wisc.edu/condor
› Available for most Unix platforms and Windows NT

So Frieda Installs Personal Condor on her machine…
› What do we mean by a “Personal” Condor?
    Condor on your own workstation, no root access required, no system administrator intervention needed; easy to set up.

Personal Condor?!
What’s the benefit of a Condor “Pool” with just one user and one machine?

Your Personal Condor will ...
› Keep an eye on your jobs and will keep you posted on their progress
› Keep a log of your job activities
› Add fault tolerance to your jobs
› Implement your policy on when the jobs can run on your workstation

What’s in a Personal Condor?
› Everything that is in Condor, just one machine.
› Condor daemons:
    condor_master
    condor_collector: Stores ClassAds for jobs, machines
    condor_negotiator: Matchmaking
    condor_schedd: Submits, monitors jobs
    condor_startd: Starts jobs
    condor_starter: Launches a job
    condor_shadow: Monitors remote job

A Condor Pool of One
[Figure: one machine running condor_master, condor_negotiator, condor_schedd, condor_collector, condor_shadow, condor_startd, condor_starter, and a Condor job]

condor_master
› Starts up all other Condor daemons
› If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
› Checks the time stamps on the binaries of the other Condor daemons, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version

condor_master (cont’d)
› Acts as the server for many Condor remote administration commands:
    condor_reconfig, condor_restart, condor_off, condor_on, condor_config_val, etc.
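The restart-on-exit behaviour described for condor_master can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how condor_master is implemented: `supervise`, `max_restarts`, and `delay` are hypothetical names, the child process is a stand-in command rather than a real Condor daemon, and the email notification and binary time-stamp checks are left out.

```python
import subprocess
import sys
import time


def supervise(argv, max_restarts=3, delay=0.1):
    """Run argv and restart it each time it exits, up to max_restarts times.

    Returns the exit codes seen, in order. A real master would also notify
    the administrator and watch for new binaries; both are omitted here.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        if attempt:
            time.sleep(delay)  # brief back-off before each restart
        # subprocess.call blocks until the child exits, then returns its status
        exit_codes.append(subprocess.call(argv))
    return exit_codes


if __name__ == "__main__":
    # Hypothetical stand-in "daemon": a one-liner that exits with status 1.
    child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(child, max_restarts=2))
```

The loop is bounded on purpose so the sketch terminates; a long-running supervisor would more plausibly restart indefinitely with an increasing back-off.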
condor_startd
› Represents a machine to the Condor system
› Responsible for starting, suspending, and stopping jobs
› Enforces the wishes of the machine owner (the owner’s “policy”… more on this soon)

condor_schedd
› Represents users to the Condor system
› Maintains the persistent queue of jobs
› Responsible for contacting available machines and sending them jobs
› Services user commands which manipulate the job queue:
    condor_submit, condor_rm, condor_q, condor_hold, condor_release, condor_prio, …

condor_collector
› Collects information from all other Condor daemons in the pool
    “Directory Service” / Database for a Condor pool
› Each daemon sends a periodic update called a “ClassAd” to the collector
› Services queries for information:
    Queries from other Condor daemons
    Queries from users (condor_status)

condor_negotiator
› Performs “matchmaking” in Condor
› Gets information from the collector about all available machines and all idle jobs
› Tries to match jobs with machines that will serve them
› Both the job and the machine must satisfy each other’s requirements

Frieda wants more…
› She decides to use the graduate students’ computers when the students aren’t using them, and get done sooner.
› In exchange, they can use the Condor pool too.

Frieda’s Condor pool…
    Frieda’s Computer: Central Manager
    Graduate Student’s Desktop Computers

A larger Condor pool
    Submitter: condor_master, condor_schedd, condor_shadow
    Collector: condor_master, condor_negotiator, condor_collector
    Executor: condor_master, condor_startd, condor_starter, Condor job
    Submitter/Executor: condor_master, condor_schedd, condor_startd, condor_shadow, condor_starter, Condor job

Happy Day! Frieda’s organization purchased a Beowulf Cluster!
› Other scientists in her department have realized the power of Condor and want to share it.
› The Beowulf cluster and the graduate student computers can be part of a single Condor pool.

Frieda’s Condor pool…
    Graduate Student’s Desktop Computers
    Central Manager
    Beowulf Cluster

How would you set it up?
› Grad student machines:
    Submitters
    Executors
› Beowulf cluster machines
    Executors only
› Independent machine for collector/negotiator
    Big job: take it away from Frieda’s computer
    Could split collector and negotiator

Frieda collaborates…
› She wants to share her Condor pool with scientists from another lab.

Condor Flocking
› Condor pools can work cooperatively

How would you set it up?
› Two independent pools
    Each has its own collector/negotiator
› Set up flocking from one pool to another: by machine, or by pool.
    FLOCK_TO <machine>
    FLOCK_FROM <machine>
› Can be uni- or bi-directional

Questions So Far?

How do you run a job?
› It doesn’t matter if you have:
    Personal Condor
    Large Condor pool
    Condor pool with flocking
› Four steps
    1. Write program
    2. Write submit file
    3. Give it to Condor
    4. Condor gives you the results

Step 1: Writing a program
› Condor has universes
    Vanilla Universe:
        • Run anything
        • Less capable
    Java Universe: Works better for Java
    Standard Universe:
        • Checkpointing
        • Remote I/O
        • Can’t work with all programs

Step 1: Vanilla Universe
› You can run any program
    C/C++/Perl/Python/Fortran/Java/Lisp…
    No checkpointing: if your job is interrupted or the machine crashes, Condor has to restart it from the beginning.
    Can do anything you could do if you were logged in.
Step 1: Java Universe
› Works better for Java programs
› Checks for valid Java environment
› Distinguishes Java environment exceptions from program exceptions (wrapper program)
› No checkpointing (it could happen though)
› Remote I/O

Step 1: Standard Universe
› Requires re-linking your program
    condor_compile gcc -o simple simple.o
› Allows checkpointing and remote I/O
› Restrictions on behavior
    No threading
    Limited networking
    Restrictions on compiler used

Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This assumes a shared filesystem

Step 2: Write submit file
    Executable = simple
    Universe = vanilla
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Transfer_input_files = data.in
    Transfer_output_files = data.out
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem

Step 2: Write submit file
    Executable = simple
    Universe = standard
    Arguments = First
    Log = simple.log
    Output = simple.output
    Error = simple.error
    Requirements = Memory > 512
    Queue
Note: This does not assume a shared filesystem; it relies on remote I/O

Step 2: Submit Files
› Condor is helpful: it expands what you wrote into the full Requirements expression:
    Requirements = memory > 512
    becomes…
    Requirements = (OpSys == "Linux") && (memory > 512) && <shared-filesystem>…
› Queue can take a parameter (more later)
› A single file can submit many jobs

Step 3: Give it to Condor
› condor_submit submit.desc
› condor_q
    -- Submitter: dsonokwa.cs.wisc.edu : <128.105.175.130:36280> : dsonokwa.cs.wisc.edu
     ID    OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
     5.0   roy    6/15 20:51  0+00:00:02 R  0   0.0  simple First
    1 jobs; 0 idle, 1 running, 0 held

Step 4: Condor gives it back
› The program’s output is where you
asked it to be.
› Condor left a log file documenting what it did.
› Condor optionally sends you an email telling you it’s done.

Step 4: Condor gives it back
    000 (34364.000.000) 06/15 21:00:01 Job submitted from host: <128.105.146.14:34918>
    001 (34364.000.000) 06/15 21:00:01 Job executing on host: <128.105.146.36:34918>
    005 (34364.000.000) 06/15 21:00:06 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage

Step 4: Condor gives it back
    Date: Sat, 15 Jun 2002 21:00:06 -0500 (CDT)
    From: Condor Project <[email protected]>
    Message-Id: <[email protected]>
    To: [email protected]
    Subject: [Condor] Condor Job 34364.0

    This is an automated email from the Condor system on machine "beak.cs.wisc.edu". Do not reply.
    Your condor job exited with status 0.
    Job: /scratch/roy/condor/simple/simple First

Clusters and Processes
› If your submit file describes multiple jobs, we call this a “cluster”.
› Each job within a cluster is called a “process” or “proc”.
› If you only specify one job, you still get a cluster, but it has only one process.
› A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)
› Process numbers always start at 0.

Example Submit Description File for a Cluster
    # Example condor_submit input file that defines
    # a whole cluster of jobs at once
    Universe = standard
    Executable = simple
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/roy/condor/run.$(Process)
    Queue 500

Questions So Far?

condor_q
› Find out status of your jobs, from your condor_schedd.
› condor_q cluster: all jobs in a cluster
› condor_q cluster.proc: particular job
› condor_q -sub name: jobs for a particular user

Temporarily halt a Job
› Use condor_hold to place a job on hold
    Kills job if currently running
    Will not attempt to restart job until released
› Use condor_release to remove a hold and permit job to be scheduled again

condor_rm
› You submitted a job, but you want to cancel it
› condor_rm clusterid
    condor_rm 6: all jobs in cluster
› condor_rm clusterid.procid
    condor_rm 6.3: specific job
› condor_rm -all: all of your jobs
› Can only remove your jobs
› Reflected in job log

condor_status
› Find status of pool from condor_collector (simplified view here)
    Name           OpSys  Arch   State      Activity
    carmi.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    coral.cs.wisc  LINUX  INTEL  Unclaimed  Idle
    doc.cs.wisc.e  LINUX  INTEL  Unclaimed  Idle
    dsonokwa.cs.w  LINUX  INTEL  Unclaimed  Idle
    ...
               Machines  Owner  Claimed  Unclaimed
    LINUX            12      2        0         10
    SOLARIS28         5      0        0          5
    Total            17      2        0         15

condor_status
› condor_status -run: which machines are running jobs
› condor_status -sub: whose jobs are running?
› condor_status -constraint: restrict to showing subset as defined by user

DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)

What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies.
› Each job is a “node” in the DAG.
› Each node can have any number of “parent” or “children” nodes, as long as there are no loops!
[Figure: DAG with nodes Job A, Job B, Job C, Job D]

Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
[Figure: DAG with nodes Job A, Job B, Job C, Job D]
› Each node will run the Condor job specified by its accompanying Condor submit file

Submitting a DAG
› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon, which begins running your jobs:
    % condor_submit_dag diamond.dag
› condor_submit_dag submits a Scheduler Universe Job with DAGMan as the executable.
› Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Running a DAG
› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAG nodes A, B, C, D; DAGMan with the .dag file; Condor job queue holding A]

Running a DAG (cont’d)
› DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAG nodes A, B, C, D; DAGMan; Condor job queue holding B and C]

Running a DAG (cont’d)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: DAG nodes A, B, X (failed), D; DAGMan with a rescue file; Condor job queue empty]

Recovering a DAG
› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAG nodes A, B, C, D; DAGMan with the rescue file; Condor job queue holding C]

Recovering a DAG (cont’d)
› Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAG nodes A, B, C, D; DAGMan; Condor job queue holding D]

Finishing a DAG
› Once the DAG is complete, the DAGMan job itself is finished, and exits.
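The ordering that the DAG slides walk through (A first, then B and C, then D) can be reproduced with a small Python sketch. It is an illustration of the dependency rule only, not DAGMan's implementation: `parse_dag` and `ready_nodes` are hypothetical helper names, and the parser handles just the `Job` and `Parent ... Child ...` lines shown in diamond.dag.

```python
def parse_dag(text):
    """Return (jobs, parents) from 'Job' and 'Parent ... Child ...' lines."""
    jobs = {}     # node name -> submit file
    parents = {}  # node name -> set of parent node names
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return jobs, parents


def ready_nodes(parents, done):
    """Nodes not yet done whose parents have all completed successfully."""
    return sorted(n for n, ps in parents.items() if n not in done and ps <= done)


DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, parents = parse_dag(DIAMOND)
print(ready_nodes(parents, set()))            # ['A']
print(ready_nodes(parents, {"A"}))            # ['B', 'C']
print(ready_nodes(parents, {"A", "B"}))       # ['C']
print(ready_nodes(parents, {"A", "B", "C"}))  # ['D']
```

The third call mirrors the recovery slides: with A and B finished and C not yet successful, only C is eligible, and D stays blocked until C completes.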
[Figure: DAG nodes A, B, C, D; DAGMan; Condor job queue empty]

Additional DAGMan Features
› Provides other handy features for job management…
    nodes can have PRE & POST scripts
    failed nodes can be automatically retried a configurable number of times
    job submission can be “throttled”

Questions So Far?

What if each job needed to run for 20 days?
What if I wanted to interrupt a job with a higher priority job?

Condor’s Standard Universe to the rescue!
› Condor can support various combinations of features/environments in different “Universes”
› Different Universes provide different functionality for your job:
    Vanilla: runs any serial job
    Java: well suited for Java programs
    Standard: support for transparent process checkpoint and restart

Process Checkpointing
› Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file
    Memory, CPU, I/O, etc.
› The process can then be restarted from right where it left off
› Typically no changes to your job’s source code needed; however, your job must be relinked with Condor’s Standard Universe support library

Linking for Standard Universe
To do this, just place “condor_compile” in front of the command you normally use to link your job:
    condor_compile gcc -o myjob myjob.c
OR
    condor_compile f77 -o myjob filea.f fileb.f

Limitations in the Standard Universe
› Condor’s checkpointing is not at the kernel level. Thus in the Standard Universe the job may not
    Call fork()
    Use kernel threads
    Use some forms of IPC, such as pipes and shared memory
› Many typical scientific jobs are OK

When will Condor checkpoint your job?
› Periodically, if desired
    For fault tolerance
› To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority)
    Preemptive-resume scheduling
› When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command

Administering Condor
› Condor provides extensive configuration files
    One per pool, one per machine, or anything in between
› Extensive documentation
    Online manual
    Heavily commented sample configuration file

Policy Configuration
(Boss Fat Cat) I am adding nodes to the Cluster… but the Chemistry Department has priority on these nodes.

The Machine (Startd) Policy Expressions
    START: When is this machine willing to start a job
    RANK: Job preferences
    SUSPEND: When to suspend a job
    CONTINUE: When to continue a suspended job
    PREEMPT: When to nicely stop running a job
    KILL: When to immediately kill a preempting job

Frieda’s Current Settings
    START = True
    RANK =
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

Frieda’s New Settings for the Chemistry nodes
    START = True
    RANK = Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

Submit file with Custom Attribute
    Executable = chem-job
    Universe = standard
    +Department = "Chemistry"
    queue

What if “Department” is not specified?
    START = True
    RANK = Department =!= UNDEFINED && Department == "Chemistry"
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

Another example
    START = True
    RANK = Department =!= UNDEFINED && ((Department == "Chemistry")*2 + (Department == "Physics"))
    SUSPEND = False
    CONTINUE =
    PREEMPT = False
    KILL = False

Policy Configuration, cont
(Boss Fat Cat) The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle.
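Why the RANK expressions above add the `Department =!= UNDEFINED` guard is easier to see with a toy model of three-valued logic. The Python sketch below is a simplified stand-in, not the ClassAd evaluator: `UNDEFINED` is modelled as `None`, the helper names are hypothetical, and details such as ClassAd type rules and string case handling are ignored.

```python
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def eq(a, b):
    """Model of '==': the result is UNDEFINED if either operand is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def is_not_identical(a, b):
    """Model of '=!=': always True or False, never UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is not b
    return a != b


def and3(a, b):
    """Three-valued AND: False wins, then UNDEFINED, otherwise True."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def unguarded(job):
    """Department == "Chemistry" with no guard."""
    return eq(job.get("Department"), "Chemistry")


def guarded(job):
    """Department =!= UNDEFINED && Department == "Chemistry"."""
    dept = job.get("Department")
    return and3(is_not_identical(dept, UNDEFINED), eq(dept, "Chemistry"))


for job in ({"Department": "Chemistry"}, {"Department": "Physics"}, {}):
    print(job, unguarded(job), guarded(job))
```

For a job ad with no Department attribute, the unguarded form evaluates to UNDEFINED in this model, while the guarded form evaluates to a definite False, which is the point of the extra clause.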
So Frieda decides she wants the desktops to:
› START jobs when there has been no activity on the keyboard/mouse for 5 minutes and the load average is low
› SUSPEND jobs as soon as activity is detected
› PREEMPT jobs if the activity continues for 5 minutes or more
› KILL jobs if they take more than 5 minutes to preempt

Macros in the Config File
    NonCondorLoadAvg = (LoadAvg - CondorLoadAvg)
    BackgroundLoad = 0.3
    HighLoad = 0.5
    KeyboardBusy = (KeyboardIdle < 10)
    CPU_Idle = ($(NonCondorLoadAvg) <= $(BackgroundLoad))
    CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad))
    MachineBusy = ($(CPU_Busy) || $(KeyboardBusy))
    ActivityTimer = (CurrentTime - EnteredCurrentActivity)

Desktop Machine Policy
    START = $(CPU_Idle) && KeyboardIdle > 300
    SUSPEND = $(MachineBusy)
    CONTINUE = $(CPU_Idle) && KeyboardIdle > 120
    PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300
    KILL = $(ActivityTimer) > 300

Policy Review
› Users submitting jobs can specify Requirements and Rank expressions
› Administrators can specify Startd Policy expressions individually for each machine (Start, Suspend, etc.)
› Expressions can use any job or machine ClassAd attribute
› Custom attributes easily added
› Bottom Line: Enforce almost any policy!

Administrator Commands
› condor_vacate: Leave a machine now
› condor_on: Start Condor
› condor_off: Stop Condor
› condor_reconfig: Reconfig on-the-fly
› condor_config_val: View/set config
› condor_userprio: User priorities
› condor_stats: View detailed usage accounting stats

Questions So Far?
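The desktop policy from the macro and policy slides can be read as a handful of plain predicates over machine attributes. The Python sketch below restates them for illustration; it is not Condor code. The machine is a hypothetical dictionary keyed by the attribute names used in the macros, and `cpu_busy` is assumed to mean a non-Condor load at or above HighLoad.

```python
BACKGROUND_LOAD = 0.3
HIGH_LOAD = 0.5


def non_condor_load(m):
    return m["LoadAvg"] - m["CondorLoadAvg"]


def cpu_idle(m):
    return non_condor_load(m) <= BACKGROUND_LOAD


def cpu_busy(m):
    # Assumption: "busy" means load at or above HighLoad.
    return non_condor_load(m) >= HIGH_LOAD


def keyboard_busy(m):
    return m["KeyboardIdle"] < 10


def machine_busy(m):
    return cpu_busy(m) or keyboard_busy(m)


def activity_timer(m):
    return m["CurrentTime"] - m["EnteredCurrentActivity"]


def start(m):
    return cpu_idle(m) and m["KeyboardIdle"] > 300


def suspend(m):
    return machine_busy(m)


def resume(m):  # the CONTINUE expression; 'continue' is a Python keyword
    return cpu_idle(m) and m["KeyboardIdle"] > 120


def preempt(m):
    return m["Activity"] == "Suspended" and activity_timer(m) > 300


def kill(m):
    return activity_timer(m) > 300


# Hypothetical snapshots of one desktop at three moments.
idle_desktop = {"LoadAvg": 0.1, "CondorLoadAvg": 0.0, "KeyboardIdle": 600,
                "Activity": "Idle", "CurrentTime": 1000, "EnteredCurrentActivity": 900}
owner_returns = dict(idle_desktop, KeyboardIdle=2, Activity="Busy")
long_suspended = dict(owner_returns, Activity="Suspended",
                      CurrentTime=1400, EnteredCurrentActivity=1000)

print(start(idle_desktop), suspend(idle_desktop))    # True False
print(start(owner_returns), suspend(owner_returns))  # False True
print(preempt(long_suspended))                       # True
```

Times are in seconds, so the 300 thresholds correspond to the five-minute limits Frieda asked for.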
Security in Condor
› Since version 6.3.3, Condor has greatly improved security
› Multiple authentication methods:
    X509 (Using GSI)
    Kerberos
    Filesystem (shared filesystem, known user)
› Encryption:
    3DES
    Blowfish

Security in Condor
› Authentication
    Based on users, with optional wildcards
        • [email protected]
        • *@cs.wisc.edu
    Users can be given different permissions:
        • Read
        • Write
        • Administrator
        • Config

Version Numbers in Condor
› Odd minor numbers are development releases:
    6.3.1, 6.3.2, 6.5.0…
    Compatibility not guaranteed within a series, like 6.3.x.
› Even minor numbers are stable releases
    6.2.2, 6.4.0, 6.4.1…
    Compatibility guaranteed within a series, like 6.4.x.

Questions? Comments?
› Web: www.cs.wisc.edu/condor
› Email: [email protected]
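The odd/even rule from the Version Numbers slide is simple enough to check mechanically. The following Python sketch assumes version strings of the form major.minor.patch as shown on the slide; `release_series` is a hypothetical helper name, not part of Condor.

```python
def release_series(version):
    """Classify a version string such as '6.4.1' by its minor number."""
    major, minor, *_ = (int(part) for part in version.split("."))
    return "stable" if minor % 2 == 0 else "development"


for v in ("6.3.1", "6.3.2", "6.5.0", "6.2.2", "6.4.0", "6.4.1"):
    print(v, release_series(v))
```

With the versions listed on the slide, the 6.3.x and 6.5.x entries come out as development releases and the 6.2.x and 6.4.x entries as stable ones.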