|
Condor Primer
Bruno Miguel Tavares Gonçalves
The Condor Project was created to satisfy a current necessity in state
of the art computational research: the need to process large amounts
of information in an efficient manner. This is achieved by implementing
what is known in the literature as a High Throughput Computing (HTC)
environment, which delivers large amounts of computational power over
a long period of time (see footnote 1).
Condor makes the unused CPU cycles of various computers available
for general use in an efficient and transparent manner, thus
dramatically improving the use of already existing computational
resources without affecting their normal use. It also allows a number
of independent jobs to be scheduled together and run concurrently,
as fast as resource availability allows.
Before you can submit a job to Condor you need to write a submit script
that informs Condor of what is required to complete the computational
task. A simple submit script can look something like this:
-
- Executable = hello
Universe = vanilla
Output = hello.out.$(PROCESS)
Input = hello.in.$(PROCESS)
Error = hello.err.$(PROCESS)
Transfer_files = ALWAYS
Log = hello.log
Queue 3
The only necessary lines are the first two and the last one. The meaning
of each line is described below:
Executable = hello
- This line tells Condor what binary/script to run. Arguments shouldn't
be passed along with the binary; they go on a separate Arguments line.
Transfer_files = ALWAYS
- This indicates to Condor that it should send the files to the remote
machine. This allows your program to run on machines that don't have
all the necessary libraries installed. You should always include this
line, even though it is not mandatory.
Universe = vanilla
- This is where one would specify the proper universe to use. There
are several possible choices, but the most commonly used is the vanilla
universe. You can look up more details in the Condor manual[1].
Input = hello.in.$(PROCESS)
- The contents of this file will be used as standard input (stdin) for
this process. The macro $(PROCESS) is replaced by the process number,
starting at 0. As such, process number 0 will read in hello.in.0,
process number 1 will read in hello.in.1, and so on.
Output = hello.out.$(PROCESS)
- Anything the executable writes to standard output (stdout) will be
redirected to this file. As before, $(PROCESS) stands for the process
number.
Error = hello.err.$(PROCESS)
- Any error messages the processes write to standard error (stderr)
will be redirected to this file. As before, $(PROCESS) identifies the
process number.
Log = hello.log
- Condor stores the job log in this file. Any actions taken during the
execution of this job will be listed here.
Queue N
- This line tells Condor how many different processes to queue. If the
numerical argument is omitted, only one process is added to the execution
queue. The processes thus generated will be run on as many machines
as are available and for as long as required to complete the
execution.
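To make the $(PROCESS) substitution concrete, the following Python
sketch mimics how the file names of the example script would expand
for Queue 3. The helper names expand_process_macro and files_for_queue
are hypothetical, and the sketch only handles the literal text
$(PROCESS); it is an illustration, not Condor's own macro expansion.

```python
# Sketch only: mimics the $(PROCESS) substitution described above.
# The function names are hypothetical and not part of Condor.

TEMPLATES = {
    "Input": "hello.in.$(PROCESS)",
    "Output": "hello.out.$(PROCESS)",
    "Error": "hello.err.$(PROCESS)",
}


def expand_process_macro(template, process):
    """Replace the literal text $(PROCESS) with the process number."""
    return template.replace("$(PROCESS)", str(process))


def files_for_queue(count):
    """Return one {command: file name} mapping per queued process (0-based)."""
    return [
        {key: expand_process_macro(value, process) for key, value in TEMPLATES.items()}
        for process in range(count)
    ]


if __name__ == "__main__":
    for process, files in enumerate(files_for_queue(3)):
        print(process, files["Input"], files["Output"], files["Error"])
```

Lines without the macro, such as Log = hello.log, are left unchanged,
so all processes in the example share a single log file.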
All the files mentioned above are expected to be in the same directory
as the submit script and the executable binary. The commands used
to submit and control jobs are described in the next section.
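As a rough illustration of that directory layout, the Python sketch
below writes the example submit script and one input file per process
into a single directory. The function name prepare_job_dir, the file
name hello.sub and the placeholder input text are assumptions made for
this example; the hello executable itself would still have to be copied
into the same directory.

```python
# Sketch only: lays out a job directory the way the text above expects.
# prepare_job_dir, hello.sub and the placeholder input are assumptions.
import os
import tempfile

SUBMIT_LINES = [
    "Executable = hello",
    "Universe = vanilla",
    "Output = hello.out.$(PROCESS)",
    "Input = hello.in.$(PROCESS)",
    "Error = hello.err.$(PROCESS)",
    "Transfer_files = ALWAYS",
    "Log = hello.log",
]


def prepare_job_dir(directory, count, submit_name="hello.sub"):
    """Write the submit script and one stdin file per process into directory."""
    os.makedirs(directory, exist_ok=True)
    submit_path = os.path.join(directory, submit_name)
    with open(submit_path, "w") as handle:
        handle.write("\n".join(SUBMIT_LINES + ["Queue %d" % count]) + "\n")
    for process in range(count):
        # Each process reads its own hello.in.N on standard input.
        with open(os.path.join(directory, "hello.in.%d" % process), "w") as handle:
            handle.write("placeholder input for process %d\n" % process)
    return submit_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(prepare_job_dir(workdir, 3))
        print(sorted(os.listdir(workdir)))
```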
If you want to know how many nodes and CPUs are known to Condor you
can run:
-
- condor_status
This command will list all of the nodes that are running in the Condor
pool that your machine belongs to. It will also provide basic
information about each node, such as the architecture, operating
system, and whether or not any jobs are currently running on it.
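Some Condor features, namely the checkpointing and remote system calls
of the standard universe, require that your binaries are relinked with
some Condor-specific libraries. Jobs in the vanilla universe run
unmodified and can skip this step. The relinking is achieved by calling: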
-
- condor_compile COMMAND
where COMMAND stands for the command you would normally use to compile
your code. So, if you would normally compile your source code using:
-
- gcc foo.c -o foo.x
you should now use:
-
- condor_compile gcc foo.c -o foo.x
After we have linked our binary with the Condor libraries and have
written the submit script, we need to tell Condor to use it to create
a new job. All you do is type:
-
- condor_submit SCRIPT
If you want to see what the submit command is doing for debugging
purposes, you can simply type:
-
- condor_submit -v SCRIPT
After this command returns, all you need to do is wait for your job
to finish running. If everything goes according to plan, the output
of your runs should be in the Output files listed in the submit script
and any errors that have occurred will be in the Error files. Please
note that the Error files will always be created, even if no errors
occur.
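Because the Error files exist even for clean runs, checking for their
presence says nothing; what matters is whether they are empty. The
Python sketch below shows one way to check a finished run, assuming the
file names of the example submit script. The helper summarize_run is
hypothetical, and the demo files it inspects are made up purely for
illustration.

```python
# Sketch only: reports which per-process Error files are non-empty.
# summarize_run is a hypothetical helper; file names follow the example script.
import os
import tempfile


def summarize_run(directory, count):
    """Return (missing_outputs, processes_with_errors) for processes 0..count-1."""
    missing, failed = [], []
    for process in range(count):
        out_path = os.path.join(directory, "hello.out.%d" % process)
        err_path = os.path.join(directory, "hello.err.%d" % process)
        if not os.path.exists(out_path):
            missing.append(process)
        # Error files exist even for clean runs, so test their size, not existence.
        if os.path.exists(err_path) and os.path.getsize(err_path) > 0:
            failed.append(process)
    return missing, failed


if __name__ == "__main__":
    # Made-up demo files standing in for a finished run.
    with tempfile.TemporaryDirectory() as workdir:
        for process, err_text in enumerate(["", "something went wrong\n"]):
            open(os.path.join(workdir, "hello.out.%d" % process), "w").close()
            with open(os.path.join(workdir, "hello.err.%d" % process), "w") as handle:
                handle.write(err_text)
        print(summarize_run(workdir, 3))
```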
You can check the status of your job by typing:
-
- condor_q
The output of this command will tell you how many jobs are on Condor's
queue, which jobs are currently Running and which jobs are Idle. Jobs
can be idle simply because they haven't gotten their turn yet, or because
they encountered some sort of problem.
You can check if something went wrong by using:
-
- condor_q -analyze ID
The output should give you an idea of what's going on.
To remove a process from Condor's job queue, you type:
-
- condor_rm ID
where ID is the job's ID number, which you can get from a previous call
to condor_q.
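The commands above can also be driven from a script. The Python sketch
below only builds the command lines shown in this primer and runs one
of them if the Condor tools happen to be installed; it uses no flags
beyond those already mentioned. All helper names are hypothetical, and
the output of the tools is returned as-is rather than parsed.

```python
# Sketch only: builds the command lines shown above and runs them when the
# Condor tools are installed. Helper names are hypothetical.
import shlex
import shutil
import subprocess


def compile_command(normal_command):
    """Prefix a normal compile command with condor_compile."""
    return ["condor_compile"] + shlex.split(normal_command)


def submit_command(script, verbose=False):
    """Build condor_submit SCRIPT, optionally with -v."""
    return ["condor_submit"] + (["-v"] if verbose else []) + [script]


def analyze_command(job_id):
    """Build condor_q -analyze ID."""
    return ["condor_q", "-analyze", str(job_id)]


def remove_command(job_id):
    """Build condor_rm ID."""
    return ["condor_rm", str(job_id)]


def run_if_available(command):
    """Run command and return its stdout, or None if the tool is not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    print(compile_command("gcc foo.c -o foo.x"))
    print(run_if_available(["condor_status"]))
```

Keeping the command construction separate from the execution makes it
possible to inspect what would be run on a machine without Condor.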
There exist several other commands, and much more to be said about
Condor. This tutorial is meant to be just a very basic introduction
to the way Condor operates; after reading it you should understand
enough of Condor to start using it right away. Further information,
very detailed documentation[1] and FAQs can be found on Condor's
website: http://www.cs.wisc.edu/condor/
- [1]
- Condor Team. Condor Version 6.4 Manual.
University of Wisconsin-Madison, August 2002.
Footnotes
- 1
- As opposed to a High Performance Computing (HPC) environment, which
delivers very high performance over short periods of time.
© Copyright 2004 Bruno Goncalves - All rights reserved
 
|