How to build a linux Cluster - Part III
Series table of contents:
- How to build a linux Cluster - Part I
- How to build a linux Cluster - Part II
- How to build a linux Cluster - Part III
This post series documents how I built a powerful and scalable Linux cluster using only free software and off-the-shelf components. To build our cluster we are going to use three pieces of software:
In the first part of the series, I showed you how to install DRBL on your server machine. In the second part, I explained how to install Condor on your DRBL cluster so you could easily submit and manage your computing jobs. In this third and final installment, I give you a brief introduction to using your newly created cluster.
Condor tutorial
Condor makes the unused CPU cycles of a group of computers available for general use in an efficient and transparent manner, dramatically improving the use of existing computational resources without affecting their normal use. It also lets you schedule a number of independent jobs at once and runs them concurrently, as fast as resource availability allows.
SCRIPT file
Before you can submit a job to Condor you need to write a submit script that informs Condor of what is required to complete the computational task. A simple submit script can look something like this:
    Executable = hello
    Universe = Vanilla
    Output = hello.out.$(PROCESS)
    Input = hello.in.$(PROCESS)
    Error = hello.err.$(PROCESS)
    Transfer_files = ALWAYS
    Log = hello.log
    Queue 3
The only necessary lines are the first two and the last one. The meaning of each line is described below:
    Executable = hello
This line tells Condor what binary/script to run. Arguments shouldn’t be passed along with the binary.
    Transfer_files = ALWAYS
This tells Condor to send the files to the remote machine, which allows your program to run on machines that don’t have all the necessary libraries installed. You should always include this line, even though it is not mandatory.
    Universe = Vanilla
This specifies the Condor universe to use. There are several possible choices, but the most commonly used is the Vanilla universe, which allows for the use of any executable file. Another useful choice is the Standard universe, which regularly checkpoints your job and, if anything goes wrong, restarts it from the last available checkpoint. The Standard universe requires the executable to be compiled and linked with condor_compile, as described below.
    Input = hello.in.$(PROCESS)
The contents of this file will be used as <stdin> for this process. The macro $(PROCESS) is replaced by the process number, starting at 0. As such, process number 0 will read hello.in.0, process number 1 will read hello.in.1, and so on.
    Output = hello.out.$(CLUSTER).$(PROCESS)
Any output generated by the binary will be redirected to this file. As before, $(PROCESS) stands for the process number. Condor also assigns a unique cluster number to every set of jobs (that is, every time you use condor_submit). You can retrieve this number with the $(CLUSTER) macro and use it, as in this case, to give every output file a unique file name.
    Error = hello.err.$(CLUSTER).$(PROCESS)
Any error messages produced by the processes will be redirected to this file. Similarly to before, $(PROCESS) identifies the process number.
    Log = hello.log
Condor stores the job log in this file. Any actions taken during the execution of this job, such as submission, execution, eviction, etc., will be listed here.
    Queue N
This line tells Condor how many processes to queue. If the numerical argument is omitted, only one process is added to the execution queue. The processes thus generated will run on as many machines as are available and for as long as required to complete execution.
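Putting the lines above together, a job directory for the three-process hello example could be prepared with a short shell script along these lines. This is only a sketch: the file name hello.sub, the placeholder input contents and the "-n 10" arguments are assumptions made for illustration, not something the hello program is known to accept.

```shell
#!/bin/sh
# Prepare a job directory for the three-process "hello" example.
NPROCS=3

# One stdin file per process: hello.in.0, hello.in.1, hello.in.2.
# The contents are placeholders for whatever your program actually reads.
i=0
while [ "$i" -lt "$NPROCS" ]; do
    echo "input for process $i" > "hello.in.$i"
    i=$((i + 1))
done

# Write the submit script (hello.sub is a hypothetical name).
# The heredoc delimiter is quoted so the shell leaves $(CLUSTER) and
# $(PROCESS) untouched; unquoted, the shell would try to run them as commands.
cat > hello.sub <<'EOF'
Executable     = hello
# Arguments go on their own line, never after the executable name.
# "-n 10" is a made-up example; use whatever your program accepts.
Arguments      = -n 10
Universe       = Vanilla
Input          = hello.in.$(PROCESS)
Output         = hello.out.$(CLUSTER).$(PROCESS)
Error          = hello.err.$(CLUSTER).$(PROCESS)
Transfer_files = ALWAYS
Log            = hello.log
EOF

# The Queue count must match the number of input files created above.
echo "Queue $NPROCS" >> hello.sub

cat hello.sub
```

Keeping the process count in a single variable avoids the easy mistake of queueing more processes than there are input files.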
All the files mentioned above are expected to be in the same directory as the submit script and the executable binary. Several commands are used to submit and control jobs:
condor_status
To find out how many nodes and CPUs are known to Condor you can run:
    condor_status
This lists all of the nodes in the Condor pool that your machine belongs to. It also provides basic information about each node, such as the architecture, operating system, and whether or not there are any jobs currently running. When I run condor_status on one of my clusters, I see (the output is truncated for clarity):
    Name     OpSys  Arch   State    Activity  LoadAv  Mem  ActvtyTime
    mops001  LINUX  INTEL  Claimed  Busy      1.000   502  1+17:58:14

                 Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
    INTEL/LINUX     42      0       12         30        0           0         0
    Total           42      0       12         30        0           0         0
This tells me that machine mops001 is running LINUX on an INTEL-compatible CPU with 502 MB of RAM and that it has been running a job for 1 day, 17 hours, 58 minutes and 14 seconds. The final part of the output says that all 42 machines are INTEL/LINUX and that only 12 of them are running jobs, with the remaining 30 machines idle. The Owner column counts the machines currently being used by someone logged in directly to them. In our case, this will (almost) always be zero, unless the system is running some administrative job that uses a considerable amount of CPU time.
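If you only care about the totals, the final row can be picked out with a little awk. The sketch below assumes the column order shown in the output above (Total, Owner, Claimed, Unclaimed, ...); the function name summarize_totals is made up for this example, and other Condor versions may lay the summary out differently.

```shell
#!/bin/sh
# summarize_totals reads condor_status output on stdin and reports how many
# machines are free, based on the final "Total" row.
# Assumed column order: label, Total, Owner, Claimed, Unclaimed, ...
summarize_totals() {
    # The header row also starts with "Total", so require a numeric 2nd field.
    awk '$1 == "Total" && $2 ~ /^[0-9]+$/ {
        printf "%d of %d machines unclaimed\n", $5, $2
    }'
}

# Sample rows copied from the output discussed above. On a live cluster use:
#   condor_status | summarize_totals
summarize_totals <<'EOF'
             Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX     42      0       12         30        0           0         0
Total           42      0       12         30        0           0         0
EOF
```

With the sample rows this prints "30 of 42 machines unclaimed".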
condor_compile COMMAND
If you want to take full advantage of Condor’s checkpointing and opportunistic computing abilities, you must use the Standard universe and link your binaries with some Condor-specific libraries. This is achieved by calling:
    condor_compile COMMAND
where COMMAND stands for the command you would normally use to compile your code. So, if you would normally compile your source code using:
    gcc foo.c -o foo.x
you should now use:
    condor_compile gcc foo.c -o foo.x
condor_submit SCRIPT
After we have linked our binary with the Condor libraries and written an appropriate submit script, we need to tell Condor to use it to create a new job. All you do is type:
    condor_submit SCRIPT
If you want to see what the submit command is doing for debugging purposes, you can simply type:
    condor_submit -v SCRIPT
After this command returns, all that is left for you to do is wait for the job(s) to finish running. If everything goes according to plan, the output of your runs should be in the Output files listed in the submit script, and any errors that might have occurred will be in the Error files. Please note that the Error files will always be created, even if no errors occur.
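Since the Error files exist whether or not anything went wrong, the quickest check after a run is to look for the ones that are not empty. A minimal sketch, using a made-up helper name (nonempty_errors) and two stand-in files created only for the demonstration:

```shell
#!/bin/sh
# Print the name of every file given as an argument that is not empty.
# "[ -s FILE ]" is true only if FILE exists and has a size greater than zero.
nonempty_errors() {
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
}

# Demonstration with stand-in files (hypothetical names and contents):
: > hello.err.225.0                          # empty: nothing written to stderr
echo "something went wrong" > hello.err.225.1

nonempty_errors hello.err.225.0 hello.err.225.1
```

After a real run you would call it as nonempty_errors hello.err.* from the job directory; no output means no process wrote anything to its error stream.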
condor_q
You can check the status of your job by typing:
    condor_q
The output of this command will tell you how many jobs are in Condor’s queue, which jobs are currently Running and which jobs are Idle. Jobs can be idle simply because they haven’t gotten their turn yet, or because they encountered some sort of problem. For example (truncated):
    -- Submitter: underdark : <xxx.xxx.xxx.xxx:60572> : underdark
     ID      OWNER        SUBMITTED     RUN_TIME    ST  PRI  SIZE   CMD
     225.1   bgoncalves   5/1 14:19   15+12:39:24   R   0    361.3  Surf.x input/Surf.in

    12 jobs; 0 idle, 12 running, 0 held
tells me that in the queue managed by the machine “underdark”, user “bgoncalves” submitted job 225.1 on May 1st at 14:19, and that it has been running (R) for 15 days, 12 hours, 39 minutes and 24 seconds using 361.3 MB of RAM with the command line (possibly truncated) “Surf.x input/Surf.in”. The last line lets me know that there are a total of 12 jobs in the queue and that all of them are running.
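The summary line is also easy to use from a script, for instance to see how many jobs are still running. The sketch below assumes the summary has the form shown above ("N jobs; I idle, R running, H held"); running_jobs is a made-up helper name.

```shell
#!/bin/sh
# running_jobs reads condor_q output on stdin and prints the number of
# running jobs, taken from the summary line.
running_jobs() {
    # On the summary line, print the field just before the word "running".
    awk '/jobs;/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^running/) print $(i - 1)
    }'
}

# Summary line copied from the example above. On a live cluster use:
#   condor_q | running_jobs
echo "12 jobs; 0 idle, 12 running, 0 held" | running_jobs
```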
You can also check if something went wrong by using:
    condor_q -analyze ID
The output should give you an idea of what’s going on.
condor_rm ID
If something went really wrong (or you just found a much better way of implementing something, for instance) and you are no longer interested in letting your job finish, you can remove a process from Condor’s job queue by typing:
    condor_rm ID
where ID is the job’s ID number, which you can get from a previous call to condor_q.
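The IDs printed by condor_q have the form CLUSTER.PROCESS, so 225.1 is process 1 of cluster 225. The sketch below only splits such an ID with plain shell parameter expansion; the condor_rm calls are left as comments because they need a live cluster.

```shell
#!/bin/sh
# Split a job ID of the form CLUSTER.PROCESS, as printed by condor_q.
id="225.1"
cluster=${id%%.*}    # everything before the first dot
process=${id#*.}     # everything after the first dot
echo "cluster=$cluster process=$process"

# On a live cluster (not run here):
#   condor_rm "$id"        removes only this one process
#   condor_rm "$cluster"   removes every process in the cluster
```

Passing just the cluster number is the convenient form when a single submit script queued many processes and you want to cancel all of them at once.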
Further information, very detailed documentation and FAQs can be found on Condor’s website: http://www.cs.wisc.edu/condor/