 

How to build a linux Cluster - Part III

Filed under Cluster, How-to.


 

 

Series table of contents:

  1. How to build a linux Cluster - Part I
  2. How to build a linux Cluster - Part II
  3. How to build a linux Cluster - Part III

This post series documents how I built a powerful and scalable Linux cluster using only free software and off-the-shelf components. To build our cluster we are going to use three pieces of software:

In the first part of the series, I showed how to install DRBL on your server machine. In the second part, I explained how to install Condor on your DRBL cluster so that you can easily submit and manage your computing jobs. In this third and final installment, I give a brief introduction to using your newly created cluster.

Condor tutorial

Condor makes the unused CPU cycles of various computers available for general use in an efficient and transparent manner, dramatically improving the use of existing computational resources without affecting their normal use. It also allows a number of independent jobs to be scheduled together and run concurrently, as fast as resource availability allows.

 

SCRIPT file

Before you can submit a job to Condor you need to write a submit script that informs Condor of what is required to complete the computational task. A simple submit script can look something like this:

Executable = hello
Universe = Vanilla
Output = hello.out.$(CLUSTER).$(PROCESS)
Input = hello.in.$(PROCESS)
Error = hello.err.$(CLUSTER).$(PROCESS)
Transfer_files = ALWAYS
Log = hello.log
Queue 3

The only necessary lines are the first two and the last one. The meaning of each line is described below:

Executable = hello

This line tells Condor which binary or script to run. Arguments should not be included on this line; Condor provides a separate Arguments command for that.

Transfer_files = ALWAYS

This tells Condor to send the files to the remote machine, which allows your program to run on machines that don’t have all the necessary libraries installed. You should always include this line, even though it is not mandatory.

Universe = Vanilla

This line specifies the Condor universe to use. There are several possible choices, but the most commonly used is the Vanilla universe, which allows the use of any executable file. Another useful choice is the Standard universe, which regularly checkpoints your job and, if anything goes wrong, restarts it from the last available checkpoint. This requires the executable to be compiled and linked with condor_compile, as described below.

Input = hello.in.$(PROCESS)

The contents of this file will be used as <stdin> for this process. The macro $(PROCESS) is replaced by the process number, starting at 0. As such, process number 0 will read hello.in.0, process number 1 will read hello.in.1, etc.

Output = hello.out.$(CLUSTER).$(PROCESS)

Any output generated by the binary file will be redirected to this file. As before, $(PROCESS) stands for the process number. Condor also assigns a unique cluster job number to every set of jobs (that is, every time you use condor_submit). You can retrieve this number with the $(CLUSTER) macro and use it, as in this case, to give every output file a unique file name.

Error = hello.err.$(CLUSTER).$(PROCESS)

Any error messages produced by the processes will be redirected to this file. Similarly to before, $(PROCESS) identifies the process number.
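
To make the macro substitution concrete, the following Python sketch mimics how $(CLUSTER) and $(PROCESS) would be expanded for a three-process submission. It is only an illustration: the function names and the cluster number 225 are assumptions made for this example, and Condor performs the real substitution itself when the job is submitted.

```python
def expand_macros(template, cluster, process):
    """Replace $(CLUSTER) and $(PROCESS) in a file name template.

    Illustration only: Condor does this substitution itself.
    """
    return (template
            .replace("$(CLUSTER)", str(cluster))
            .replace("$(PROCESS)", str(process)))


def expected_files(cluster, n_processes):
    """List the input, output and error file names for each process."""
    templates = {
        "input": "hello.in.$(PROCESS)",
        "output": "hello.out.$(CLUSTER).$(PROCESS)",
        "error": "hello.err.$(CLUSTER).$(PROCESS)",
    }
    return [
        {kind: expand_macros(text, cluster, process)
         for kind, text in templates.items()}
        for process in range(n_processes)  # process numbers start at 0
    ]


if __name__ == "__main__":
    # 225 is a made-up cluster number used only for this demonstration.
    for files in expected_files(225, 3):
        print(files)
```

Running a check like this before submitting is a cheap way to confirm that every input file the processes will look for actually exists.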

Log = hello.log

Condor stores the job log in this file. Any actions taken during the execution of this job, such as submission, execution, eviction, etc., will be listed here.

Queue N

This line tells Condor how many different processes to queue. If the numerical argument is omitted, only one process is added to the execution queue. The processes thus generated will be run on as many machines as are available and for as long as required to complete the execution.
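
If you submit many similar jobs, it can be convenient to generate the submit script instead of editing it by hand. The sketch below assembles the same eight lines described above, using the file name patterns from the line-by-line walkthrough. The function name build_submit_script and its parameters are hypothetical, chosen for this example only.

```python
def build_submit_script(executable, n_processes=1, universe="Vanilla"):
    """Return the text of a submit script for `executable`.

    Hypothetical helper: it only builds text and never calls Condor.
    """
    lines = [
        f"Executable = {executable}",
        f"Universe = {universe}",
        f"Output = {executable}.out.$(CLUSTER).$(PROCESS)",
        f"Input = {executable}.in.$(PROCESS)",
        f"Error = {executable}.err.$(CLUSTER).$(PROCESS)",
        "Transfer_files = ALWAYS",
        f"Log = {executable}.log",
        f"Queue {n_processes}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script; redirect to a file to use it with condor_submit.
    print(build_submit_script("hello", n_processes=3), end="")
```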

All the files mentioned above are expected to be in the same directory as the submit script and the executable binary. Several commands are used to submit and control jobs:

condor_status

To find out how many nodes and CPUs are known to Condor, you can run:

condor_status

This lists all of the nodes in the Condor pool that your machine belongs to. It also provides basic information about each node, such as the architecture, the operating system, and whether or not any jobs are currently running. When I run condor_status on one of my clusters, I see (the output is truncated for clarity):

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
 
mops001       LINUX       INTEL  Claimed    Busy       1.000   502  1+17:58:14
 
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
 
         INTEL/LINUX    42     0      12        30       0          0        0
 
               Total    42     0      12        30       0          0        0

This tells me that machine mops001 is running LINUX on an INTEL compatible CPU with 502 MB of RAM and that it has been running a job for 1 day, 17 hours, 58 minutes and 14 seconds. The final part of the output says that all 42 machines are INTEL/LINUX and that only 12 of them are running jobs, with the remaining 30 machines being idle (Unclaimed). The Owner column accounts for how many machines are currently being used by someone logged in directly to them. In our case, this will (almost) always be zero, unless some administrative job run by the system is using a considerable amount of CPU time.
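
If you want to track these totals over time, the summary lines can be parsed with a few lines of Python. This is a minimal sketch that assumes the column layout shown in the output above; other Condor versions may print different columns, and the function name parse_totals is made up for this example.

```python
def parse_totals(header_line, totals_line):
    """Pair the column names of the summary header with one row of counts.

    Assumes the layout shown above: a row label followed by one
    integer per header column.
    """
    names = header_line.split()
    fields = totals_line.split()
    label, counts = fields[0], [int(value) for value in fields[1:]]
    if len(counts) != len(names):
        raise ValueError("header and totals line do not line up")
    return label, dict(zip(names, counts))


if __name__ == "__main__":
    header = "Total Owner Claimed Unclaimed Matched Preempting Backfill"
    row = "INTEL/LINUX    42     0      12        30       0          0        0"
    label, counts = parse_totals(header, row)
    print(label, counts["Claimed"], "claimed,", counts["Unclaimed"], "unclaimed")
```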

condor_compile COMMAND

If you want to take full advantage of Condor’s checkpointing and opportunistic computing abilities, you must use the Standard universe and link your binaries with some Condor-specific libraries. This is achieved by calling:

condor_compile COMMAND

where COMMAND stands for the command you would normally use to compile your code. So, if you would normally compile your source code using:

gcc foo.c -o foo.x

you should now use:

condor_compile gcc foo.c -o foo.x
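
If your builds are driven from a script, the same prefixing rule can be applied automatically. The following sketch only builds the argument list and does not run anything; the helper name condor_compile_command is an assumption for this example.

```python
import shlex


def condor_compile_command(compile_command):
    """Prefix an ordinary compile command with condor_compile.

    Returns an argument list; it does not execute the command.
    """
    args = shlex.split(compile_command)
    if not args:
        raise ValueError("empty compile command")
    if args[0] == "condor_compile":
        return args  # already wrapped, leave it alone
    return ["condor_compile"] + args


if __name__ == "__main__":
    print(" ".join(condor_compile_command("gcc foo.c -o foo.x")))
```

On a machine where Condor is installed, the resulting list could be passed to subprocess.run to perform the actual build.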

condor_submit SCRIPT

After we have linked our binary with the condor libraries and have written an appropriate submit script, we need to tell Condor to use it to create a new job. All you do is type:

condor_submit SCRIPT

If you want to see what the submit command is doing, for debugging purposes, you can simply type:

condor_submit -v SCRIPT

After this command returns, all that is left for you to do is wait for the job(s) to finish running. If everything goes according to plan, the output of your runs should be in the Output files listed in the submit script, and any errors that might have occurred will be in the Error files. Please note that the Error files will always be created, even if no errors occur.
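
Because the Error files are created even when nothing goes wrong, the quickest check after a run is to look for the ones that are not empty. The sketch below does this with the Python standard library; the function name and the default hello.err.* pattern are assumptions based on the example submit script.

```python
import glob
import os
import tempfile


def nonempty_error_files(directory=".", pattern="hello.err.*"):
    """Return the error files in `directory` that actually contain text."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    return [path for path in paths if os.path.getsize(path) > 0]


if __name__ == "__main__":
    # Demonstration with made-up files in a temporary directory.
    with tempfile.TemporaryDirectory() as workdir:
        samples = [("hello.err.225.0", ""), ("hello.err.225.1", "made-up error\n")]
        for name, text in samples:
            with open(os.path.join(workdir, name), "w") as handle:
                handle.write(text)
        for path in nonempty_error_files(workdir):
            print(os.path.basename(path))
```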

condor_q

You can check the status of your job by typing:

condor_q

The output of this command will tell you how many jobs are in Condor’s queue, which jobs are currently Running and which jobs are Idle. Jobs can be idle simply because they haven’t gotten their turn yet, or because they encountered some sort of problem. For example (truncated):

-- Submitter: underdark : <xxx.xxx.xxx.xxx:60572> : underdark
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 225.1   bgoncalves      5/1  14:19  15+12:39:24 R  0   361.3 Surf.x input/Surf.in
 
12 jobs; 0 idle, 12 running, 0 held

This tells me that in the queue managed by the machine “underdark”, user “bgoncalves” submitted job 225.1 on May 1st at 14:19, and that it has been running (R) for 15 days, 12 hours, 39 minutes and 24 seconds using 361.3 MB of RAM with the command line (possibly truncated) “Surf.x input/Surf.in”. The last line lets me know that there are a total of 12 jobs in the queue and that all of them are running.
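
The RUN_TIME column uses a days+hours:minutes:seconds format that is easy to misread. A small Python sketch for converting it, assuming the value always has the D+HH:MM:SS shape shown above (the function name parse_run_time is made up for this example):

```python
from datetime import timedelta


def parse_run_time(text):
    """Convert a 'D+HH:MM:SS' run time string into a timedelta.

    Assumes the exact shape shown in the condor_q output above.
    """
    days, plus, clock = text.partition("+")
    if not plus:
        raise ValueError(f"expected D+HH:MM:SS, got {text!r}")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(days=int(days), hours=hours,
                     minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    elapsed = parse_run_time("15+12:39:24")
    print(elapsed.days, "days,", elapsed.seconds, "seconds")
```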

You can also check if something went wrong by using:

condor_q -analyze ID

The output should give you an idea of what’s going on.

condor_rm ID

If something went really wrong (or you just found a much better way of implementing something, for instance) and you are no longer interested in letting your job finish, you can remove a process from Condor’s job queue by typing:

condor_rm ID

where ID is the job’s ID number, which you can get from a previous call to condor_q.
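
Since removing a job cannot be undone, a script that calls condor_rm may want to validate the ID first. The sketch below accepts IDs of the form shown by condor_q (cluster.process, such as 225.1) or a bare cluster number, and only builds the argument list. The helper name and the validation rule are assumptions for this example.

```python
import re

# A cluster number, optionally followed by ".process" (for example 225.1).
JOB_ID = re.compile(r"^\d+(?:\.\d+)?$")


def condor_rm_command(job_id):
    """Build the condor_rm argument list after checking the ID's shape.

    Hypothetical helper: it does not execute anything.
    """
    if JOB_ID.match(job_id) is None:
        raise ValueError(f"not a valid job ID: {job_id!r}")
    return ["condor_rm", job_id]


if __name__ == "__main__":
    print(" ".join(condor_rm_command("225.1")))
```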

Further information, very detailed documentation and FAQs can be found on Condor’s website: http://www.cs.wisc.edu/condor/









 

© Copyright 2004 Bruno Goncalves - All rights reserved
