Using Condor to Run Sims (VLSI, ULP and Arch Group)

Overview:

Our group currently has several server machines running the job-sharing application “Condor.” Over the past few months I have been running many circuit simulations, some of which can take days, so it makes sense to harness the power of this computing cluster. Condor is pretty simple: take a script or other executable you have written and submit it to the cluster. Your job is automatically sent to an available machine. When it's done, all of the output appears in the directory you submitted the job from, and Condor emails you to say your job is finished.

Submitting a Job

- the executable. Condor can run any executable. However, if you have lots of parallel jobs to do, I recommend using a Perl script that takes different command-line parameters for each job. Here is an example script [char.pl] that I used to test many different process technology configurations, along with the related HSPICE files [ring_a.sp, ring_l.sp]. The Condor machines automatically mount /home/cktcad.

- the command file. Before submitting a job, you need to create a Condor command file (the Condor manual calls this a “submit description file”). It tells Condor which files to upload and which program to execute. Here is an example, ckt.cmd.
- the “executable” field specifies your executable
- the “transfer_input_files” field tells Condor which files to transfer (e.g. Spice files)
- the other fields (“log”, “output”, “error”, etc.) specify where Condor records the job log, stdout, and stderr for each process
- executable output such as Spice traces or text files is automatically transferred back when the job completes
- “arguments” are the command-line arguments to supply to your executable. Each set of arguments followed by “queue” is a separate Condor job.
- there are many other options! Check out the Condor manual (link below) for details.
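The char.pl script isn't reproduced here. As a rough sketch of the same pattern (one script, different command-line parameters per job), here is a hypothetical Python wrapper. The @NAME@ placeholder convention in the Spice deck, the parameter names, and the hspice -i/-o command line are all assumptions for illustration, not what char.pl actually does. If you use a script like this as your Condor executable, make sure its #! line points at an interpreter that exists on the Condor machines.

```python
#!/usr/bin/env python3
"""Hypothetical per-job wrapper: one Condor job per set of NAME=VALUE args."""
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def parse_params(args: list[str]) -> dict[str, str]:
    """Turn ["vdd=1.0", "corner=tt"] into {"vdd": "1.0", "corner": "tt"}."""
    params = {}
    for arg in args:
        name, sep, value = arg.partition("=")
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {arg!r}")
        params[name] = value
    return params


def render_deck(template: str, params: dict[str, str]) -> str:
    """Replace hypothetical @NAME@ placeholders in a Spice template."""
    for name, value in params.items():
        template = template.replace(f"@{name}@", value)
    return template


def job_tag(params: dict[str, str]) -> str:
    """Build a file-name tag so parallel jobs never overwrite each other."""
    return "_".join(f"{name}{value}" for name, value in sorted(params.items()))


def main(argv: list[str]) -> None:
    deck = Path(argv[1])
    params = parse_params(argv[2:])
    out_deck = Path(f"{deck.stem}_{job_tag(params)}.sp")
    out_deck.write_text(render_deck(deck.read_text(), params))
    # Assumed HSPICE invocation (-i input deck, -o output prefix);
    # check the options against the simulator installed on the cluster.
    subprocess.run(["hspice", "-i", str(out_deck), "-o", out_deck.stem], check=True)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv)
    else:
        print(f"usage: {sys.argv[0]} DECK [NAME=VALUE ...]", file=sys.stderr)
```

Each job then differs only in its arguments line, e.g. ring_a.sp vdd=1.0, and writes its own ring_a_vdd1.0.sp deck, so jobs running side by side don't clobber each other's files.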
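The example ckt.cmd isn't reproduced here, so as a minimal sketch, the Python function below writes a command file of the same shape: one “arguments”/“queue” pair per job. The argument values are hypothetical, and the universe, should_transfer_files, when_to_transfer_output, notification, and $(Process) settings are standard submit-file commands I've assumed rather than copied from ckt.cmd, so compare them against the Condor manual for the version we run.

```python
from __future__ import annotations


def make_submit_file(executable: str, input_files: list[str],
                     argument_sets: list[list[str]], log: str = "ckt.log") -> str:
    """Build a Condor command file with one job per argument set."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"transfer_input_files = {', '.join(input_files)}",
        # Ship input files to the execute machine, bring outputs back on exit.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # Ask Condor to email when each job completes.
        "notification = Complete",
        f"log = {log}",
        # $(Process) expands to 0, 1, 2, ... so each job gets its own files.
        "output = ckt.out.$(Process)",
        "error = ckt.err.$(Process)",
        "",
    ]
    for args in argument_sets:
        # Plain space-separated arguments; values containing spaces would
        # need Condor's quoted argument syntax instead.
        lines.append(f"arguments = {' '.join(args)}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sweep: one job per supply voltage.
    sweeps = [["ring_a.sp", f"vdd={v}"] for v in ("0.8", "1.0", "1.2")]
    print(make_submit_file("char.pl", ["ring_a.sp", "ring_l.sp"], sweeps), end="")
```

Redirecting the output to a file (e.g. python make_cmd.py > ckt.cmd, script name hypothetical) gives a command file you can pass straight to condor_submit. Using $(Process) in the output and error names keeps the three jobs from writing to the same files.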

Now you are ready to submit. Log on to one of the Condor machines (see below), cd into the directory with your files (executable, input files, command file), and run the following command:

%condor_submit <command_file>

Condor should report “X job(s) submitted to cluster N.”

That's it, your jobs are submitted!
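If you submit from a script (for example, to kick off a sweep and then watch just that cluster), you can capture the cluster number from condor_submit's output. This is a minimal sketch, assuming the summary line looks like “3 job(s) submitted to cluster 20.”; the exact wording may vary between Condor versions, so check it against what our machines print.

```python
from __future__ import annotations

import re
import subprocess

# Assumed format of condor_submit's summary line, e.g.
# "3 job(s) submitted to cluster 20." (may differ between Condor versions).
SUBMIT_RE = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)")


def parse_submit_output(text: str) -> tuple[int, int]:
    """Return (number_of_jobs, cluster_id) from condor_submit's stdout."""
    match = SUBMIT_RE.search(text)
    if match is None:
        raise ValueError(f"unrecognized condor_submit output: {text!r}")
    return int(match.group(1)), int(match.group(2))


def submit(command_file: str) -> tuple[int, int]:
    """Run condor_submit on a command file and return (jobs, cluster)."""
    result = subprocess.run(
        ["condor_submit", command_file],
        capture_output=True, text=True, check=True,
    )
    return parse_submit_output(result.stdout)
```

For example, jobs, cluster = submit("ckt.cmd") returns the cluster number, which you can then pass to condor_q to follow only that cluster.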

Checking on Job Status

Use the following commands when logged into a Condor machine to check on the status of your jobs. There are many others, so check out the Condor manual.

%condor_q
(prints the jobs currently submitted to condor, along with memory size, running status, and time so far)

%condor_q -global
(displays the entire condor queue, not just the queue on the machine you are logged in to)

%condor_q -analyze
(displays your jobs and why they are not currently running; this helps debug problems with job matching)

%condor_rm <job or cluster number>
(removes jobs from the queue)

%condor_status
(displays the status of all of the CPUs in the condor cluster)

%/usr/local/condor/bin/condor_prio -p <priority> <job or cluster number>
(changes the priority of a job or cluster; the priority is an integer from -20 (lowest) to 20 (highest). These priorities are only used when choosing among jobs submitted by the same user; priorities among different users are handled separately.)
e.g. %/usr/local/condor/bin/condor_prio -p 10 20.1
changes the priority of job 20.1 to 10

%condor_version
(displays the version of condor running on the current computer)
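Job numbers are written as cluster.process, so 20.1 is process 1 of cluster 20, and a bare 20 refers to the whole cluster; condor_rm and condor_prio accept either form. The helpers below are a small hypothetical sketch for building those commands from a script, checking the -20 to 20 priority range described above before anything is run. The function names are my own, not part of Condor.

```python
from __future__ import annotations

import subprocess

# condor_prio path used on our machines (from the command list above).
CONDOR_PRIO = "/usr/local/condor/bin/condor_prio"


def parse_job_id(job_id: str) -> tuple[int, int | None]:
    """Split "20.1" into (20, 1); a bare "20" names the whole cluster."""
    cluster, sep, proc = job_id.partition(".")
    return int(cluster), (int(proc) if sep else None)


def prio_command(priority: int, job_id: str) -> list[str]:
    """Build a condor_prio command line, checking the -20..20 range."""
    if not -20 <= priority <= 20:
        raise ValueError(f"priority {priority} outside -20..20")
    parse_job_id(job_id)  # raises ValueError on a malformed id
    return [CONDOR_PRIO, "-p", str(priority), job_id]


def rm_command(job_id: str) -> list[str]:
    """Build a condor_rm command line for a job or a whole cluster."""
    parse_job_id(job_id)
    return ["condor_rm", job_id]


def set_priority(priority: int, job_id: str) -> None:
    """Actually run condor_prio; raises CalledProcessError on failure."""
    subprocess.run(prio_command(priority, job_id), check=True)
```

For example, set_priority(10, "20.1") runs the same command as the condor_prio example above, while set_priority(25, "20.1") is rejected before Condor is ever called.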


Condor Machines (more to come)

  • tiamat (server)

  • typhon (server)

  • fafnir (server)

  • labbu (server, condor master)


FAQs



Links

  • Condor Project Homepage (manuals and other good stuff)