A Quick Tutorial for beta_grid Cluster
with Sun N1 Grid Engine 6.0


This tutorial shows beta users how to submit jobs to the beta_grid cluster, which is built on beta and ICICS servers. The cluster currently has 21 machines with 41 processors. All beta workstations are configured to submit jobs to the cluster. For details, refer to the Sun N1 Grid Engine 6 User's Guide.

A grid is a collection of computing resources that perform tasks. In its simplest form, a grid appears to users as a large system that provides a single point of access to powerful distributed resources. Users treat the grid as a single computational resource. Resource management software such as N1 Grid Engine 6 software (grid engine software) accepts jobs submitted by users. The software uses resource management policies to schedule jobs to run on appropriate systems in the grid. Users can submit millions of jobs at a time without being concerned about where the jobs run.


beta_grid System Configuration
The beta_grid cluster uses N1 Grid Engine 6 for resource management. The cluster includes a master host and the execution hosts listed in Table 1.
The master host accepts jobs from users and puts them in a job queue until they can be run. It then sends jobs from the queue to one or more execution hosts. The grid engine system manages running jobs and logs a record of each job's execution when the job finishes.

Table 1: Execution hosts in beta_grid cluster
HOSTNAME   ARCH      NCPU  LOAD  MEMTOT   MEMUSE  SWAPTO  SWAPUS
aluminum   lx26-x86  2     2.00  1011.7M  151.0M  5.0G    4.0K
arsenic    lx26-x86  2     2.00  1011.7M  152.0M  5.0G    4.0K
boron      lx26-x86  2     2.88  1011.7M  216.4M  4.0G    0.0
cesium     lx26-x86  2     2.06  1010.9M  157.8M  5.0G    0.0
delos      lx26-x86  2     0.00  2.0G     164.6M  5.0G    8.0K
francium   lx26-x86  2     1.99  1010.9M  131.1M  5.0G    0.0
gallium    lx26-x86  2     1.98  1010.9M  154.9M  5.0G    0.0
indium     lx26-x86  2     2.00  1010.9M  135.1M  5.0G    0.0
lithium    lx26-x86  2     2.00  1011.7M  136.5M  5.0G    9.9M
potassium  lx26-x86  2     2.01  1011.7M  144.1M  5.0G    13.7M
rubidium   lx26-x86  2     2.04  1010.9M  119.5M  5.0G    4.0K
samos      lx26-x86  2     2.97  3.8G     457.3M  5.0G    8.0K
saria      lx26-x86  2     1.00  3.8G     328.2M  5.0G    0.0
serifos    lx26-x86  2     3.00  3.8G     426.6M  5.0G    8.0K
sifnos     lx26-x86  1     2.00  3.8G     339.4M  5.0G    4.0K
sikinos    lx26-x86  2     2.95  3.8G     370.3M  5.0G    0.0
silicon    lx26-x86  2     1.99  1010.9M  126.2M  5.0G    0.0
sodium     lx26-x86  2     2.00  1010.9M  123.0M  5.0G    0.0
tellurium  lx26-x86  2     2.00  1010.9M  108.7M  5.0G    11.3M
tin        lx26-x86  2     2.00  1010.9M  108.7M  5.0G    11.3M



Preparation

Before you run any grid engine system command, you must first set your executable search path and other environment conditions properly. From the command line, type one of the following commands.


You may want to add these commands to your .login, .cshrc, or .profile file, whichever is appropriate. Doing so guarantees proper settings for all interactive sessions you start later.
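
The setup commands themselves do not appear in this copy of the tutorial. As a rough sketch only, assuming the standard Grid Engine layout in which a Bourne-shell settings file lives at $SGE_ROOT/<cell>/common/settings.sh and the cell is named "default" unless SGE_CELL says otherwise, a .profile could load it as follows. The helper name sge_settings_path is hypothetical, and the actual path on beta_grid may differ.

```sh
# Hypothetical helper: print the path of the Bourne-shell settings file
# for a given SGE root directory and cell name (cell defaults to "default").
sge_settings_path() {
    printf '%s/%s/common/settings.sh\n' "$1" "${2:-default}"
}

# Source the settings file only if SGE_ROOT is set and the file is readable.
if [ -n "${SGE_ROOT:-}" ]; then
    settings=$(sge_settings_path "$SGE_ROOT" "${SGE_CELL:-}")
    if [ -r "$settings" ]; then
        . "$settings"
    else
        echo "Grid Engine settings file not found: $settings" >&2
    fi
fi
```

csh and tcsh users would instead source the settings.csh file that normally sits in the same directory, from .login or .cshrc.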

To check if your workstation is allowed to submit jobs to the cluster, type
% qconf -ss
If your workstation is not in the list, please contact the cluster administrator for assistance.
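
The check can also be scripted. The following is a minimal sketch in Bourne shell, assuming that qconf -ss prints one submit host per line and that the name printed by hostname matches the form used in that list (short name versus fully qualified name may differ on your system). The helper name host_in_list is hypothetical.

```sh
# Hypothetical helper: succeed if host name $1 appears as a whole line
# in the list read from standard input (one host per line).
host_in_list() {
    grep -Fxq -- "$1"
}

# Only query the cluster when qconf is actually on the PATH.
if command -v qconf >/dev/null 2>&1; then
    if qconf -ss | host_in_list "$(hostname)"; then
        echo "this workstation is a submit host"
    else
        echo "this workstation is not a submit host" >&2
    fi
fi
```

Matching whole lines with fixed strings avoids false positives where one host name is a prefix of another.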

Test your workstation with a simple example

You can find an example in the file $SGE_ROOT/examples/jobs/simple.sh. To run a test, type the following command. The command assumes that simple.sh is the name of the script file, and that the file is located in your current working directory.
% qsub simple.sh
The qsub command should confirm the successful job submission as follows:

your job 1 ("simple.sh") has been submitted

Now enter the following command to retrieve status information about your job.
% qstat

You should receive a status report that provides information about all jobs currently known to the grid engine system. For each job, the status report lists the following items:

If qstat produces no output, no jobs are actually known to the system. For example, your job might already have finished.

You can check the output of finished jobs by inspecting their stdout and stderr redirection files. By default, these files are generated in the job owner's home directory on the host that ran the job. Each file name consists of the job script file name, followed by a .o extension for the stdout file or a .e extension for the stderr file, followed by the unique job ID. Thus the stdout and stderr files of your job can be found under the names simple.sh.o1 and simple.sh.e1 respectively. These names apply if your job was the first ever executed in a newly installed grid engine system; otherwise the job ID will be a larger number.
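
The naming rule is easy to reproduce in a script, for example to locate the output of a job whose ID you recorded. This is an illustrative sketch; the helper name job_output_files is hypothetical.

```sh
# Hypothetical helper: print the default stdout and stderr file names
# for a job, given the job name (by default the script name) and the job ID.
job_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

# The example from the text: script simple.sh, job ID 1.
job_output_files simple.sh 1
```

For the test job above this prints simple.sh.o1 and simple.sh.e1, one per line.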



Submitting Batch Jobs to the Cluster

To run a program on the cluster, you need to create a batch job and submit it to the cluster. A batch job is a shell script consisting of a sequence of command-line instructions that are assembled in a file. For instance, the following script first compiles the application flow from its C++ source and then runs the application.

Example: A Shell Script: flow.sh

#!/bin/csh
# This is a sample script file for compiling and
# running a sample C++ program under N1 Grid Engine 6
cd TEST
# Now we need to compile the program "flow.c++" and name the executable "flow".
g++ flow.c++ -o flow
# Next run the program (./ is needed if the current directory is not in your path)
./flow

To submit this batch job to the cluster, run the following command:
% qsub flow.sh

Since batch jobs do not have a terminal connection, their standard output and standard error output must be redirected into files. The standard location for the files is the current working directory where the jobs run. The default standard output file name is <job-name>.o<job-id>, and the default standard error output is redirected to <job-name>.e<job-id>. The job-name is built from the script file name, or it can be defined by the user; see, for example, the -N option in the submit(1) man page. The job-id is a unique identifier that the grid engine system assigns to the job.
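
Options such as -N can be given on the qsub command line or embedded in the job script as comment lines that begin with #$. The sketch below writes such a script from Bourne shell. The file name flow_named.sh and the job name "flow" are illustrative choices, and the script assumes the flow executable already exists in the directory you submit from.

```sh
# Write a variant of the job script whose "#$" lines carry qsub options.
# The file name flow_named.sh and the job name "flow" are illustrative.
cat > flow_named.sh <<'EOF'
#!/bin/csh
# Name the job "flow" instead of using the script file name.
#$ -N flow
# Run the job in the directory from which it was submitted.
#$ -cwd
./flow
EOF
```

Submitting this file with qsub flow_named.sh should be equivalent to qsub -N flow -cwd with the same script minus the #$ lines, and the output files would then be named flow.o<job-id> and flow.e<job-id>.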



Monitoring and Controlling Jobs

This section describes how to use the commands qstat, qdel, and qmod to monitor, delete, and modify jobs from the command line.

Monitoring Jobs With qstat

To monitor jobs, type one of the following commands, guided by information that is detailed in the following sections:

% qstat
% qstat -f
% qstat -ext
qstat with no options provides an overview of submitted jobs only. qstat -f includes information about the currently configured queues in addition. qstat -ext contains details such as up-to-date job usage and tickets assigned to a job.

In the first form, a header line indicates the meaning of the columns. The purpose of most of the columns should be self-explanatory. The state column, however, contains single character codes with the following meaning: r for running, s for suspended, q for queued, and w for waiting. See the qstat(1) man page for a detailed explanation of the qstat output format.
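
State codes can appear in combination, for example qw for a job that is queued and waiting. As an illustration only, the following Bourne shell sketch expands a state string using just the four codes listed above; any other letter is reported as unknown rather than guessed. The helper name describe_state is hypothetical.

```sh
# Hypothetical helper: expand a qstat state string such as "qw" into words,
# using only the four codes listed above; other letters print as "unknown(x)".
describe_state() {
    state=$1
    words=
    while [ -n "$state" ]; do
        rest=${state#?}          # everything after the first character
        code=${state%"$rest"}    # the first character
        case $code in
            r) word=running ;;
            s) word=suspended ;;
            q) word=queued ;;
            w) word=waiting ;;
            *) word="unknown($code)" ;;
        esac
        words="${words:+$words }$word"
        state=$rest
    done
    printf '%s\n' "$words"
}

describe_state qw
```

The qstat(1) man page lists further codes that this sketch deliberately does not cover.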

The second form is divided into two sections. The first section displays the status of all available queues. The second section, titled PENDING JOBS, shows the status of the sge_qmaster job spool area. The first line of the queue section defines the meaning of the columns with respect to the queues that are listed. The queues are separated by horizontal lines. If jobs run in a queue, they are printed below the associated queue in the same format as in the qstat command in its first form. The pending jobs in the second output section are also printed as in qstat's first form.

Controlling Jobs With qdel and qmod

To control jobs from the command line, type one of the following commands with the appropriate arguments.     

% qdel arguments
% qmod arguments
Use the qdel command to cancel jobs, regardless of whether the jobs are running or are spooled. Use the qmod command to suspend and resume (unsuspend) jobs already running.

For both commands, you need to know the job identification number, which is displayed in response to a successful qsub command. If you forget the number, you can retrieve it with qstat, as described in the previous section.
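
Rather than reading the number off the screen, a script can capture it from the qsub confirmation message. This is a minimal sketch in Bourne shell, assuming the message has the form shown earlier in this tutorial (with either a lowercase or capital first letter). The helper name job_id_from_qsub is hypothetical.

```sh
# Hypothetical helper: read qsub's confirmation message on standard input
# and print the numeric job ID. Prints nothing if the message does not match.
job_id_from_qsub() {
    sed -n 's/^[Yy]our job \([0-9][0-9]*\) .*/\1/p'
}

# Demonstration with the confirmation message from the test job.
echo 'your job 1 ("simple.sh") has been submitted' | job_id_from_qsub
```

In practice you might write job_id=$(qsub flow.sh | job_id_from_qsub) and later pass "$job_id" to qdel or qmod.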

Here are several examples of the qdel and qmod commands:

% qdel job-id
% qdel -f job-id1, job-id2
% qmod -s job-id
% qmod -us -f job-id1, job-id2
% qmod -s job-id.task-id-range
In order to delete, suspend, or resume a job, you must be the owner of the job or a grid engine manager or operator.

You can use the -f (force) option with both commands to register a job status change at sge_qmaster without contacting sge_execd. You might want to use the force option in cases where sge_execd is unreachable, for example, due to network problems. The -f option is intended for use only by the administrator. In the case of qdel, however, users can force deletion of their own jobs if the flag ENABLE_FORCED_QDEL in the cluster configuration qmaster_params entry is set. See the sge_conf(5) man page for more information.

Monitoring Jobs by Email

From the command line, type the following command with appropriate arguments.

% qsub arguments
The qsub -m command requests email to be sent to the user who submitted a job or to the email addresses specified by the -M flag if certain events occur. See the qsub(1) man page for a description of the flags. An argument to the -m option specifies the events. The following arguments are available:

Use a string made up of one or more of the letter arguments to specify several of these options with a single -m option. For example, -m be sends email at the beginning and at the end of a job.
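
The list of letter arguments does not appear in this copy of the tutorial. According to the qsub(1) man page they are b (beginning of job), e (end of job), a (job aborted or rescheduled), s (job suspended), and n (no mail). The following Bourne shell sketch checks an event string against those letters before building a command line. The helper name valid_mail_events is hypothetical, and user@example.com is a placeholder address.

```sh
# Hypothetical helper: succeed if $1 is non-empty and every character is
# one of the -m event letters documented in qsub(1): b, e, a, s, n.
valid_mail_events() {
    case $1 in
        ''|*[!beasn]*) return 1 ;;
        *) return 0 ;;
    esac
}

events=be
if valid_mail_events "$events"; then
    # Print the command rather than running it; user@example.com is a placeholder.
    echo "qsub -m $events -M user@example.com flow.sh"
fi
```

The sketch only prints the qsub command, so it is safe to try on a machine that is not a submit host.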



Frequently Used Commands

The grid engine system provides a set of ancillary programs (commands) for users to do the following tasks:

For grid users, frequently used commands include:

The following set of ancillary programs or commands may be useful for advanced users.

Problems? Contact the cluster administrator. Last updated by Kevin Liang, Jan. 13, 2005.